group_by/summarize: Not just for numeric values!

Even though I’ve been using the tidyverse for a couple of years, there’s always a couple new applications of tidyverse verbs.

This one, in retrospect, is pretty simple. I had a one to many table that I wanted to collapse, tidy-style. Let’s look at the diamonds dataset:

diamonds %>% select(color, cut)
## # A tibble: 53,940 x 2
##    color cut      
##    <ord> <ord>    
##  1 E     Ideal    
##  2 E     Premium  
##  3 E     Good     
##  4 I     Premium  
##  5 J     Good     
##  6 J     Very Good
##  7 I     Very Good
##  8 H     Very Good
##  9 E     Fair     
## 10 H     Very Good
## # ... with 53,930 more rows

What if we wanted to collapse all the entries for each color into a single line? There’s 7 different colors, so we can use a combination of group_by on color and use the paste() function within summarize() to get what we want, which I’ve called all_colors here. By specifying the collapse argument, we can specify the delimiter within that column:

diamonds %>% select(color, cut) %>% 
  group_by(color) %>% 
  summarize(all_colors=
              paste(cut, collapse=";"))
## # A tibble: 7 x 2
##   color all_colors                                                        
##   <ord> <chr>                                                             
## 1 D     Very Good;Very Good;Very Good;Good;Good;Premium;Premium;Ideal;Ide…
## 2 E     Ideal;Premium;Good;Fair;Premium;Premium;Very Good;Very Good;Very …
## 3 F     Premium;Very Good;Very Good;Very Good;Good;Premium;Very Good;Very…
## 4 G     Very Good;Ideal;Ideal;Very Good;Premium;Premium;Ideal;Very Good;I…
## 5 H     Very Good;Very Good;Very Good;Good;Good;Very Good;Good;Very Good;…
## 6 I     Premium;Very Good;Ideal;Good;Premium;Ideal;Ideal;Ideal;Ideal;Very…
## 7 J     Good;Very Good;Good;Ideal;Ideal;Good;Good;Very Good;Very Good;Ver…

Thanks to Ken Butler, who pointed out that the tidyverse way (via stringr) is to use str_c instead:

diamonds %>% select(color, cut) %>% 
  group_by(color) %>% 
  summarize(all_colors=
              stringr::str_c(cut, collapse=";"))
## # A tibble: 7 x 2
##   color all_colors                                                        
##   <ord> <chr>                                                             
## 1 D     Very Good;Very Good;Very Good;Good;Good;Premium;Premium;Ideal;Ide…
## 2 E     Ideal;Premium;Good;Fair;Premium;Premium;Very Good;Very Good;Very …
## 3 F     Premium;Very Good;Very Good;Very Good;Good;Premium;Very Good;Very…
## 4 G     Very Good;Ideal;Ideal;Very Good;Premium;Premium;Ideal;Very Good;I…
## 5 H     Very Good;Very Good;Very Good;Good;Good;Very Good;Good;Very Good;…
## 6 I     Premium;Very Good;Ideal;Good;Premium;Ideal;Ideal;Ideal;Ideal;Very…
## 7 J     Good;Very Good;Good;Ideal;Ideal;Good;Good;Very Good;Very Good;Ver…

Finally, if we wanted to just get the unique values of the cuts in a single line, we can use unique:

diamonds %>% select(color, cut) %>% 
  group_by(color) %>% 
  summarize(all_colors=
              paste(unique(cut), collapse=";"))
## # A tibble: 7 x 2
##   color all_colors                       
##   <ord> <chr>                            
## 1 D     Very Good;Good;Premium;Ideal;Fair
## 2 E     Ideal;Premium;Good;Fair;Very Good
## 3 F     Premium;Very Good;Good;Fair;Ideal
## 4 G     Very Good;Ideal;Premium;Good;Fair
## 5 H     Very Good;Good;Premium;Fair;Ideal
## 6 I     Premium;Very Good;Ideal;Good;Fair
## 7 J     Good;Very Good;Ideal;Premium;Fair