group_by/summarize: Not just for numeric values!

tidyverse

Learn some more about the many uses of group_by()/summarize().

~
2018-05-24

Even though I’ve been using the tidyverse for a couple of years, there’s always a couple new applications of tidyverse verbs.

This one, in retrospect, is pretty simple. I had a one to many table that I wanted to collapse, tidy-style. Let’s look at the diamonds dataset:

diamonds %>% select(color, cut) %>%
  head() %>%
  knitr::kable()
color cut
E Ideal
E Premium
E Good
I Premium
J Good
J Very Good

What if we wanted to collapse all the entries for each color into a single line? There’s 7 different colors, so we can use a combination of group_by on color and use the paste() function within summarize() to get what we want, which I’ve called all_colors here. By specifying the collapse argument, we can specify the delimiter within that column:

diamonds %>% select(color, cut) %>% 
  group_by(color) %>% 
  summarize(all_colors=
              paste(cut, collapse=";"))
# A tibble: 7 x 2
  color all_colors                                                    
* <ord> <chr>                                                         
1 D     Very Good;Very Good;Very Good;Good;Good;Premium;Premium;Ideal~
2 E     Ideal;Premium;Good;Fair;Premium;Premium;Very Good;Very Good;V~
3 F     Premium;Very Good;Very Good;Very Good;Good;Premium;Very Good;~
4 G     Very Good;Ideal;Ideal;Very Good;Premium;Premium;Ideal;Very Go~
5 H     Very Good;Very Good;Very Good;Good;Good;Very Good;Good;Very G~
6 I     Premium;Very Good;Ideal;Good;Premium;Ideal;Ideal;Ideal;Ideal;~
7 J     Good;Very Good;Good;Ideal;Ideal;Good;Good;Very Good;Very Good~

Thanks to Ken Butler, who pointed out that the tidyverse way (via stringr) is to use str_c instead:

diamonds %>% select(color, cut) %>% 
  group_by(color) %>% 
  summarize(all_colors=
              stringr::str_c(cut, collapse=";")) 
# A tibble: 7 x 2
  color all_colors                                                    
* <ord> <chr>                                                         
1 D     Very Good;Very Good;Very Good;Good;Good;Premium;Premium;Ideal~
2 E     Ideal;Premium;Good;Fair;Premium;Premium;Very Good;Very Good;V~
3 F     Premium;Very Good;Very Good;Very Good;Good;Premium;Very Good;~
4 G     Very Good;Ideal;Ideal;Very Good;Premium;Premium;Ideal;Very Go~
5 H     Very Good;Very Good;Very Good;Good;Good;Very Good;Good;Very G~
6 I     Premium;Very Good;Ideal;Good;Premium;Ideal;Ideal;Ideal;Ideal;~
7 J     Good;Very Good;Good;Ideal;Ideal;Good;Good;Very Good;Very Good~

Finally, if we wanted to just get the unique values of the cuts in a single line, we can use unique:

diamonds %>% select(color, cut) %>% 
  group_by(color) %>% 
  summarize(all_colors=
              paste(unique(cut), collapse=";")) 
# A tibble: 7 x 2
  color all_colors                       
* <ord> <chr>                            
1 D     Very Good;Good;Premium;Ideal;Fair
2 E     Ideal;Premium;Good;Fair;Very Good
3 F     Premium;Very Good;Good;Fair;Ideal
4 G     Very Good;Ideal;Premium;Good;Fair
5 H     Very Good;Good;Premium;Fair;Ideal
6 I     Premium;Very Good;Ideal;Good;Fair
7 J     Good;Very Good;Ideal;Premium;Fair