group_by/summarize: Not just for numeric values!

Learn some more about the many uses of group_by()/summarize().

tidyverse
Published

May 24, 2018

Even though I’ve been using the tidyverse for a couple of years, there’s always a couple new applications of tidyverse verbs.

This one, in retrospect, is pretty simple. I had a one to many table that I wanted to collapse, tidy-style. Let’s look at the diamonds dataset:

diamonds %>% select(color, cut) %>%
  head() %>%
  knitr::kable()
color cut
E Ideal
E Premium
E Good
I Premium
J Good
J Very Good

What if we wanted to collapse all the entries for each color into a single line? There’s 7 different colors, so we can use a combination of group_by on color and use the paste() function within summarize() to get what we want, which I’ve called all_colors here. By specifying the collapse argument, we can specify the delimiter within that column:

diamonds %>% select(color, cut) %>% 
  group_by(color) %>% 
  summarize(all_colors=
              paste(cut, collapse=";"))
# A tibble: 7 × 2
  color all_colors                                                              
  <ord> <chr>                                                                   
1 D     Very Good;Very Good;Very Good;Good;Good;Premium;Premium;Ideal;Ideal;Ver…
2 E     Ideal;Premium;Good;Fair;Premium;Premium;Very Good;Very Good;Very Good;G…
3 F     Premium;Very Good;Very Good;Very Good;Good;Premium;Very Good;Very Good;…
4 G     Very Good;Ideal;Ideal;Very Good;Premium;Premium;Ideal;Very Good;Ideal;P…
5 H     Very Good;Very Good;Very Good;Good;Good;Very Good;Good;Very Good;Very G…
6 I     Premium;Very Good;Ideal;Good;Premium;Ideal;Ideal;Ideal;Ideal;Very Good;…
7 J     Good;Very Good;Good;Ideal;Ideal;Good;Good;Very Good;Very Good;Very Good…

Thanks to Ken Butler, who pointed out that the tidyverse way (via stringr) is to use str_c instead:

diamonds %>% select(color, cut) %>% 
  group_by(color) %>% 
  summarize(all_colors=
              stringr::str_c(cut, collapse=";")) 
# A tibble: 7 × 2
  color all_colors                                                              
  <ord> <chr>                                                                   
1 D     Very Good;Very Good;Very Good;Good;Good;Premium;Premium;Ideal;Ideal;Ver…
2 E     Ideal;Premium;Good;Fair;Premium;Premium;Very Good;Very Good;Very Good;G…
3 F     Premium;Very Good;Very Good;Very Good;Good;Premium;Very Good;Very Good;…
4 G     Very Good;Ideal;Ideal;Very Good;Premium;Premium;Ideal;Very Good;Ideal;P…
5 H     Very Good;Very Good;Very Good;Good;Good;Very Good;Good;Very Good;Very G…
6 I     Premium;Very Good;Ideal;Good;Premium;Ideal;Ideal;Ideal;Ideal;Very Good;…
7 J     Good;Very Good;Good;Ideal;Ideal;Good;Good;Very Good;Very Good;Very Good…

Finally, if we wanted to just get the unique values of the cuts in a single line, we can use unique:

diamonds %>% select(color, cut) %>% 
  group_by(color) %>% 
  summarize(all_colors=
              paste(unique(cut), collapse=";")) 
# A tibble: 7 × 2
  color all_colors                       
  <ord> <chr>                            
1 D     Very Good;Good;Premium;Ideal;Fair
2 E     Ideal;Premium;Good;Fair;Very Good
3 F     Premium;Very Good;Good;Fair;Ideal
4 G     Very Good;Ideal;Premium;Good;Fair
5 H     Very Good;Good;Premium;Fair;Ideal
6 I     Premium;Very Good;Ideal;Good;Fair
7 J     Good;Very Good;Ideal;Premium;Fair

Citation

BibTeX citation:
@online{laderas2018,
  author = {Ted Laderas},
  title = {Group\_by/Summarize: {Not} Just for Numeric Values!},
  date = {2018-05-24},
  url = {https://laderast.github.io//articles/2018-05-24-group-by-summarize-not-just-for-numeric-values/2018-05-24-group-by-summarize-not-just-for-numeric-values.html},
  langid = {en}
}
For attribution, please cite this work as:
Ted Laderas. 2018. “Group_by/Summarize: Not Just for Numeric Values!” May 24, 2018. https://laderast.github.io//articles/2018-05-24-group-by-summarize-not-just-for-numeric-values/2018-05-24-group-by-summarize-not-just-for-numeric-values.html.