Underrated Tidyverse Functions

Learn about our assignment to teach the tidyverse to each other.
Author: Ted Laderas
Published: December 1, 2020

The Assignment

I’m teaching an R Programming course next term. Jessica Minnier and I are developing the Ready for R Materials into a longer and more involved course.

I think one of the most important things is to teach people how to teach themselves. Learning to program is a lifelong activity, so it’s critically important to give students these meta-learning skills. That’s the motivation behind the Tidyverse Function of the Week assignment.

I asked on Twitter:

Some of my favorite suggestions

Here are some of the highlights from the thread.

I loved all of these. Danielle Quinn wins the MVP award for naming so many useful functions:

fill() was highly suggested:
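
For readers who haven't met it, here is a minimal sketch of what tidyr's fill() does. The `scores` data frame is made up for illustration; it mimics a spreadsheet where a group label is only written on the first row of each block.

```r
library(tidyr)
library(tibble)

# Hypothetical data: the group label appears only on the first row of each block
scores <- tibble(
  group = c("a", NA, NA, "b", NA),
  score = c(1, 2, 3, 4, 5)
)

# fill() carries the last non-missing value downward by default
filled <- fill(scores, group)
filled$group
```

The `.direction` argument ("down", "up", "downup", "updown") controls which way values are carried.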

Many people suggested the window functions, including lead() and lag() and the cumulative functions:
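
As a rough illustration of the window functions, the sketch below uses a made-up `visits` table. lag() and lead() come from dplyr; cumsum() and cummax() are base R cumulative functions that work inside mutate(), and dplyr adds a few more such as cummean(), cumall() and cumany().

```r
library(dplyr)
library(tibble)

# Hypothetical daily counts
visits <- tibble(day = 1:5, n = c(3, 5, 2, 8, 4))

windowed <- visits %>%
  mutate(
    previous      = lag(n),      # value from the row before
    following     = lead(n),     # value from the row after
    change        = n - lag(n),  # day-over-day difference
    running_total = cumsum(n),   # cumulative sum so far
    running_max   = cummax(n)    # largest value seen so far
  )

windowed
```

The first row of `previous` and the last row of `following` are NA, since there is no neighbouring row to borrow from.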

Alison Hill suggested problems(), which helps you diagnose why your data isn’t loading:

I think that deframe() and enframe() are really exciting, since I do this operation all the time:
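
The operation in question is converting between a named vector and a two-column tibble. A small sketch, using an illustrative lookup vector that is not from the thread:

```r
library(tibble)

# A named vector: function name -> package (illustrative values)
pkg_lookup <- c(fill = "tidyr", lag = "dplyr", enframe = "tibble")

# enframe() turns the named vector into a two-column tibble
lookup_tbl <- enframe(pkg_lookup, name = "func", value = "package")

# deframe() goes the other way: the first column becomes the names,
# the second column becomes the values
round_trip <- deframe(lookup_tbl)
```

This is handy when a function returns a named vector but the next step in the pipeline expects a data frame, or the reverse.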

unite(), separate() and separate_rows() also had their own contingent:
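
Here is a brief sketch of how the three relate, using a made-up `people` table. Note that newer tidyr releases mark separate() and separate_rows() as superseded in favour of the separate_wider_*() and separate_longer_*() families, though the originals still work.

```r
library(tidyr)
library(tibble)

# Hypothetical data with a compound name and a delimited list
people <- tibble(
  name  = c("Ada Lovelace", "Grace Hopper"),
  langs = c("R, Python", "COBOL")
)

# separate() splits one column into several columns
split_names <- separate(people, name, into = c("first", "last"), sep = " ")

# unite() pastes columns back together into one
rejoined <- unite(split_names, "name", first, last, sep = " ")

# separate_rows() splits a delimited cell into one row per value
long <- separate_rows(people, langs, sep = ", ")
```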

Wow! Let’s Grab All the Tweets and Replies

I was bowled over by all of the replies. This turned out to be an unexpectedly fun thread, with lots of recommendations from others.

I thought I would try to summarize everyone’s suggestions and compile a list of recommended functions. I used this script, with some modifications, to pull all the replies to my tweet. In particular, I had to request extended tweet mode, and I extracted a few more fields from the returned JSON.

This wrote the tweet information into a CSV file.

Then I started parsing the data. I wrote a couple of functions: remove_users_from_text(), which removes user mentions from a tweet (by looking for words that begin with @), and get_funcs(), which uses a relatively simple regular expression to try to return the functions (it looks for words followed by paired parentheses () or words containing an underscore “_”). It actually works pretty well, and grabs most of the functions.

Then I use separate_rows() to split multiple functions into separate rows. This makes it easier to tally all the functions.

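library(tidyverse)

# Strip @mentions so user handles are not mistaken for function names
remove_users_from_text <- function(col){
  str_replace_all(col, "\\@\\w*", "")
}

# Extract anything that looks like a function: a word followed by (),
# or a word containing an underscore. Returns one comma-separated string.
get_funcs <- function(col){
  out <- str_extract_all(col, "\\w*\\(\\)|\\w*_\\w*")
  paste(out[[1]], collapse=", ")
}

# `tweets` is the data frame read from the CSV written by the scraping script
parsed_tweets <- tweets %>%
  rowwise() %>%   # get_funcs() handles one tweet at a time
  mutate(text = remove_users_from_text(text)) %>%
  mutate(funcs = get_funcs(text)) %>%
  ungroup() %>%
  separate_rows(funcs, sep=", ") %>%   # one row per extracted function
  select(date, user, funcs, text, reply, parent_thread) %>%
  distinct()

write_csv(parsed_tweets, file = "cleaned_tweets_incomplete.csv")

# Preview the first ten rows without the reply and parent_thread columns
rmarkdown::paged_table(parsed_tweets[1:10,-c(5:6)])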
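
To see what the two helpers catch and miss, here is a small self-contained sketch. It repeats the helper definitions so it runs on its own; the tweet texts and user handles are invented for illustration and are not from the real thread.

```r
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)

# Same helpers as in the post, repeated so this sketch is self-contained
remove_users_from_text <- function(col) {
  str_replace_all(col, "\\@\\w*", "")
}
get_funcs <- function(col) {
  out <- str_extract_all(col, "\\w*\\(\\)|\\w*_\\w*")
  paste(out[[1]], collapse = ", ")
}

# Invented example tweets
toy_tweets <- tibble(
  text = c(
    "@someone I love fill() and separate_rows()",
    "@a_user coalesce is great"
  )
)

toy_parsed <- toy_tweets %>%
  rowwise() %>%
  mutate(text = remove_users_from_text(text)) %>%
  mutate(funcs = get_funcs(text)) %>%
  ungroup() %>%
  separate_rows(funcs, sep = ", ")

# A bare function name with no () and no underscore is missed entirely
missed <- get_funcs("coalesce is great")

# Any word with an underscore is picked up, function or not
false_positive <- get_funcs("I cleaned my_data today")
```

Stripping mentions first matters because a handle like `@a_user` contains an underscore and would otherwise be reported as a function. The missed and false-positive cases show why a regular expression alone leaves some tweets needing manual review.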

At this point, I realized that I just needed to hand-annotate the rest of the tweets, rather than waste my time trying to parse the remaining cases. So I pulled everything into Excel and annotated the ones the regular expression couldn’t extract functions from.

Functions by frequency

Here are the function suggestions by frequency. Unsurprisingly, case_when() (which I cover in the main course) has the most suggestions, because it’s so useful. tidyr::pivot_wider() and tidyr::pivot_longer() are also covered in the course.

There are some others which were new to me, and a bit of a surprise, such as coalesce() and fill().
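
As a quick sketch of two of these, the example below combines coalesce() and case_when() on a made-up `readings` table with a primary and a backup measurement.

```r
library(dplyr)
library(tibble)

# Hypothetical readings with gaps in both columns
readings <- tibble(
  primary = c(10, NA, 30, NA),
  backup  = c(NA, 22, 33, NA)
)

labelled <- readings %>%
  mutate(
    # coalesce() takes the first non-missing value across its arguments
    value = coalesce(primary, backup),
    # case_when() checks conditions in order and uses the first that is TRUE
    level = case_when(
      is.na(value) ~ "missing",
      value < 25   ~ "low",
      TRUE         ~ "high"
    )
  )
```

The order of conditions in case_when() matters: the NA check comes first so that missing values are labelled before the numeric comparisons are evaluated.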

# Read the hand-annotated tweets and turn each user name into a
# markdown link pointing at their reply
cleaned_tweets <- read_csv("cleaned_tweets.csv") %>%
  select(-parent_thread) %>%
  mutate(user = paste0("[",user,"](",reply,")")) %>%
  select(-reply)
#> Rows: 266 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): date, user, funcs, text, reply, parent_thread
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Tally each suggested function, drop tweets with no function, sort by count
functions_by_freq <- cleaned_tweets %>%
  janitor::tabyl(funcs) %>%
  filter(!is.na(funcs)) %>%
  arrange(desc(n))

write_csv(functions_by_freq, "functions_by_frequency.csv")

functions_by_freq %>%
  rmarkdown::paged_table()
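
To show what the tallying step produces, here is a sketch on a tiny made-up vector of suggestions. It assumes the janitor package is installed, and the counts are illustrative, not the real results.

```r
library(dplyr)
library(janitor)
library(tibble)

# Invented suggestions, including one tweet with no function (NA)
toy <- tibble(
  funcs = c("case_when()", "fill()", "case_when()", NA,
            "coalesce()", "fill()", "case_when()")
)

toy_freq <- toy %>%
  tabyl(funcs) %>%            # one row per distinct value, with n and percent
  filter(!is.na(funcs)) %>%   # drop the NA row
  arrange(desc(n))            # most suggested first

toy_freq
```

If you only need the counts, `count(toy, funcs, sort = TRUE)` from dplyr gives a similar table without the percentage columns.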

Cleaned Tweets and Threads

Here are all of the tweets from this thread (naysayers included). They are roughly in order (longer threads are grouped together).

Here’s a link to the cleaned CSV file

rmarkdown::paged_table(cleaned_tweets)

Source Code and Data

Feel free to use and modify.

Thank You

This post is my thank-you to everyone who contributed to this thread. Thank you!

Citation

BibTeX citation:
@online{laderas2020,
  author = {Laderas, Ted},
  title = {Underrated {Tidyverse} {Functions}},
  date = {2020-12-01},
  url = {https://laderast.github.io/articles/tidyverse_functions/},
  langid = {en}
}
For attribution, please cite this work as:
Laderas, Ted. 2020. “Underrated Tidyverse Functions.” December 1, 2020. https://laderast.github.io/articles/tidyverse_functions/.