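library(tidyverse)

# Remove @user mentions from the tweet text
remove_users_from_text <- function(col){
  str_replace_all(col, "\\@\\w*", "")
}

# Extract anything that looks like a function: a word followed by "()"
# or a word containing an underscore. Returns one comma-separated string.
get_funcs <- function(col){
  out <- str_extract_all(col, "\\w*\\(\\)|\\w*_\\w*")
  paste(out[[1]], collapse=", ")
}

# `tweets` is the data frame read from the CSV written by the scraper
parsed_tweets <- tweets %>%
  rowwise() %>%
  mutate(text = remove_users_from_text(text)) %>%
  mutate(funcs = get_funcs(text)) %>%
  ungroup() %>%
  separate_rows(funcs, sep=", ") %>%
  select(date, user, funcs, text, reply, parent_thread) %>%
  distinct()

write_csv(parsed_tweets, file = "cleaned_tweets_incomplete.csv")

rmarkdown::paged_table(parsed_tweets[1:10,-c(5:6)])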
The Assignment
I’m teaching an R Programming course next term. Jessica Minnier and I are developing the Ready for R Materials into a longer and more involved course.
I think one of the most important things is to teach people how to self-learn. Because learning to program is a lifelong activity, it’s critically important to give students these meta-learning skills. That’s the motivation behind the Tidyverse Function of the Week assignment.
I asked on Twitter:
Hi Everyone. I’m teaching an #rstats course next quarter.
One assignment is to have each student write about a #tidyverse function. What it’s for and an example.
What are some less known #tidyverse functions that do a job you find useful?— Ted Laderas, PhD 🏳️🌈 (@tladeras) November 30, 2020
Some of my favorite suggestions
Here are some of the highlights from the thread.
I loved all of these. Danielle Quinn wins the MVP award for naming so many useful functions:
dplyr::uncount()
tidyr::complete()
tidyr::fill() / replace_na()
stringr::str_detect() / str_which()
lubridate::ymd_hms() and related functions
ggplot2::labs() - so simple, yet under appreciated!— Danielle Quinn (she/her) (@daniellequinn88) December 1, 2020
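For readers who haven't met the first two functions on that list, here is a minimal sketch using made-up data (the `counts` and `obs` tibbles are hypothetical). Note that `uncount()` is exported by tidyr, not dplyr as the tweet has it.

```r
library(tibble)
library(tidyr)

# Hypothetical summary table: one row per species with a count
counts <- tibble(species = c("cod", "haddock"), n = c(2, 1))

# uncount() repeats each row according to the weights column (and drops it)
expanded <- uncount(counts, n)

# Hypothetical observations where the B/2020 combination is missing
obs <- tibble(
  site  = c("A", "A", "B"),
  year  = c(2019, 2020, 2019),
  catch = c(5, 3, 4)
)

# complete() adds rows for every site/year combination, filling catch with NA
completed <- complete(obs, site, year)
```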
fill() was suggested by many people:
tidyr::fill() - extremely useful when creating a usable dataset out of a spreadsheet originally built for data entry, in which redundant informations are only reported once at the beginning of the group they refer to, rather than in every row as needed for the analysis.
— Luca Foppoli (@foppoli_luca) December 1, 2020
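The data-entry situation described in that tweet might look something like the sketch below. The `entry_sheet` tibble is invented for illustration; only the `fill()` call itself reflects the suggestion.

```r
library(tibble)
library(tidyr)

# Hypothetical data-entry sheet: the site is recorded only on the
# first row of each group
entry_sheet <- tribble(
  ~site, ~sample, ~count,
  "A",   1,       10,
  NA,    2,       12,
  "B",   1,       7,
  NA,    2,       9
)

# fill() carries the last non-missing value downward by default
filled <- fill(entry_sheet, site)
```

After this, every row has its site filled in, so the table is usable for grouping and summarizing.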
Many people suggested the window functions, including lead() and lag(), and the cumulative functions:
Check out the dplyr window functions, cummin, cummax, cumany and cumall. They don’t seen useful at first but they can solve really tricky aggregation problems. https://t.co/aDpXqSB2Vx
— Robert Kubinec (@rmkubinec) December 1, 2020
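As a rough illustration of what these do, here is a small sketch on a made-up `readings` tibble. `cummin()` and `cummax()` come from base R, while `lead()`, `lag()`, `cumany()` and `cumall()` come from dplyr.

```r
library(dplyr)

# Hypothetical daily readings
readings <- tibble(day = 1:5, value = c(3, 5, 2, 8, 6))

windowed <- readings %>%
  mutate(
    previous    = lag(value),        # value from the prior row (NA for the first row)
    next_value  = lead(value),       # value from the following row (NA for the last row)
    running_max = cummax(value),     # largest value seen so far
    seen_over_5 = cumany(value > 5)  # TRUE from the first row where value > 5 onward
  )
```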
Alison Hill suggested problems(), which helps you diagnose why your data isn’t loading:
Ooh problems is a good function for importing rx https://t.co/P4ZR57PgOG
— Alison Presmanes Hill (@apreshill) December 1, 2020
I think that deframe() and enframe() are really exciting, since I do this operation all the time:
tibble::deframe(), tibble::deframe()
coercing a two-column df to named vector, which I prefer immensely to names(df) <- vec_of_names— E. David Aja (@PeeltothePithy) December 1, 2020
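A quick sketch of that round trip, using a hypothetical `lookup` table:

```r
library(tibble)

# Hypothetical two-column lookup table
lookup <- tibble(
  code  = c("a", "b", "c"),
  label = c("Apple", "Banana", "Cherry")
)

# deframe(): names come from the first column, values from the second
named_vec <- deframe(lookup)

# Look up a single value by name
banana <- named_vec[["b"]]

# enframe() goes the other way: named vector back to a two-column tibble
round_trip <- enframe(named_vec, name = "code", value = "label")
```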
unite(), separate(), and separate_rows() also had their own contingent:
I find myself using tidyr::unite() a lot to clean messy data - particularly useful for making unique and informative ID’s for each row. coalesce() and fill() are also little known gems! :)
— Guy Sutton🐝🌾🇿🇦🇿🇼 (@Guy_F_Sutton) December 1, 2020
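Here is a small sketch of the two ideas in that tweet: building a row ID with `unite()` and picking the first non-missing value with `coalesce()`. The `samples` tibble and its column names are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical samples with a measured value that is sometimes missing
samples <- tibble(
  site      = c("A", "A", "B"),
  year      = c(2019, 2020, 2020),
  measured  = c(1.5, NA, 3.2),
  estimated = c(1.4, 2.0, 3.0)
)

tidied <- samples %>%
  # Paste site and year into one ID column, keeping the originals
  unite("row_id", site, year, sep = "_", remove = FALSE) %>%
  # Use the measured value when present, otherwise fall back to the estimate
  mutate(best_value = coalesce(measured, estimated))
```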
Wow! Let’s Grab All the Tweets and Replies
I was bowled over by all of the replies. This was an unexpectedly fun thread, with lots of recommendations from others.
I thought I would try to summarize everyone’s suggestions and compile a list of recommended functions. I used this script with some modifications to pull all the replies to my tweet. In particular, I had to request extended tweet mode, and I extracted a few more fields from the returned JSON.
This wrote the tweet information into a CSV file.
Then I started parsing the data. I wrote a couple of functions: remove_users_from_text(), which removes user handles from a tweet (by looking for words that begin with @), and get_funcs(), which uses a relatively simple regular expression to try to return the functions (it looks for paired parentheses () or an underscore _ to extract them). It actually works pretty well and grabs most of the functions.
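To see what those two regular expressions do, here is a sketch that applies them to a made-up tweet (the `tweet` text is hypothetical; the patterns are the ones used in the post's functions).

```r
library(stringr)

# Hypothetical reply text
tweet <- "@tladeras I like coalesce() and separate_rows a lot"

# Strip @mentions: "@" followed by any run of word characters
no_users <- str_replace_all(tweet, "\\@\\w*", "")

# Match either word() or a word containing an underscore
pattern <- "\\w*\\(\\)|\\w*_\\w*"
matches <- str_extract_all(no_users, pattern)[[1]]

# Collapse to one comma-separated string, as get_funcs() does
collapsed <- paste(matches, collapse = ", ")

# A function written with neither parentheses nor an underscore is missed
missed <- str_extract_all("try fill or across", pattern)[[1]]
```

The last line shows why some tweets still need hand annotation: a bare name like fill has nothing for the pattern to latch onto.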
Then I use separate_rows() to split the multiple functions into their own rows. This makes it easier to tally all the functions.
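A minimal sketch of why that helps, with a hypothetical `parsed` tibble standing in for the real tweets and `count()` used here as one simple way to tally:

```r
library(dplyr)
library(tidyr)

# Hypothetical parsed tweets: one row per tweet, functions comma-separated
parsed <- tibble(
  user  = c("user_a", "user_b"),
  funcs = c("coalesce(), fill()", "fill()")
)

tally <- parsed %>%
  # One row per (tweet, function) pair
  separate_rows(funcs, sep = ", ") %>%
  # Count how often each function was suggested, most frequent first
  count(funcs, sort = TRUE)
```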
At this point, I realized that I just needed to hand-annotate the rest of the tweets rather than waste time trying to parse the remaining cases. So I pulled everything into Excel and annotated the tweets that the regular expression couldn’t extract functions from.
Functions by frequency
Here are the function suggestions by frequency. Unsurprisingly, case_when() (which I cover in the main course) has the most suggestions, because it’s so useful. tidyr::pivot_wider() and tidyr::pivot_longer() are also covered in the course.
There are some others that were new to me, and a bit of a surprise, such as coalesce() and fill().
cleaned_tweets <- read_csv("cleaned_tweets.csv") %>%
  select(-parent_thread) %>%
  mutate(user = paste0("[",user,"](",reply,")")) %>%
  select(-reply)
Rows: 266 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): date, user, funcs, text, reply, parent_thread
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
functions_by_freq <- cleaned_tweets %>%
  janitor::tabyl(funcs) %>%
  filter(!is.na(funcs)) %>%
  arrange(desc(n))

write_csv(functions_by_freq, "functions_by_frequency.csv")

functions_by_freq %>%
  rmarkdown::paged_table()
Cleaned Tweets and Threads
Here are all of the tweets from this thread (naysayers included). They are roughly in order (longer threads are grouped together).
Here’s a link to the cleaned CSV file
rmarkdown::paged_table(cleaned_tweets)
Source Code and Data
Feel free to use and modify.
- RMarkdown file used to generate this post
- Python Twitter Scraper (by Giovanni Mellini) - I used this because there wasn’t a ready-made recipe in rtweet to extract replies. You have to use recursion to extract all of the thread replies that belong to a tweet, and this was easily modifiable.
- Cleaned Tweets File (CSV)
Thank You
This post is my thank you to everyone who contributed to this thread. Thank you!
Citation
@online{laderas2020,
author = {Laderas, Ted},
title = {Underrated {Tidyverse} {Functions}},
date = {2020-12-01},
url = {https://laderast.github.io/articles/tidyverse_functions/},
langid = {en}
}