This site hosts the materials for the clinical data wrangling workshop. It is a 10-hour workshop (spread over 4 days) in which students work with a real research dataset, the Sleep Heart Health Study (SHHS) data.
We developed this workshop as part of a National Library of Medicine (NLM) T15 training supplement in Data Science. The following is a short report describing the workshop and its outcomes.
We designed the workshop for our incoming informatics students (clinical and biology majors) to introduce them to the difficulties of working with clinical data. We anticipate that, with a little adaptation, it should be accessible to other audiences, such as medical students and clinicians who want to understand the nature of clinical data.
We have included our Code of Conduct for the workshop. We believe that it helps make the workshop a more inclusive environment and encourages group learning among participants.
We used the Sleep Heart Health Study dataset from the National Sleep Research Resource (NSRR). The dataset covers approximately 5800 patients and over 3000 covariates. We limited our students to a smaller set of 17 covariates, including our outcome of interest, cardiovascular disease.
Please note that the dataset is not currently available in the lesson repository. A Data Access and Use Agreement (DAUA, see below) needs to be filled out for each student who wishes to access the SHHS dataset.
Additionally, students should run the following commands in their console to install needed packages:
```r
install.packages("remotes")
# install the data explorer
remotes::install_github("laderast/burro")
# install the caret package
install.packages(pkgs = "caret", dependencies = c("Depends", "Imports"))
```
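After installation, a quick check along the following lines can confirm that the packages are available before the first session. This is a minimal sketch using only base R. The `needed` vector and the `check_packages` helper are illustrative names and are not part of the workshop materials.

```r
# Packages that the install commands above are meant to provide
needed <- c("remotes", "burro", "caret")

# Hypothetical helper: returns TRUE for each package whose namespace can be loaded
check_packages <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

status <- check_packages(needed)
print(status)

# Report anything still missing so it can be reinstalled
missing_pkgs <- names(status)[!status]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

Any package reported as missing can be reinstalled with the corresponding command before the workshop begins.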
- Students must fill out a Data Access and Use Agreement for NSRR.
- Students must have training covering the basics of PHI and HIPAA (required by NSRR for their Data Access and Use Agreement).
- Students should clone or download the repo.
We designed the workshop as a mix of didactic lectures and active learning exercises. Where possible, we had students work in groups to answer questions about the data. These activities included a data scavenger hunt using our exploratory data analysis (EDA) app and a logistic modeling exercise.
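For readers unfamiliar with logistic modeling, the kind of exercise described above can be sketched as follows. This is not the workshop's notebook: it uses base R's `glm()` rather than any particular modeling package, the data are simulated rather than drawn from SHHS, and every column name (`age`, `bmi`, `smoker`, `cvd`) is a hypothetical stand-in.

```r
# Simulated stand-in data; NOT the SHHS dataset. All column names are hypothetical.
set.seed(42)
n <- 500
toy <- data.frame(
  age = round(runif(n, 40, 85)),
  bmi = rnorm(n, mean = 28, sd = 5),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Simulate a binary outcome whose log-odds rise with age and smoking
log_odds <- -6 + 0.07 * toy$age + 0.8 * (toy$smoker == "yes")
toy$cvd <- rbinom(n, size = 1, prob = plogis(log_odds))

# Fit a logistic regression of the outcome on three covariates
fit <- glm(cvd ~ age + bmi + smoker, data = toy, family = binomial)

# Odds ratios are the exponentiated model coefficients
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))

# Predicted probability of the outcome for one hypothetical participant
new_person <- data.frame(
  age = 65,
  bmi = 30,
  smoker = factor("yes", levels = levels(toy$smoker))
)
pred <- predict(fit, newdata = new_person, type = "response")
print(pred)
```

Choosing which covariates enter the formula, and how they are coded, is the kind of decision the groups discuss and present later in the workshop.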
Session | Lecture/Activity | Format | Duration |
---|---|---|---|
0 | Introduction, Logistics, Groups assigned | NA | 30 min |
1a | Biology of Sleep and Cardiovascular Disease | Lecture with questions | 30 min |
Break | Breaktime | NA | 15 min |
1b | The Value of Clinical Data | Lecture | 15 min |
2a | Exploring the Sleep Heart Health Study Dataset | Data Scavenger Hunt | 90 min |
Break | Lunch Break (with optional R install session) | NA | 60 min |
2b | Clinical Data Quality Issues/Applying the Clinical Wrangling Process | Lecture | 45 min |
3b | Logistic Regression Model Basics | R Notebook | 90 min |
Session | Lecture/Activity | Format | Duration |
---|---|---|---|
4a | Question/Answer session about Logistic Regression Notebook | Q&A | 50 min |
4b | Assignment about race variable (assigned to groups) | Homework | 10 min |
Session | Lecture/Activity | Format | Duration |
---|---|---|---|
5a | Discussion about race as a covariate, sharing of findings | Discussion | 30 min |
5b | Overview of hypertension and how it relates to Sleep Apnea/Cardiovascular Disease | Lecture/Discussion | 30 min |
5c | Work on Final Report | In-class Lab time | 60 min |
Session | Lecture/Activity | Format | Duration |
---|---|---|---|
6a | Group presentations about covariate decisions and resulting model | R Notebook | 60 min |
6b | Final Discussion and Wrap up | Discussion | 30 min |
We are grateful for the incoming informatics students’ enthusiasm and patience. We also thank the NLM T15 Supplement in Data Science, without which we would not have had the opportunity to conceptualize, put together, and deliver this workshop. Thanks as well to Susan Redline and the National Sleep Research Resource group, especially Dan Mobley, who helped us with the last-minute data use agreements.
This lesson material is shared under a Creative Commons Non-Commercial BY 3.0 license. All code is shared under an Apache 2.0 License.