Clinical Data Wrangling Workshop

This site hosts the materials for the Clinical Data Wrangling Workshop, a 10-hour workshop (spread over 4 days) in which students worked with a real research dataset, the Sleep Heart Health Study (SHHS) data.

We developed this workshop as part of a National Library of Medicine (NLM) T15 training supplement in Data Science. The following is a short report describing the workshop and its outcomes.

Intended Audience

We designed the workshop for our incoming informatics students (clinical and biology majors) to introduce them to the difficulties of working with clinical data. We anticipate that, with a little adaptation, it would be accessible to other audiences, such as medical students and clinicians who want to understand the nature of clinical data.

Code of Conduct

We have included our Code of Conduct for the workshop. We believe that it helps make the workshop a more inclusive environment and encourages group learning among participants.

Learning Objectives

1. Understand biological and clinical concepts that are relevant to sleep apnea and cardiovascular disease.
2. Explore the Sleep Heart Health Study (SHHS) dataset in light of these concepts as teams.
3. Evaluate fitness for use of covariates in the model based on the clinical data wrangling framework in order to select appropriate covariates.
4. Build and evaluate a predictive model based on the decisions made in objective 3.
5. Communicate and compare model results with other teams.

The Dataset

We used the Sleep Heart Health Study dataset from the National Sleep Research Resource (NSRR). This dataset covers approximately 5800 patients and has over 3000 covariates. We limited our students to a smaller set of 17 covariates, including our outcome of interest, cardiovascular disease.
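Restricting a wide dataset to a short list of covariates, and taking a first look at their fitness for use, could be sketched in R roughly as follows. This is only an illustration: the data frame is simulated, the column names (`age`, `bmi`, `ahi`, `cvd`, `extra_1`, `extra_2`) are hypothetical placeholders rather than actual SHHS variable names, and the proportion of missing values is just one possible quality check.

```r
# Simulated stand-in for a wide clinical dataset; NOT real SHHS data.
# All column names here are hypothetical placeholders.
set.seed(42)
n <- 200
wide_data <- data.frame(
  id      = seq_len(n),
  age     = round(rnorm(n, mean = 60, sd = 10)),
  bmi     = rnorm(n, mean = 28, sd = 5),
  ahi     = rexp(n, rate = 0.1),
  extra_1 = rnorm(n),
  extra_2 = rnorm(n),
  cvd     = rbinom(n, size = 1, prob = 0.3)
)

# Blank out 20 distinct rows of one column so the summary has something to show.
wide_data$bmi[sample(n, 20)] <- NA

# Keep only the covariates of interest plus the outcome.
keep <- c("age", "bmi", "ahi", "cvd")
small_data <- wide_data[, keep]

# Proportion of missing values per retained column.
missing_prop <- colMeans(is.na(small_data))
print(missing_prop)
```

A column with a large share of missing values might be a candidate for exclusion or closer inspection before modeling.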

Please note that the dataset is not currently available in the lesson repository. A Data Access and Use Agreement (DAUA, see below) needs to be filled out for each student who wishes to access the SHHS dataset.

Requirements

Students must have R/RStudio installed on their computers (see installation instructions).

Additionally, students should run the following commands in their console to install needed packages:

# install remotes, which is used to install packages from GitHub
install.packages("remotes")
# install the burro data explorer
remotes::install_github("laderast/burro")
# install the caret package
install.packages(pkgs = "caret", dependencies = c("Depends", "Imports"))

Students must fill out a Data Access and Use Agreement for NSRR
Students must have training covering basics of PHI and HIPAA (required by NSRR for their Data Access and Use Agreement)
Students should clone or download the repo
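
One way to confirm that the package installation worked is a quick check like the sketch below. It assumes only the three package names from the install commands above and uses base R's `requireNamespace()`; the variable names are illustrative.

```r
# Packages the workshop asks students to install.
needed <- c("remotes", "burro", "caret")

# requireNamespace() returns TRUE if a package can be loaded,
# without attaching it to the search path.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Report anything that still needs to be installed.
missing_pkgs <- needed[!installed]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All workshop packages are installed.")
}
```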

Workshop Format

We designed the workshop to be a mix of didactic lectures and active learning exercises. Where possible, we had students work in groups to answer questions about the data. These activities included a data scavenger hunt using our exploratory data analysis (EDA) app, and a logistic modeling exercise.
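
For readers unfamiliar with what a logistic modeling exercise involves, here is a minimal sketch. It is not the workshop's notebook: the data are simulated, the variable names (`age`, `bmi`, `cvd`) and coefficients are hypothetical, and it uses base R's `glm()` rather than caret to stay dependency-free.

```r
# Simulated data; NOT the SHHS dataset. Variable names are hypothetical.
set.seed(1)
n <- 500
age <- rnorm(n, mean = 60, sd = 10)
bmi <- rnorm(n, mean = 28, sd = 5)

# Simulate a binary outcome whose log-odds rise with age and BMI.
log_odds <- -8 + 0.08 * age + 0.1 * bmi
cvd <- rbinom(n, size = 1, prob = plogis(log_odds))
sim <- data.frame(age = age, bmi = bmi, cvd = cvd)

# Hold out 30% of the rows for evaluation.
test_idx  <- sample(n, size = round(0.3 * n))
train_set <- sim[-test_idx, ]
test_set  <- sim[test_idx, ]

# Fit a logistic regression on the training rows.
fit <- glm(cvd ~ age + bmi, data = train_set, family = binomial())

# Predicted probabilities on the held-out rows, thresholded at 0.5.
pred_prob  <- predict(fit, newdata = test_set, type = "response")
pred_class <- as.integer(pred_prob > 0.5)

# Confusion table and accuracy on the held-out rows.
conf_table <- table(predicted = pred_class, observed = test_set$cvd)
accuracy <- mean(pred_class == test_set$cvd)
print(conf_table)
print(accuracy)
```

Groups that choose different covariates can compare the resulting confusion tables and accuracies on the same held-out rows.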

Day 1

Session	Lecture/Activity	Format	Duration
0	Introduction, Logistics, Groups assigned	NA	30 min
1a	Biology of Sleep and Cardiovascular Disease	Lecture with questions	30 min
Break	Breaktime	NA	15 min
1b	The Value of Clinical Data	Lecture	15 min
2a	Exploring the Sleep Heart Health Study Dataset	Data Scavenger Hunt	90 min
Break	Lunch Break (with optional R install session)	NA	60 min
2b	Clinical Data Quality Issues/Applying the Clinical Wrangling Process	Lecture	45 min
3b	Logistic Regression Model Basics	R Notebook	90 min

Day 2

Session	Lecture/Activity	Format	Duration
4a	Question/Answer session about Logistic Regression Notebook	Q&A	50 min
4b	Assignment about race variable (assigned to groups)	Homework	10 min

Day 3

Session	Lecture/Activity	Format	Duration
5a	Discussion about race as a covariate, sharing of findings	Discussion	30 min
5b	Overview of hypertension and how it relates to Sleep Apnea/Cardiovascular Disease	Lecture/Discussion	30 min
5c	Work on Final Report	In-class Lab time	60 min

Day 4

Session	Lecture/Activity	Format	Duration
6a	Group presentations about covariate decisions and resulting model	R Notebook	60 min
6b	Final Discussion and Wrap up	Discussion	30 min

Acknowledgements

We are grateful for the incoming informatics students’ enthusiasm and patience. We also thank the NLM T15 Supplement in Data Science, without which we would not have had the opportunity to conceptualize, put together, and deliver this workshop. Thanks also to Susan Redline and the National Sleep Research Resource group, especially Dan Mobley, who helped us with the last-minute data use agreements.

Licensing

This lesson material is shared under a Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) license. All code is shared under an Apache 2.0 License.