This site hosts the materials for the clinical data wrangling workshop, a 10-hour workshop (spread over 4 days) in which students worked with a real research dataset (the Sleep Heart Health Study data).

We developed this workshop as part of a National Library of Medicine T15 training supplement in Data Science. The following is a short report describing the workshop and its outcomes.

Intended Audience

We designed the workshop for our incoming informatics students (clinical and biology majors) to introduce them to the difficulties of working with clinical data. We anticipate that, with a little adaptation, it should be accessible to other audiences, such as medical students and clinicians who want to understand the nature of clinical data.

Code of Conduct

We have included our Code of Conduct for the workshop. We believe that it helps make the workshop a more inclusive environment and encourages group learning among participants.

Learning Objectives

  1. Understand biological and clinical concepts that are relevant to sleep apnea and cardiovascular disease.
  2. Explore the Sleep Heart Health Study (SHHS) dataset in light of these concepts as teams.
  3. Evaluate fitness for use of covariates in the model based on the clinical data wrangling framework in order to select appropriate covariates.
  4. Build and evaluate a predictive model based on the decisions made in objective 3.
  5. Communicate and compare model results with other teams.

The Dataset

We used the Sleep Heart Health Study dataset from the National Sleep Research Resource (NSRR). This dataset covers approximately 5,800 patients and has over 3,000 covariates. We limited our students to a much smaller set of 17 covariates, including our outcome of interest, cardiovascular disease.

Please note that the dataset is not currently available in the lesson repository. A Data Access and Use Agreement (DAUA, see below) needs to be filled out for each student who wishes to access the SHHS dataset.
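
Because the SHHS data cannot be included here, the following is only a minimal sketch, on a made-up data frame, of the kind of first-pass check a team might run on a reduced set of covariates: keep a chosen subset of columns and tabulate how much of each is missing. The column names and values are hypothetical and are not SHHS variables, and this is not presented as the workshop's own procedure.

```r
# Toy data frame standing in for a wide clinical dataset.
# All column names and values are hypothetical, not taken from SHHS.
toy <- data.frame(
  id     = 1:6,
  age    = c(54, 61, NA, 70, 66, 58),
  bmi    = c(27.1, NA, NA, 31.4, 29.0, 24.8),
  smoker = c("no", "yes", "no", NA, "no", "yes"),
  unused = c(1, 2, 3, 4, 5, 6),
  cvd    = c(0, 1, 0, 1, 0, 0)
)

# Keep only the covariates under consideration plus the outcome.
keep <- c("age", "bmi", "smoker", "cvd")
subset_df <- toy[, keep]

# Proportion of missing values per column, highest first.
missing_prop <- sort(colMeans(is.na(subset_df)), decreasing = TRUE)
print(round(missing_prop, 2))

# Number of rows with no missing value in any kept column.
n_complete <- sum(complete.cases(subset_df))
print(n_complete)
```

Missingness is only one aspect of whether a covariate is usable; how a variable was measured and coded matters as well, and that requires reading the study documentation rather than running code.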

Requirements

  1. Students must have R/RStudio installed on their computers (see installation instructions).

Additionally, students should run the following commands in their console to install needed packages:

install.packages("remotes")
# install the data explorer
remotes::install_github("laderast/burro")
# install the caret package
install.packages(pkgs = "caret", dependencies = c("Depends", "Imports"))

  2. Students must fill out a Data Access and Use Agreement for NSRR.

  3. Students must have training covering the basics of PHI and HIPAA (required by NSRR for their Data Access and Use Agreement).

  4. Students should clone or download the repo.
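
To confirm that the installation worked before the first session, students could run a check along these lines. This is a minimal sketch that assumes only the three packages named in the commands above are needed.

```r
# Packages named in the installation commands above.
required <- c("remotes", "burro", "caret")

# requireNamespace() returns TRUE when a package can be loaded,
# without attaching it to the search path.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs to be installed.
missing_pkgs <- required[!installed]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All workshop packages are installed.")
}
```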

Workshop Format

We designed the workshop to be a mix of didactic lectures and active learning exercises. Where possible, we had students work in groups to answer questions about the data. These activities included a data scavenger hunt using our exploratory data analysis (EDA) app and a logistic regression modeling exercise.
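
For readers unfamiliar with the modeling exercise, the following is a minimal sketch of fitting and evaluating a logistic regression. It uses base R `glm()` on simulated data rather than the workshop notebook or the SHHS data; the variable names (`age`, `ahi`, `cvd`), the coefficients, and the 0.5 threshold are all assumptions made for illustration.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-in data; these are NOT SHHS variables or values.
n <- 500
sim <- data.frame(
  age = round(rnorm(n, mean = 60, sd = 10)),
  ahi = rexp(n, rate = 1 / 10)  # hypothetical apnea-hypopnea index
)
# Outcome probability rises with both covariates (arbitrary coefficients).
p <- plogis(-6 + 0.07 * sim$age + 0.03 * sim$ahi)
sim$cvd <- rbinom(n, size = 1, prob = p)

# Hold out 30% of rows for evaluation.
test_idx <- sample(n, size = round(0.3 * n))
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

# Logistic regression: binomial family with the default logit link.
fit <- glm(cvd ~ age + ahi, data = train, family = binomial)

# Predicted probabilities on the held-out rows, thresholded at 0.5.
pred_prob  <- predict(fit, newdata = test, type = "response")
pred_class <- as.integer(pred_prob >= 0.5)

# Confusion table and accuracy on the held-out rows.
print(table(predicted = pred_class, observed = test$cvd))
accuracy <- mean(pred_class == test$cvd)
print(accuracy)
```

With an imbalanced outcome, accuracy at a fixed threshold can look good even when the model rarely predicts the positive class, so the confusion table is worth reading alongside it.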

Day 1

Session | Lecture/Activity | Format | Duration
0 | Introduction, Logistics, Groups assigned | NA | 30 min
1a | Biology of Sleep and Cardiovascular Disease | Lecture with questions | 30 min
Break | Breaktime | NA | 15 min
1b | The Value of Clinical Data | Lecture | 15 min
2a | Exploring the Sleep Heart Health Study Dataset | Data Scavenger Hunt | 90 min
Break | Lunch Break (with optional R install session) | NA | 60 min
2b | Clinical Data Quality Issues/Applying the Clinical Wrangling Process | Lecture | 45 min
3b | Logistic Regression Model Basics | R Notebook | 90 min

Day 2

Session | Lecture/Activity | Format | Duration
4a | Question/Answer session about Logistic Regression Notebook | Q&A | 50 min
4b | Assignment about race variable (assigned to groups) | Homework | 10 min

Day 3

Session | Lecture/Activity | Format | Duration
5a | Discussion about race as a covariate, sharing of findings | Discussion | 30 min
5b | Overview of hypertension and how it relates to Sleep Apnea/Cardiovascular Disease | Lecture/Discussion | 30 min
5c | Work on Final Report | In-class Lab time | 60 min

Day 4

Session | Lecture/Activity | Format | Duration
6a | Group presentations about covariate decisions and resulting model | R Notebook | 60 min
6b | Final Discussion and Wrap up | Discussion | 30 min

Acknowledgements

We are grateful for the incoming informatics students’ enthusiasm and patience. We also thank the NLM T15 Supplement in Data Science, without which we would not have had the opportunity to conceptualize, put together, and deliver this workshop. Thanks as well to Susan Redline and the National Sleep Research Resource group, especially Dan Mobley, who helped us with the last-minute data use agreements.

Licensing

This lesson material is shared under a Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) license. All code is shared under an Apache 2.0 License.