This site hosts the materials for the clinical data wrangling workshop held from September 21 to September 26, 2018. It was a 12-hour workshop (spread over 4 days) in which students worked with a real research dataset (the Sleep Heart Health Study data). We developed the workshop as part of a National Library of Medicine T15 training supplement in Data Science. The following is a short report describing the workshop and its outcomes.

Intended Audience

We designed the workshop for our incoming informatics students (clinical and biology majors) to introduce them to the difficulties of working with clinical data. We anticipate that, with a little adaptation, it should be accessible to other audiences, such as medical students and clinicians who want to understand the nature of clinical data.

Code of Conduct

We have included our Code of Conduct for the workshop. We believe that it helps make the workshop a more inclusive environment and encourages group learning among participants.

Learning Objectives

  1. Understand the biology of sleep and sleep apnea and how the biology informs the covariates measured in the Sleep Heart Health Study.
  2. Understand why clinical data is useful and also why it’s difficult to work with.
  3. Learn Exploratory Data Analysis techniques and use them to inform model building.
  4. Learn to assess logistic regression models using simple diagnostics.
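
As a rough illustration of objectives 3 and 4, the sketch below runs a few exploratory checks and fits a logistic regression in R. It is not taken from the workshop notebooks. The data are simulated, and the column names (`age`, `bmi`, `gender`, `cvd`) are hypothetical stand-ins, not actual SHHS variable names.

```r
# Simulated stand-in data. These columns are hypothetical and do NOT
# correspond to real SHHS variables; the SHHS data require a signed DAUA.
set.seed(42)
n <- 500
toy <- data.frame(
  age = round(rnorm(n, mean = 60, sd = 10)),
  bmi = rnorm(n, mean = 28, sd = 5),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)

# Simulate a binary outcome whose log-odds rise with age and BMI.
log_odds <- -9 + 0.1 * toy$age + 0.08 * toy$bmi
toy$cvd <- rbinom(n, size = 1, prob = plogis(log_odds))

# Exploratory checks: distributions, missingness, outcome by group.
print(summary(toy))
print(colSums(is.na(toy)))
print(table(gender = toy$gender, cvd = toy$cvd))

# Fit a logistic regression of the outcome on the covariates.
fit <- glm(cvd ~ age + bmi + gender, data = toy, family = binomial)
print(summary(fit))

# Exponentiated coefficients are odds ratios per unit change in a covariate.
odds_ratios <- exp(coef(fit))

# Simple diagnostic: confusion table at a 0.5 probability cutoff.
predicted_prob <- predict(fit, type = "response")
predicted_class <- factor(predicted_prob > 0.5, levels = c(FALSE, TRUE))
observed_class <- factor(toy$cvd, levels = c(0, 1))
confusion <- table(predicted = predicted_class, observed = observed_class)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(accuracy)
```

The 0.5 cutoff and in-sample accuracy are only the simplest possible diagnostics. With real clinical data, the exploratory step would normally shape which covariates enter the model before any fitting is done.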

The Dataset

We used the Sleep Heart Health Study dataset from the National Sleep Research Resource. The dataset covers approximately 5800 patients and over 3000 covariates. We limited our students to a much smaller set of 17 covariates, including our outcome of interest, cardiovascular disease.

Please note that the dataset is not currently available in the repository. A Data Access and Use Agreement (DAUA, see below) needs to be filled out for each student who wishes to access the SHHS dataset.


  1. Students must have R/RStudio installed (see installation instructions)
  2. Students must fill out a Data Access and Use Agreement for NSRR
  3. Students must have training covering the basics of PHI and HIPAA (required by NSRR for their Data Access and Use Agreement)
  4. Students should clone or download the repo

Workshop Format

We designed the workshop to be a mix of didactic lectures and active learning exercises. Where possible, we had students work in groups to answer questions about the data. These activities included a data scavenger hunt using our EDA exploration app, and a logistic modeling exercise.

Day 1

Session | Lecture/Activity | Format | Duration
0 | Introduction, Logistics, groups assigned | NA | 30 min
1a | Biology of Sleep and Cardiovascular Disease | Lecture | 40 min
Break | Breaktime | NA | 15 min
1b | The Value of Clinical Data | Lecture | 15 min
2a | Clinical Data Quality Issues | Lecture | 40 min
Break | Lunch Break (with optional R install session) | NA | 90 min
2b | Exploring the Sleep Heart Health Study Dataset | Data | 60 min
3a | Applying the Clinical Wrangling Process: | Lecture | 45 min
3b | Logistic Regression Model Basics | R Notebook | 60 min

Day 2

Session | Lecture/Activity | Format | Duration
4a | Question/Answer session about Logistic Regression Notebook | Q&A | 50 min
4b | Assignment about race variable (assigned to groups) | Homework | 10 min

Day 3

Session | Lecture/Activity | Format | Duration
5a | Discussion about race as a covariate, sharing of findings | Discussion | 30 min
5b | Overview of hypertension and how it relates to Sleep Apnea/Cardiovascular Disease | Lecture/Discussion | 30 min
5c | Work on Final Report | In-class Lab time | 60 min

Day 4

Session | Lecture/Activity | Format | Duration
6a | Group presentations about covariate decisions and resulting model | R Notebook | 60 min
6b | Final Discussion and Wrap up | Discussion | 30 min


We are grateful for the incoming informatics students’ enthusiasm and patience. Thanks also to the NLM T15 Supplement in Data Science, without which we would not have had the opportunity to conceptualize, put together, and deliver this workshop. Thanks again to Susan Redline and the National Sleep Research Resource group, especially Dan Mobley, who helped us with the last-minute data use agreements.


This lesson material is shared under a Creative Commons Non-Commercial BY 3.0 license. All code is shared under an Apache 2.0 License.