Data Exploration of NHANES
Ted Laderas, Jessica Minnier, Thomas Frohwein
2019-01-25
Introductions
- BioData Club
- Ted Laderas
- Jessica Minnier
- Thomas Frohwein
Please Note
This workshop adheres to the BioData Club Code of Conduct.
This is done to maintain a psychologically safe and inclusive environment for everyone. Please email me at laderast@ohsu.edu or text me (503-481-8470) if you see any potential violations.
If you violate the CoC, you may be asked to leave.
These Slides are Here
- Put your post it up when you’re ready!
- If someone comes in late, please show them where the slides are
Our Overall Goal
- Look at NHANES data
- Understand how the variables are defined
- Understand associations between outcome and variable
- Understand interactions between variables
- Share insights about the data with each other
Why are we here?
- Further exposure to methods used by others
- Just getting some more hands-on experience.
- A better way to work up my experimental data
- Learn more about exploratory data analysis
- Some confidence that I can actually learn this and a kick in the pants to get started! I’ve taken stats classes, but they were many years ago.
- To get better at EDA with my work.
- Just looking to continue learning.
- Learning habits and approaches from other users. Also willing to help out newer users.
- Interested to see how an information/data scavenger hunt is set up
- I mostly work with data visualization and don’t get to do much analysis, so I need practice!
- Hands-on practice with EDA
- The ability to do EDA in an open source platform.
- A feel for what it’s like to do exploratory analysis.
- More data science skills to apply to the working world.
- To get better at analyzing data
- I would like to get more comfortable with data science tools and methods. I would also like to formulate a side-project that I can build on in the near future.
- A better grasp of the practice of data science and experience that can be applied to my career.
- Seeing different approaches and tools, collaboration
- To get an introduction to the mathematical methods behind analyzing and exploring data.
- Increase my comfort with R and statistical analyses!
- Seeing how burro works again because I didn’t have a laptop for the Data Wrangling Workshop.
What is NHANES and why are we looking at it?
- National Health And Nutrition Examination Survey
- Assess health/nutritional status of adults/children in the United States
- Types of Survey Questions:
- Demographic (Age, Race, Gender, many more…)
- Socioeconomic (Marriage Status, Household Income, Education)
- Dietary (Foods consumed, dietary supplements)
- Health (Body Mass Index, Sleep Trouble, Depression)
Please Note
- We’re not going to look at all of NHANES.
- We’re looking at an extract from two years of the survey (which years?)
- We’re ignoring how particpants were chosen/sampled from the larger population
- We’ll talk a a little about this later.
NHANES is a valuable dataset in many ways
We can understand an outcome and look at its association with measured variables in the data.
Let’s look at three outcomes today:
- Depression
- Type 2 Diabetes
- Physical Activity
Before we Start
Get into groups by your chosen outcome. Introduce yourselves, and pair off within your groups
Come up with one question about your outcome you’re curious about.
What do you expect is the case?
See if you can answer it!
What is Exploratory Data Analysis?
- Pioneered by John Tukey
- Detective work on your data
- An attitude towards data, not just techniques
- ‘Find patterns, reveal structure, and make tenative model assessments (Behrens)’
Remember
“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” - John Tukey, Exploratory Data Analysis
Why Look at your Data?
- Need to be aware of issues in the data!
Why Visualization?
EDA is about Visualization First
- Visualization is a gateway
- Understand the issues, not focus on coding right now
- Build your foundation
Running the Explorer App
We’ll start exploring the data immediately!
Go to the apps below:
We’ll separate the scavenger hunt by outcome, and we’ll ask questions, and then come back to present.
Map your questions to a tab:
- Overview
- Categorical Variable
- Continuous Variables
What is the Overview Tab for?
- Seeing how many variables are in the dataset and which type
- Seeing missing values and complete cases
- Looking up a variable in the data dictionary
Overview Tab
- What values are missing from the dataset overall? (Visual Summary)
- Are any numeric values skewed in distribution? (Tabular Summary)
- How is the variable defined? (Data Dictionary)
- What are the permissible values? (Data Dictionary)
Let’s try it!
- How many categorical variables are there? (in R, we call them factors)
- How many missing cases are there for your outcome?
- What is the mean age for the dataset?
What is the Category Tab for?
- Should we add a categorical variable to our model?
- Does my categorical variable have predictive value?
- Does adding my variable affect the number of cases I can analyse?
- Is my variable missing at random or not at random?
- Is my categorical variable confounded with another categorical variable?
Categorical Tab
- What percentages exist for my categorical variable? (Single Category)
- Is my variable associated with outcome? (Category/Outcome)
- Is my variable associated with other variables? (Crosstab)
- Are the missing values of my variable evenly distributed? (Missing Data)
Categorical Example
Do people with the most
days of LittleInterest
also have the most
days of Depression?
Categories: Let’s try it!
- How many categories are there for your outcome?
- What is the category with the largest counts for your outcome?
- Do the proportions of people with your outcome look the same for those who use marijuana versus those who don’t use it?
Continuous Tab
- What is the distribution of my categorical variable? (Single Continuous)
- Is my continuous variable associated with outcome? (Continuous/Outcome)
- Is my continuous variable associated with another categorical variable? (Boxplot)
- Is my continous variable associated with another continous variable? (Correlation)
- Is my continuous variable missing values? (Correlation)
Continuous Scatter
If you get less hours of sleep per night, does that mean you have a higher BMI?
Continuous Boxplot
If you have a lot of depressed episodes, do you also get less sleep?
Continuous: Let’s Try it!
- Is there something strange about
Age
in the dataset?
- As you get older, does your
BMI
go up?
- Do
Depressed
people have a higher systolic blood pressure than non depressed people?
Let’s start the scavenger hunt
Let’s learn from each other
Each group should present the findings from 2-3 interesting questions:
- Where did you find it in the app?
- What variables did you look at?
- What did you expect?
- What did you find?
Some Final Notes in NHANES
- NHANES was designed with a very special sampling design, that was meant to be representative to the United States
- There are weights that you must utilize for real modeling and analysis
Please!
Fill out our post survey form - we need this info to do more workshops!
Post Survey Form