Ted Laderas, Jessica Minnier, Thomas Frohwein

2019-01-25

- BioData Club
- Ted Laderas
- Jessica Minnier
- Thomas Frohwein

This workshop adheres to the BioData Club Code of Conduct.

This is done to maintain a psychologically safe and inclusive environment for everyone. Please email me at laderast@ohsu.edu or text me (503-481-8470) if you see any potential violations.

If you violate the CoC, you may be asked to leave.

- Put your Post-it up when you’re ready!
- If someone comes in late, please show them where the slides are

- Look at NHANES data
- Understand how the variables are defined
- Understand associations between outcome and variable
- Understand interactions between variables
- Share insights about the data with each other

- Further exposure to methods used by others
- Just getting some more hands-on experience.
- A better way to work up my experimental data
- Learn more about exploratory data analysis
- Some confidence that I can actually learn this and a kick in the pants to get started! I’ve taken stats classes, but they were many years ago.
- To get better at EDA with my work.
- Just looking to continue learning.
- Learning habits and approaches from other users. Also willing to help out newer users.
- Interested to see how an information/data scavenger hunt is set up
- I mostly work with data visualization and don’t get to do much analysis, so I need practice!
- Hands-on practice with EDA
- The ability to do EDA in an open source platform.
- A feel for what it’s like to do exploratory analysis.
- More data science skills to apply to the working world.
- To get better at analyzing data
- I would like to get more comfortable with data science tools and methods. I would also like to formulate a side-project that I can build on in the near future.
- A better grasp of the practice of data science and experience that can be applied to my career.
- Seeing different approaches and tools, collaboration
- To get an introduction to the mathematical methods behind analyzing and exploring data.
- Increase my comfort with R and statistical analyses!
- Seeing how burro works again because I didn’t have a laptop for the Data Wrangling Workshop.

**N**ational **H**ealth **A**nd **N**utrition **E**xamination **S**urvey

- Assess health/nutritional status of adults/children in the United States
- Types of Survey Questions:
  - Demographic (Age, Race, Gender, many more…)
  - Socioeconomic (Marriage Status, Household Income, Education)
  - Dietary (Foods consumed, dietary supplements)
  - Health (Body Mass Index, Sleep Trouble, Depression)

- We’re not going to look at all of NHANES.
- We’re looking at an extract from two years of the survey (which years?)
- We’re ignoring how participants were chosen/sampled from the larger population
- We’ll talk a little about this later.

We can understand an outcome and look at its association with measured variables in the data.

Let’s look at three outcomes today:

- Depression
- Type 2 Diabetes
- Physical Activity

Get into groups by your chosen outcome. Introduce yourselves, and pair off within your groups.

Come up with one question about your outcome you’re curious about.

What do you expect is the case?

See if you can answer it!

- Pioneered by John Tukey
- Detective work on your data
- An *attitude* towards data, not just techniques
- ‘Find patterns, reveal structure, and make tentative model assessments’ (Behrens)

“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” - John Tukey, *Exploratory Data Analysis*

- Need to be aware of issues in the data!

- Visualization is a gateway
- Understand the issues rather than focus on coding right now
- Build your foundation

We’ll start exploring the data immediately!

Go to the apps below:

We’ll separate the scavenger hunt by outcome, and we’ll ask questions, and then come back to present.

- Overview
- Categorical Variables
- Continuous Variables

- Seeing how many variables are in the dataset and which type
- Seeing missing values and complete cases
- Looking up a variable in the data dictionary

- What values are missing from the dataset overall? (Visual Summary)
- Are any numeric values skewed in distribution? (Tabular Summary)
- How is the variable defined? (Data Dictionary)
- What are the permissible values? (Data Dictionary)

- How many categorical variables are there? (in R, we call them factors)
- How many missing cases are there for your outcome?
- What is the mean age for the dataset?
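
For anyone curious what these overview questions look like outside the apps, here is a minimal sketch in plain Python. The rows are made up for illustration and are not real NHANES values; the column names (`Age`, `Gender`, `Depressed`) and the helper `column_kind` are assumptions for the example, not part of the workshop tooling.

```python
from statistics import mean

# Made-up rows for illustration only; these are NOT real NHANES values.
# None marks a missing answer.
rows = [
    {"Age": 34, "Gender": "female", "Depressed": "None"},
    {"Age": 51, "Gender": "male", "Depressed": None},
    {"Age": 27, "Gender": "female", "Depressed": "Several"},
    {"Age": 62, "Gender": "male", "Depressed": "Most"},
    {"Age": 45, "Gender": "female", "Depressed": None},
]


def column_kind(rows, name):
    """Hypothetical helper: call a column categorical if all of its
    non-missing values are strings, otherwise numeric."""
    values = [r[name] for r in rows if r[name] is not None]
    return "categorical" if all(isinstance(v, str) for v in values) else "numeric"


# How many categorical variables are there?
kinds = {name: column_kind(rows, name) for name in rows[0]}
n_categorical = sum(kind == "categorical" for kind in kinds.values())

# How many missing cases are there for the outcome?
n_missing_outcome = sum(r["Depressed"] is None for r in rows)

# How many complete cases (no missing value in any column)?
n_complete = sum(all(v is not None for v in r.values()) for r in rows)

# What is the mean age (ignoring missing ages)?
mean_age = mean(r["Age"] for r in rows if r["Age"] is not None)

print(kinds)
print(n_categorical, n_missing_outcome, n_complete, mean_age)
```

Note that the string `"None"` is a survey answer (zero days), while the Python value `None` marks a missing answer; mixing the two up is exactly the kind of issue the data dictionary helps catch.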

- Should we add a categorical variable to our model?
- Does my categorical variable have predictive value?
- Does adding my variable affect the number of cases I can analyse?
- Is my variable missing at random or not at random?
- Is my categorical variable confounded with another categorical variable?

- What percentages exist for my categorical variable? (Single Category)
- Is my variable associated with outcome? (Category/Outcome)
- Is my variable associated with other variables? (Crosstab)
- Are the missing values of my variable evenly distributed? (Missing Data)
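
Two of these questions can be sketched in a few lines of plain Python: the percentage in each category, and whether missing values pile up in particular groups. The data below is made up, and the column names and the helpers `percentages` and `missing_rate_by` are hypothetical names chosen for the example.

```python
from collections import Counter

# Made-up rows for illustration only; these are NOT real NHANES values.
# None marks a missing answer.
rows = [
    {"Education": "College", "Depressed": "None"},
    {"Education": "College", "Depressed": "Several"},
    {"Education": "High School", "Depressed": None},
    {"Education": "High School", "Depressed": "Most"},
    {"Education": "High School", "Depressed": None},
    {"Education": None, "Depressed": "None"},
]


def percentages(values):
    """Percentage of observations falling in each level."""
    counts = Counter(values)
    total = sum(counts.values())
    return {level: 100 * n / total for level, n in counts.items()}


def missing_rate_by(rows, group, target):
    """Fraction of rows with a missing `target`, within each level of `group`.
    Rows where `group` itself is missing are skipped."""
    totals, missing = Counter(), Counter()
    for r in rows:
        if r[group] is None:
            continue
        totals[r[group]] += 1
        if r[target] is None:
            missing[r[group]] += 1
    return {level: missing[level] / totals[level] for level in totals}


# Single Category: what percentages exist for Education?
education_pct = percentages(r["Education"] for r in rows if r["Education"] is not None)

# Missing Data: is missingness in Depressed spread evenly across Education?
missing_by_education = missing_rate_by(rows, "Education", "Depressed")

print(education_pct)
print(missing_by_education)
```

If the missing rate differs a lot between groups, as it does in this toy data, that is a hint the variable may not be missing at random.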

Do people with the `most` days of `LittleInterest` also have the `most` days of Depression?
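
One way to approach a question like this is a crosstab with row proportions: within each level of `LittleInterest`, what share of people fall in each depression level? The sketch below uses plain Python and made-up pairs; the level names (`None`, `Several`, `Most`) and the helpers `crosstab` and `row_proportions` are assumptions for illustration.

```python
from collections import Counter

# Made-up (LittleInterest, Depressed) pairs for illustration only;
# these are NOT real NHANES values.
pairs = [
    ("Most", "Most"), ("Most", "Most"), ("Most", "Several"),
    ("Several", "Several"), ("Several", "None"),
    ("None", "None"), ("None", "None"), ("None", "Several"),
]


def crosstab(pairs):
    """Count how often each (row level, column level) combination occurs."""
    table = {}
    for row_level, col_level in pairs:
        table.setdefault(row_level, Counter())[col_level] += 1
    return table


def row_proportions(table):
    """Convert counts to proportions within each row, so rows sum to 1."""
    return {
        row: {col: n / sum(counts.values()) for col, n in counts.items()}
        for row, counts in table.items()
    }


table = crosstab(pairs)
props = row_proportions(table)

for row, cols in props.items():
    print(row, {col: round(p, 2) for col, p in cols.items()})
```

Row proportions make groups of different sizes comparable. An association like this one does not show that one variable causes the other.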