Introduction to Bioinformatics

class: center, middle, inverse, title-slide

# Introduction to Bioinformatics
## PHE 427: Health Informatics
### Ted Laderas
### 2019-04-15

---

# What we'll do today

+ What is the Goal of Bioinformatics?
+ Simple Genetics
+ Simple Phenotyping
+ Genome Wide Association
+ In-Class Activity
+ Discussion

---

# What is Bioinformatics?

Using informatics methods to connect biology (Genotypes) with Phenotypes

- Biology (not just genes!)
    - genomics
    - gene expression
    - immune system state
    - gut bacteria (microbiome)
- Phenotypes: A disease or trait of interest
    - Type 2 Diabetes
    - What are some others?

---

# What are these informatics methods?

Many, many techniques!

- statistics 
- systems science
- machine learning
- systems biology
- visualization
- knowledge integration

---

# What else do you have to know?

- Basic Biology
    - Biological Regulation
- Basic Clinical Knowledge
- Limitations of the technology used
- Exciting (and stressful) because I learn new stuff every day!
- Most important quality: Curiosity!

---

# Learning Objectives (hint)

+ What is a SNP variant?
+ How do we define a disease of interest?
+ What is a Genome Wide Association Study (GWAS)?
    - How do we look for association?
+ Why are Odds Ratios important?

---
# Some Definitions before we start

+ Phenotype:
    - a condition we're interested in, such as disease
    - can also be a trait, such as height or BMI
    - varies across a population
+ Genotype:
    - the genetic/biological makeup of a subject we're interested in,
    - varies across a population
+ Treatment:
    - what we can do about this given phenotype and genotype
    - can be a drug or other kind of therapy (like chemotherapy)

---

# Let's look at one example - 23andMe

.footnote[https://www.23andme.com/howitworks/]

---

# A personalized report on a variant

.footnote[https://permalinks.23andme.com/pdf/samplereport_genetichealth.pdf]

---

# What else matters in Alzheimer's?

.footnote[https://permalinks.23andme.com/pdf/samplereport_genetichealth.pdf]

---

# How did we figure this out?

Genome Wide Association Studies (GWAS)

---

# The Fundamental Question of GWAS

.pull-left[If I have a variant in my DNA, am I more likely to have a disease?

Need to look across a population to answer this question! ]

.pull-right[<img src="image/gwas_infographic.jpg" height = "100%" width="100%">

]
.footnote[https://www.genome.gov/20019523/genomewide-association-studies-fact-sheet/]

---

# Genotype

---

# DNA Structure

.footnote[https://cnx.org/contents/8v2Xzdco@3/The-Structure-of-DNA]

---

# DNA Fun Facts

.pull-left[DNA information is encoded in 4 bases (A, C, G, T)

Each strand is made to read in a particular direction.

Our *coding strand* is paired with another one, called a *complementary strand*, which is read in the opposite direction and has the complementary base.

We'll only consider the *coding strand* for right now.
]

.pull-right[<img src="image/AT_base_pair_jypx3.png" height = "230px">
<img src="image/GC_base_pair_jypx3.png" height = "230px">]

.footnote[https://en.wikipedia.org/wiki/Complementarity_(molecular_biology)]
---

# What is a SNP?

.pull-left[A *SNP* (Single Nucleotide Polymorphism) is a single location in the genome where we observe *variation across a population*. We need three pieces of information to characterize a SNP:

1. the chromosome it's on (chr 2)  
2. the linear position on the chromosome (3490525)  
3. the variant of interest (T)

By convention, a *SNP variant* has to occur in at least 1% of the population being studied. The variant is defined relative to the base that is most frequently observed, which is called the *wild-type*.]

.pull-right[<img src="image/snpDavidHall.png" height="100%" width="100%">
]
.footnote[Snp Image by David Hall / CC Licensed]
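
The three pieces of information above can be held in a small record. This is only a sketch: the class name `Snp` and its field names are illustrative assumptions, not a standard format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    """One SNP variant, described by the three pieces of information on this slide."""
    chromosome: str  # the chromosome it is on, e.g. "2"
    position: int    # linear position on the chromosome
    variant: str     # the variant base of interest


# The example values from the slide: chr 2, position 3490525, variant T.
example_snp = Snp(chromosome="2", position=3490525, variant="T")
print(example_snp)
```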
---
# We use a lot of SNPs in GWAS

Most GWAS use about 1 million of them!

Highly dispersed across the genome

---

# Parents and Genetics

Because we have two copies of each chromosome (one from each parent), we have two copies (values) at any SNP. Each of these copies is called an *allele*.

The values of both copies together make up my *genotype* for that SNP.

For example, if my dad gave me a copy (allele) with an A, and my mom gave me a copy (allele) with a G, my genotype at that SNP location would be AG.
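
As a rough sketch of this idea, a genotype can be written as the pair of alleles, one from each parent. The function name `genotype` is a hypothetical helper. Sorting the two letters is an assumption made here so that AG and GA are treated as the same genotype.

```python
def genotype(paternal_allele, maternal_allele):
    """Combine the two inherited alleles into a genotype string such as 'AG'."""
    # Sort so the result does not depend on which parent gave which allele.
    return "".join(sorted([paternal_allele, maternal_allele]))


my_genotype = genotype("A", "G")  # dad gave an A, mom gave a G
print(my_genotype)                # AG
```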

---

# A Simplification

.pull-left[Let's just consider whether we have one or more of our variant of interest in our genotype.

So, if we are interested in the T SNP variant where the *wild-type* is A, we group the individuals who have at least one T (TT and AT) into one category, and the individuals who have only the wild-type (AA), which is what the majority of the population has, into the other.]

.pull-right[
<table>
<tbody>
  <tr>
   <td style="text-align:left;"> TT </td>
   <td style="text-align:right;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AT </td>
   <td style="text-align:right;"> 100 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> AA </td>
   <td style="text-align:right;"> 300 </td>
  </tr>
</tbody>
</table>

Becomes

<table>
<tbody>
  <tr>
   <td style="text-align:left;"> T </td>
   <td style="text-align:right;"> 110 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> A </td>
   <td style="text-align:right;"> 300 </td>
  </tr>
</tbody>
</table>
]
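
The collapsing step in the tables above could be sketched like this. The counts are the ones on the slide; the function name and the `carrier` / `non_carrier` labels are illustrative assumptions.

```python
# Genotype counts from the first table on this slide.
genotype_counts = {"TT": 10, "AT": 100, "AA": 300}


def collapse_by_variant(counts, variant):
    """Group genotypes into 'at least one copy of the variant' versus 'none'."""
    carriers = sum(n for g, n in counts.items() if variant in g)
    non_carriers = sum(n for g, n in counts.items() if variant not in g)
    return {"carrier": carriers, "non_carrier": non_carriers}


collapsed = collapse_by_variant(genotype_counts, "T")
print(collapsed)  # {'carrier': 110, 'non_carrier': 300}
```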

---
# Working definitions

- *variant* - the less frequently observed base
- *wild-type* - the more frequently observed base

What is the variant in this table?
What is the wild-type?

---

# Quiz Yourself

If I am interested in individuals that have at least one T at Chromosome 2 at position 100430303, I am interested in:

A) a SNP  
B) a variant  
C) Both

---

# Quiz Yourself

What is the *variant* in this table?

---
layout: false
# Phenotyping

---

# Phenotyping: Disease Identification

.pull-left[- May come from clinical diagnosis
    - Needs to be clearly defined
- Can be really difficult, especially in mental health
- The more clearly we define these differences, the more successful we are
    - Is there a quantitative cutoff?
    - For example: Obesity and Body Mass Index (BMI)]
    
.pull-right[<img src="image/bmi-histogram.png" height="100%" width="100%">]

.footnote[https://thomaselove.github.io/431notes/dataviz.html]
---
# Phenotyping: PheKB

[Database from the eMERGE consortium](https://phekb.org) defining ways to pull phenotyped groups out of an EHR:

.footnote[https://phekb.org/phenotype/type-2-diabetes-mellitus]
---

# You need to find patients

GWAS investigators need to find patients who are willing to participate in these studies.

Where do they come from?

- Clinician Referrals
- Study Coordinators
- Need to protect identity for privacy and ethical reasons

---

# Recruitment for a GWAS is Hard

.pull-left[Way too much bias towards people of European ancestry in GWAS!

+ [Genomic Analysis Reveals Why Asthma Inhalers Fail Minority Children](https://www.ucsf.edu/news/2018/03/410041/genomic-analysis-reveals-why-asthma-inhalers-fail-minority-children)

We can do better in recruiting diverse populations!

+ [Genomics is Failing on Diversity](https://www.nature.com/news/genomics-is-failing-on-diversity-1.20759)]

.pull-right[
<img src="image/genomics_ethnicity.jpg" height="500px" />
]
---

# Quiz Yourself

Which of the following are phenotypes?

a) BMI  
b) A SNP  
c) Schizophrenia  
d) TV show preference

---
# Frequency in underserved populations

SNPedia Link for Rs9939609: https://www.snpedia.com/index.php/Rs9939609

Scroll Down to see frequency chart

---
# Variant Frequency in Population

![SNPedia Frequencies](image/snpFrequencies.png)

For frequency, just report percent of variant (smallest number)

---
# Association

.footnote[https://www.genome.gov/20019523/genomewide-association-studies-fact-sheet/]
---

# GWAS is a Test

GWAS is a test of whether genotype (SNP) is predictive of phenotype (disease)

Think of a SNP variant as being a diagnostic test.

We want to quantify how good our test is. Much like in epidemiology, we will construct a 2x2 table to find associations of genotypes and phenotypes.
---
# 2x2 Tables for Genotype/Phenotype Association

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> SNP+ </th>
   <th style="text-align:right;"> SNP- </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Disease+ </td>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Disease- </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 100 </td>
  </tr>
</tbody>
</table>

- Relate genetic variations (SNPs) to disease, one position at a time
    - **(SNP+/Disease+)**
    - (SNP+/Disease-)
    - (SNP-/Disease+)
    - **(SNP-/Disease-)**

---
# Statistical Testing for Association

Are the proportions of people who have the SNP variant the same in the case (Disease+) and control (Disease-) groups?

.footnote[By Lasse Folkersen - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=18062562]
---
# A difference in proportions

Here is the 2x2 table and a proportional barplot, where we are looking at the difference between the proportions of SNP+ in the Disease+ versus Disease- groups.

Just looking at the barplot, do you think there is a difference in the proportions between Disease+ and Disease-?

.pull-left[
<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> SNP+ </th>
   <th style="text-align:right;"> SNP- </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Disease+ </td>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Disease- </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 100 </td>
  </tr>
</tbody>
</table>
  ]

.pull-right[
![](index_files/figure-html/unnamed-chunk-12-1.png)
]
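
As a quick check on what the barplot shows, the two proportions can be computed directly from the table. This is a minimal sketch; the nested-dictionary layout is just one convenient way to hold the 2x2 table.

```python
# The 2x2 table from this slide.
table = {
    "Disease+": {"SNP+": 20, "SNP-": 4},
    "Disease-": {"SNP+": 5, "SNP-": 100},
}


def snp_positive_proportion(row):
    """Fraction of the people in one disease group who carry the variant."""
    return row["SNP+"] / (row["SNP+"] + row["SNP-"])


prop_cases = snp_positive_proportion(table["Disease+"])     # 20 / 24
prop_controls = snp_positive_proportion(table["Disease-"])  # 5 / 105
print(f"SNP+ among Disease+: {prop_cases:.3f}")     # 0.833
print(f"SNP+ among Disease-: {prop_controls:.3f}")  # 0.048
```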

---

# Which SNPs are important?

We scan each SNP and conduct a statistical test of association (Fisher's Exact or Chi-Squared) between SNP variant and disease status, which produces a p-value.

We need to set a very strict threshold for significance. Then we scan across the genome and find the really small p-values:

<img src="image/Manhattan_Plot.png" height = "350px">
.footnote[https://en.wikipedia.org/wiki/Manhattan_plot#/media/File:Manhattan_Plot.png]
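
For a single SNP, the Chi-Squared version of the test can be sketched with the standard library alone. This is an illustration under stated assumptions, not the method of any particular GWAS package: it uses the Pearson statistic without a continuity correction, and with cell counts as small as those in the example table, Fisher's Exact test would usually be the safer choice.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    a = SNP+/Disease+, b = SNP-/Disease+, c = SNP+/Disease-, d = SNP-/Disease-
    Assumes every row and column total is greater than zero.
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# The example 2x2 table used in these slides.
statistic, p_value = chi_squared_2x2(20, 4, 5, 100)
print(f"chi-squared = {statistic:.2f}, p-value = {p_value:.2e}")
```

In a real GWAS this test would be repeated once per SNP, and the resulting p-values are what a Manhattan plot displays.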

---

# Odds Ratio

Odds Ratio (OR): Very useful measure for assessing degree of association between disease and SNP variant.

Compares the odds of having the disease among people with the variant (SNP+) to the odds among people without it (SNP-). SNP+ is when someone has the variant and SNP- is when they don't.

`\(OR = \frac{Odds(SNP+/Disease+)}{Odds(SNP-/Disease+)}\)`

---
# Probabilities

Odds are different than probabilities. Probabilities are expressed in terms of the total:

If we have 10 cases out of 100 total patients, the probability is

- probability of being a case = `\(\frac{numCases}{numTotalPatients}\)`  
- probability of being a case = `\(\frac{10}{100} = 0.1\)`

---
# Odds are different than probabilities

- Odds of being a case = `\(\frac{numCases}{numControls}\)`  
- Odds of being a case = `\(\frac{10}{90} = 0.1111\)`

Not quite the same as a probability!
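
The difference is easy to see side by side. A small sketch using the same 10 cases and 90 controls (the function names are illustrative):

```python
def probability(num_cases, num_controls):
    """Cases divided by the total number of patients."""
    return num_cases / (num_cases + num_controls)


def odds(num_cases, num_controls):
    """Cases divided by controls."""
    return num_cases / num_controls


p = probability(10, 90)
o = odds(10, 90)
print(f"probability = {p:.4f}")  # 0.1000
print(f"odds        = {o:.4f}")  # 0.1111
```

The two are linked by odds = p / (1 - p), so they are close when the disease is rare and drift apart as it becomes more common.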

---

# Odds Ratio

`\(OR = \frac{Odds(SNP+/Disease+)}{Odds(SNP-/Disease+)}\)`
---

# Odds of Having Disease+ when SNP+

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> SNP+ </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Disease+ </td>
   <td style="text-align:right;"> 20 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Disease- </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
</tbody>
</table>

Odds(SNP+/Disease+) = `\(\frac{Number(SNP+/Disease+)}{Number(SNP+/Disease-)}\)`

Odds(SNP+/Disease+) = `\(\frac{20}{5} = 4\)`

---
# Odds of Having Disease+ when SNP-

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> SNP- </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Disease+ </td>
   <td style="text-align:right;"> 4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Disease- </td>
   <td style="text-align:right;"> 100 </td>
  </tr>
</tbody>
</table>

Odds(SNP-/Disease+) = `\(\frac{Number(SNP-/Disease+)} {Number(SNP-/Disease-)}\)`

Odds(SNP-/Disease+) = `\(\frac{4}{100} = 0.04\)`

---
# Odds Ratio

OR = `\(\frac{Odds(SNP+/Disease+)}{Odds(SNP-/Disease+)}\)`  
OR = `\(\frac{4}{0.04} = 100\)`
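
The same calculation as a short sketch. It uses the cross-product form (a × d) / (b × c), which is algebraically the same as dividing the two odds; the argument names are illustrative.

```python
def odds_ratio(snp_pos_dis_pos, snp_neg_dis_pos, snp_pos_dis_neg, snp_neg_dis_neg):
    """Odds of disease when SNP+ divided by odds of disease when SNP-.

    (a / c) / (b / d) rearranges to (a * d) / (b * c).
    Assumes no cell in the denominator is zero.
    """
    return (snp_pos_dis_pos * snp_neg_dis_neg) / (snp_neg_dis_pos * snp_pos_dis_neg)


# Counts from the 2x2 table: 20, 4, 5, 100.
example_or = odds_ratio(20, 4, 5, 100)
print(example_or)  # 100.0
```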

---
# Interpreting the Odds Ratio:

Compares the odds of disease for people with the variant (SNP+) to the odds for people without it (SNP-):

- OR > 1: Disease/SNP association
      - This is what we're usually looking for in GWAS
      - If the OR is 2, the odds of having the disease (Disease+) are twice as high for people with the variant (SNP+) as for people without it (SNP-)
- OR = 1: There is no association between having the disease and the SNP variant
- OR < 1: Having the SNP variant means you are less likely to have disease
      - Protective effect: also interesting

---

# We need to repeat our study

- Need to do a validation study in a separate population to confirm our associations. Why?
- Need to understand whether it is valid for larger populations outside of the one studied

---

# Evidence to consider in a GWAS

- For your SNP, what is the odds ratio?
    - Rule of thumb: big OR is at least 2
- How big was the population used in your study?
- Who was the population used in your study?

---

# In-Class Assignment

Each of you will pick a SNP to investigate. Use the

Fill out the Google Form here: http://bit.ly/phe427snp

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Number </th>
   <th style="text-align:left;"> Phenotype </th>
   <th style="text-align:left;"> SNP.name </th>
   <th style="text-align:left;"> Variant </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:left;"> Obesity </td>
   <td style="text-align:left;"> rs7185735 </td>
   <td style="text-align:left;"> G </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:left;"> Schizophrenia </td>
   <td style="text-align:left;"> rs2237457 </td>
   <td style="text-align:left;"> T </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:left;"> Cardiovascular Risk </td>
   <td style="text-align:left;"> rs6843082 </td>
   <td style="text-align:left;"> G </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:left;"> Cardiovascular Risk </td>
   <td style="text-align:left;"> rs6025 </td>
   <td style="text-align:left;"> T </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:left;"> Type 2 Diabetes Mellitus </td>
   <td style="text-align:left;"> rs7903146 </td>
   <td style="text-align:left;"> T </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:left;"> Scoliosis </td>
   <td style="text-align:left;"> rs11190870 </td>
   <td style="text-align:left;"> T </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:left;"> Alcohol Dependence </td>
   <td style="text-align:left;"> rs75433892 </td>
   <td style="text-align:left;"> A </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:left;"> Lupus </td>
   <td style="text-align:left;"> rs7329174 </td>
   <td style="text-align:left;"> G </td>
  </tr>
</tbody>
</table>

Important: to be graded for this assignment, you must provide your name on the Google Form!

---
# Resources Needed for Assignment

- GWAS Catalog: https://www.ebi.ac.uk/gwas/
- SNPedia (for underserved population frequencies): https://www.snpedia.com/index.php/SNPedia

---

# Discussion

1) How confident were you in your SNP variant?  
2) What evidence made you feel confident/less confident? 
3) From SNPedia, examine how common the variant is in an underserved population of interest

---

# Our statistics need to be better

- There are a lot of SNPs: over 3 million!
- Need to worry about false associations
- Multiple Comparison Adjustment
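
One common multiple comparison adjustment is the Bonferroni correction, which divides the significance level by the number of tests. The sketch below assumes a conventional level of 0.05 and one million tested SNPs; both numbers are illustrative assumptions.

```python
alpha = 0.05            # assumed significance level for a single test
num_snps = 1_000_000    # assumed number of SNPs tested

# Without adjustment, this many SNPs would pass by chance alone
# even if none were truly associated with the disease.
expected_false_positives = alpha * num_snps

# Bonferroni: each individual test must meet a much stricter threshold.
bonferroni_threshold = alpha / num_snps

print(f"expected false positives: {expected_false_positives:.0f}")  # 50000
print(f"per-SNP threshold:        {bonferroni_threshold:.1e}")       # 5.0e-08
```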

---
# One Caveat

- We may find a region associated with our disease, but it's not the same as knowing the mechanism.

- We need to do further investigations of these regions to narrow the cause down.

- Is the SNP causing a functional difference? Can try to use gene editing to assess its effect in animal models.

- [Following up on a GWAS](https://www.genome.gov/pages/about/od/opg/designinggeneticists/schanock-genomic_technologies.pdf)

---
# Single gene SNPs are not enough

Human genetics is complex: we need to understand how combinations of variants work together.

- Polygenic Risk Scores (multiple genes and SNPs)
    - Use multiple SNPs as a test for a disease
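
One simple way such a score is often built is as a weighted sum: for each SNP, count the copies of the variant a person carries (0, 1, or 2) and weight that count by the log of the SNP's odds ratio. The sketch below assumes that form. The SNP names, odds ratios, and genotype are made-up values for illustration, not results from any study.

```python
import math

# Hypothetical odds ratios per copy of the variant (illustrative values only).
odds_ratios = {"snp_a": 1.5, "snp_b": 1.2, "snp_c": 0.8}

# Hypothetical person: number of copies (0, 1, or 2) of each variant.
variant_copies = {"snp_a": 2, "snp_b": 1, "snp_c": 0}


def polygenic_risk_score(copies, ors):
    """Sum of (copies of the variant) x log(odds ratio) over all SNPs."""
    return sum(copies[snp] * math.log(ors[snp]) for snp in ors)


score = polygenic_risk_score(variant_copies, odds_ratios)
print(round(score, 3))
```

Variants with an OR above 1 push the score up and protective variants (OR below 1) pull it down, so a higher score suggests higher genetic risk under these assumptions.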

---
# Open Questions

- How do we make GWAS more equitable for all?  
    - Ethical issues of privacy and equity
- How do we make SNP information accessible and understandable?
- Need to make lots of information readily understandable
    - Clinicians
    - Consumers
- Need for transparent and open science!