How are Data Science and Systems Science Connected?

Ted Laderas

2/15/2018

Shameless Plug: Cascadia R Conference 2018

June 2, 2018 at CLSB (Collaborative Life Sciences Building)

https://cascadiarconf.com

R Language Statistical Community
For beginners, intermediate, and advanced people
- Workshops!
- Lightning Talks!
- Data Exploration Poster Session!
Great community, meet other people working with data
Special Student Rate! (includes Beer)

A gRadual intro to Shiny

Learn about making interactive visualizations/dashboards in R

March 6, 2018
Alchemy Code Labs
30 NW 10th Avenue
6:30 PM to 8:00 PM

Please RSVP at: https://www.meetup.com/portland-r-user-group/events/247752115/

Overview

Introduction
Setting up the problem: machine learning and systems science
Feature Engineering
- Networks in Cancer
Machine Learning Interpretability: an untapped opportunity
- LIME (Local Interpretable Model-Agnostic Explanations)
- Beyond LIME: Systems Science can do better

Introduction

Assistant Professor in Bioinformatics & Computational Biology, OHSU
- Systems science alum
Dissertation work:
- Network Analysis of Mutations in Cancer
Research Work:
- Systems Biology of Disease
  - Data Integration
  - Simple models for data integration: Boolean Networks
- Flow Cytometry Analysis
- Interactive Visualization
Active in the PDX R User Group

What is Data Science?

Shlomo Argamon: At its core, data science is about making sense of the world using data.

Encompasses techniques from:

Statistical Modeling
Machine Learning
Network Science
Way more

What about Systems Science and Data Science?

Carbone 2016: Further interdisciplinary advances and deeper insights will be needed for understanding:

interactions among connected heterogeneous entities (namely space-time-dependent heterogeneous data structures)
emergence of large-scale properties of interacting entities and clustering
multi-scale data-driven approaches to identify patterns, fundamental shapes and parameters at different levels of abstraction

We need to bring more interactions to Data Science!

Data Science/Machine Learning Workflow

Refinement and Feedback
- Problem Formulation
- Data Collection
- Data Cleaning
- Feature Engineering
- Feature Selection
- Model Learning
- Interpretation

What is Feature Engineering?

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

Let’s look at one type of feature engineering in Head and Neck cancer.

Background: Oncogenes

A common set of genes that, when altered, are associated with cancer
Alterations disrupt critical cellular processes that can result in cancer
- Cell Death
- Cell Proliferation
- DNA Repair
Not just one alteration causes cancer - multiple are needed

Oncogenic Collaboration and Hallmarks of Cancer

Many alterations, not just one, are involved in cancer, and they collaborate to disrupt cellular systems

Surrogate Legend

One problem: We don’t target unique alterations within patients

We look at fairly high-frequency alterations
But the majority of mutations are unique or low-frequency across people
Genomic data is very heterogeneous across patients
- Patient-specific mutations are the dark matter of precision medicine

Long-Tail-of-Cancer

Research Question

Do gene alterations in interacting proteins contribute to oncogenic collaboration in cancer?
Use Protein-Protein Interaction (PPI) networks to engineer new features for machine learning
Surrogate mutations: 1st degree oncogene-centered subnetworks to aggregate mutations/gene alterations

Surrogate Legend

Permutation Analysis on Networks

Which subnetworks are significant? Use permutation analysis to decide on a statistical cutoff.
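A minimal sketch of how such a permutation test could look. It assumes a null model in which the same number of mutations is scattered over randomly chosen genes; the gene names, the subnetwork, and the helper functions are hypothetical, and the null model used in the actual analysis may differ.

```python
import random


def subnetwork_count(mutated_genes, subnetwork):
    """Number of mutated genes that fall inside the subnetwork."""
    return len(set(mutated_genes) & set(subnetwork))


def permutation_p_value(mutated_genes, subnetwork, all_genes,
                        n_permutations=1000, seed=0):
    """Empirical p-value for the observed subnetwork mutation count.

    Assumed null model: the same number of mutated genes, drawn
    uniformly at random (without replacement) from all_genes.
    """
    rng = random.Random(seed)
    observed = subnetwork_count(mutated_genes, subnetwork)
    n_mutated = len(set(mutated_genes))
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        shuffled = rng.sample(all_genes, n_mutated)
        if subnetwork_count(shuffled, subnetwork) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical toy data: 100 genes, a 5-gene subnetwork, 6 mutated genes.
all_genes = [f"G{i}" for i in range(100)]
subnetwork = ["G0", "G1", "G2", "G3", "G4"]
mutated = ["G0", "G1", "G2", "G3", "G50", "G60"]

observed, p_value = permutation_p_value(mutated, subnetwork, all_genes)
print(observed, p_value)
```

A subnetwork would then be called significant when its empirical p-value falls below the chosen cutoff, ideally after correcting for the number of subnetworks tested.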

Surrogates incorporate long-tail mutations

White = unique/infrequently observed mutations, Dark Blue = frequently observed mutations

Surrogate Mutations

Surrogate Mutations are feature engineering!
Reduce the space of unique mutations by aggregating them into subnetworks
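The aggregation idea can be sketched in a few lines. This is an illustrative toy, not the pipeline from the original work: the network, gene names, and patient data are all hypothetical, and the sketch assumes a surrogate feature is simply the count of a patient's altered genes within an oncogene's first-degree neighborhood (the center gene included).

```python
from collections import defaultdict

# Hypothetical toy PPI network: undirected edges between placeholder genes.
ppi_edges = [
    ("ONC1", "GENE_A"),
    ("ONC1", "GENE_B"),
    ("ONC1", "GENE_C"),
    ("ONC2", "GENE_D"),
    ("ONC2", "GENE_E"),
]

# Hypothetical per-patient alterations. Every mutated gene is unique to
# one patient, mimicking the long tail of rare mutations.
patient_mutations = {
    "patient_1": {"GENE_A"},
    "patient_2": {"GENE_B", "GENE_D"},
    "patient_3": {"GENE_C"},
    "patient_4": {"GENE_E"},
}

oncogenes = ["ONC1", "ONC2"]  # centers of the first-degree subnetworks


def build_neighbors(edges):
    """Adjacency sets for an undirected network."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def surrogate_features(mutations, neighbors, centers):
    """Count each patient's altered genes inside each center's subnetwork."""
    features = {}
    for patient, genes in mutations.items():
        row = {}
        for center in centers:
            subnetwork = neighbors[center] | {center}
            row[center] = len(genes & subnetwork)
        features[patient] = row
    return features


neighbors = build_neighbors(ppi_edges)
features = surrogate_features(patient_mutations, neighbors, oncogenes)
for patient, row in sorted(features.items()):
    print(patient, row)
```

In this toy, no two patients share a mutated gene, yet patients 1 to 3 all receive a nonzero ONC1 feature. Five sparse gene-level columns collapse into two subnetwork-level columns that a model can learn from.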

Lesson: Feature Engineering needs Systems Approaches

Integration of knowledge/explanatory models in feature engineering

Machine Learning for Prediction

Classification Problem:

Classification Task

http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/

Big Data to Interpretability?

Deep Neural Networks - can we explain them?

Classification Task

http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/

What is interpretability?

We should think of interpretability as human simulatability. A model is simulatable if a human can take in input data together with the parameters of the model and in reasonable time, step through every calculation required to make a prediction (Lipton 2016).

Why interpretability is important

Issues of trust and bias plague machine learning and its applications!

Transparency Matters in Many Cases!

If the data scientist’s goal is to create automated processes that affect people’s lives, then he or she should regularly consider ethics in a way that academics in computer science and statistics, generally speaking, do not.

The more processes we automate, the more obvious it will become that algorithms are not inherently fair and objective, and that they need human intervention. (The Ethical Data Scientist)

Research into Interpretability is Just Starting

NIPS 2017: Interpreting, Explaining and Visualizing Deep Learning - Now What?

Proceedings of the 2017 ICML Workshop on Human Interpretability in Machine Learning (WHI 2017)

One effort: LIME

Marco Tulio Ribeiro: Local Interpretable Model-Agnostic Explanations
Why did one point get classified as X and not Y?
Perturb the input, and see what the classifier outputs
Build a linear model that approximates the classifier locally, near that point
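The perturb-and-fit idea can be sketched without any libraries. This is a simplified illustration, not the `lime` package or the exact published algorithm, which also uses interpretable representations and feature selection. The `black_box` classifier, the Gaussian perturbations, and the exponential proximity kernel are assumptions made for the example.

```python
import math
import random


def black_box(x):
    """Hypothetical opaque classifier returning P(class 1).
    It depends strongly on x[0] and weakly (negatively) on x[1]."""
    score = 3.0 * x[0] - 0.5 * x[1]
    return 1.0 / (1.0 + math.exp(-score))


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def explain_locally(predict, instance, n_samples=500, scale=0.5,
                    kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model around one instance."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # 1. Perturb the input.
        sample = [v + rng.gauss(0.0, scale) for v in instance]
        # 2. See what the classifier outputs.
        targets.append(predict(sample))
        rows.append([1.0] + sample)  # leading 1.0 is the intercept column
        # 3. Weight each sample by its closeness to the instance.
        distance = math.dist(sample, instance)
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
    # 4. Weighted least squares: (X^T W X) beta = X^T W y.
    k = len(instance) + 1
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets))
            for i in range(k)]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[0], beta[1:]


intercept, coefs = explain_locally(black_box, [0.1, 0.2])
print("local feature weights:", coefs)
```

The signs and relative sizes of the fitted coefficients are the explanation: here the first feature pushes the prediction toward class 1 much more strongly than the second pushes it away.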

Highlighting importance of features in LIME

Feature Importance in Lime

Highlighting importance of features for image classification

LIME images

Beyond LIME: What’s missing from these explanations?

Interactions! Many ML methods, and explanations like LIME, treat variables as independent
Systems Science handles interactions very well
- Defines interactions precisely in terms of models
- Simulation and Modeling approaches
Let’s look at some ways of probing interactions
- Lots more opportunities to have an impact

Jakulin 2004 - Interactions via Entropy

Information Theory Based approach (much like Reconstructability Analysis)

Three way interaction
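A three-way interaction can be quantified with entropies alone. The sketch below computes interaction information as I(A,B;C) - I(A;C) - I(B;C), under the sign convention where positive values indicate synergy and negative values indicate redundancy. The function names and the toy data are assumptions for illustration, not code from the cited work.

```python
import math
from collections import Counter


def entropy(rows, columns):
    """Shannon entropy (in bits) of the joint distribution of the columns."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mutual_information(rows, x_cols, y_cols):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return (entropy(rows, x_cols) + entropy(rows, y_cols)
            - entropy(rows, x_cols + y_cols))


def interaction_information(rows, a, b, c):
    """I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C)."""
    return (mutual_information(rows, [a, b], [c])
            - mutual_information(rows, [a], [c])
            - mutual_information(rows, [b], [c]))


# Synergy: C = A XOR B. Neither A nor B alone tells you anything about C.
xor_rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]

# Redundancy: A, B, and C are copies of the same bit.
copy_rows = [(0, 0, 0), (1, 1, 1)]

print("XOR  :", interaction_information(xor_rows, 0, 1, 2))
print("copy :", interaction_information(copy_rows, 0, 1, 2))
```

For XOR the result is +1 bit: all of the information about C lives in the pair (A, B) and none in either variable alone. This is the kind of structure that a per-feature importance score cannot show. For the copied bit the result is -1 bit, reflecting pure redundancy.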

Tom Fiddaman: Simulation and Data Science

Complex interplay between big data and dynamic simulation

Dynamic models can make a black box more understandable (Fiddaman):

Can use Big Data streams as input
Posit relations between variables
Modeling varied states
Test effects of intervention
Is my model really related to the original question?

Call to Action

Data is heterogeneous
Models need to be interpretable
Systems science to the rescue!

Let’s Talk More!

Need to think about possible collaborations!

OHSU-PSU Research Faculty Mixer for collaboration

Feb 22 - 4:30 to 6:30 p.m.

Collaborative Life Sciences Building

2730 SW Moody Ave.

It’s all about telling stories

The Complex Systems and Data Science program offered by University of Vermont trains emerging data scientists to find, model, understand, and tell the stories of the patterns they uncover.

https://www.mastersportal.eu/studies/114025/complex-systems-and-data-science.html