Ted's Blog

    Using Synthetic Data for Teaching Data Science

    Hi everyone, our paper, Teaching data science fundamentals through realistic synthetic clinical cardiovascular data, is now available to read on bioRxiv.

    In this paper, we talk about a dataset that we synthesized for teaching aspects of clinical data that may be tricky to understand in data science. This dataset is interesting because it’s derived from a multivariate distribution based on real patient data, modeled as a Bayesian network. Even though we knew the true marginals for the real data, the Bayesian network still needed a lot of fine-tuning.
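
    To give a feel for what sampling from a Bayesian network looks like, here is a minimal Python sketch. It is not the code behind cvdRiskData (that lives in the R package and its vignettes). The three-node network, the variable names, and every probability below are made up purely for illustration.

```python
import random

# Hypothetical toy network: age_group -> hypertension -> cvd.
# All probabilities are invented for illustration, not taken from the paper.
P_YOUNG = 0.4
P_HTN_GIVEN_AGE = {"young": 0.1, "old": 0.4}
P_CVD_GIVEN_HTN = {True: 0.3, False: 0.05}


def sample_patient(rng):
    """Draw one synthetic patient by sampling each parent before its child."""
    age = "young" if rng.random() < P_YOUNG else "old"
    htn = rng.random() < P_HTN_GIVEN_AGE[age]
    cvd = rng.random() < P_CVD_GIVEN_HTN[htn]
    return {"age_group": age, "hypertension": htn, "cvd": cvd}


def sample_cohort(n, seed=0):
    """Draw n synthetic patients with a fixed seed so the cohort is repeatable."""
    rng = random.Random(seed)
    return [sample_patient(rng) for _ in range(n)]


cohort = sample_cohort(10_000)
old = [p for p in cohort if p["age_group"] == "old"]
frac_old = len(old) / len(cohort)
htn_rate_old = sum(p["hypertension"] for p in old) / len(old)
print("fraction old:", frac_old)
print("hypertension rate among old:", htn_rate_old)
```

    Comparing the sampled rates against the probabilities you put in is a quick sanity check, and it is a small-scale version of checking a synthetic dataset against known marginals.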

    We’ve used this dataset for a couple of classes, and we’ve found that it helps highlight real issues in predictive modeling of clinical data. One of the largest is that most predictive models are based on a much older patient cohort (50+), which means that we don’t know much about how to predict cardiovascular risk in younger patients. Part of the teaching exercise is having the students choose a cohort of interest and then attempt to predict on that patient cohort.

    The data is currently available as an R package here, including vignettes about how the data was generated: https://github.com/laderast/cvdRiskData

    Notes on Open Data Science Conference West 2017

    I just came back from the Open Data Science Conference (ODSC) in San Francisco and I found it really stimulating and interesting. I learned a ton, met some great people working in very different fields, and overall found it quite worthwhile.

    Here are some of the highlights from my notes:


    scikit-learn intro Workshop and Advanced

    I admit that I am not really a Python person. But I am helping to develop some materials for an introductory workshop, and I found this workshop and its materials to be a very beginner-friendly introduction to scikit-learn and machine learning concepts, much like caret for R. All the slides and workshop materials are available at the above links.

    SparklyR Workshop

    I liked this workshop from John Mount of Win-Vector. It started out with a dplyr intro, and introduced us to the basics of Apache Spark, a cluster-computing framework designed to run very large queries and machine learning jobs. RStudio’s Edgar Ruiz managed to get us each an RStudio Pro instance running on AWS with all the required packages installed so we could test out the sparklyr package, which uses dplyr’s commands to run Spark jobs.

    In-Memory Computing Essentials for Data Scientists

    This was an introduction to Apache Ignite, which is a distributed, in-memory database that can be leveraged by different languages. The really interesting thing about Ignite is that it will colocate related data on the same cluster node, resulting in rapid queries within each node. I think this technology will become very important as we need more datasets to be openly accessible to compute on.


    These were the most interesting talks that I attended.

    Visually Explaining Statistical and Machine Learning Concepts

    This was a great talk by Mike Freedman about how he put together D3.js-based visualizations to explain some statistical concepts. I thought the explanation of his process (isolate specific ideas, identify data structures, leverage visualization algorithms) was really helpful. Check out the slides above. They’re very cool.

    The Wonder Twins: Data Science and Human Centered Design

    This was a really interesting talk about the interplay between data science and design in helping encourage a mobile money system in Tanzania. It was inspiring to see how they had both designers and data scientists embedded and looking at how the mobile payment system worked. One interesting example was doing network analysis of the Mobile Money Agents, who distribute cash. They targeted a highly influential group of these agents based on this analysis. Very cool.

    The People’s Data and The Deontology of Data Science

    I thought these were really interesting talks about the human side of data science. DJ Patil, who was chief data scientist under the Obama administration, talked about citizen-driven data projects and how they enabled a number of advances. The most interesting case was a parent who built an online community of people with a very rare disease so he could help his son, who has the condition.

    Igor Perisic (of LinkedIn) followed this with a talk about ethical issues in data science. In particular, he identified three different areas to concentrate on: 1) The Ethics of Data, 2) The Ethics of Algorithms, and 3) The Ethics of Practice. He concentrated on the recent New York Times article about using machine learning to identify potential reoffenders in the prison system. The lack of transparency in how the algorithm identifies potential reoffenders is a huge ethical problem.

    In all, I had an interesting time and I met lots of people in industry, which was a nice contrast to the academic side of things.

    Interesting useR 2017 Talks

    Since I didn’t get to go to useR 2017 this year, I’m compiling the interesting talks. This is an ongoing list.

    • https://user2017.sched.com/event/AxqM/automatically-archiving-reproducible-studies-with-docker
    • https://user2017.sched.com/event/Axq4/clouds-containers-and-r-towards-a-global-hub-for-reproducible-and-collaborative-data-science
    • https://user2017.sched.com/event/Axq9/scraping-data-with-rvest-and-purrr
    • https://user2017.sched.com/event/Axq1/using-the-alphabetr-package-to-determine-paired-t-cell-receptor-sequences
    • https://user2017.sched.com/event/AxqG/show-me-the-errors-you-didnt-look-for
    • https://user2017.sched.com/event/AxqR/community-based-learning-and-knowledge-sharing
    • https://user2017.sched.com/event/AxqT/r-based-computing-with-big-data-on-disk
    • https://user2017.sched.com/event/AxqA/codebookr-codebooks-in-r
    • https://user2017.sched.com/event/Axor/how-we-built-a-shiny-app-for-700-users Useful concepts: reactiveTrigger to force a rerender.
    • https://user2017.sched.com/event/AxsL/ensemble-packages-with-user-friendly-interface-an-added-value-for-the-r-community

    How to Not Be Afraid of Your Data

    I’m going to be giving a talk for the PDX RLang Meetup on July 11 called “How to Not Be Afraid of Your Data: Teaching EDA using Shiny”. Abstract below.

    Many graduate students in the basic sciences are afraid of data exploration and cleaning, which can greatly impact their downstream analysis results. By using a synthetic dataset, some simple dplyr commands, and a shiny dashboard, we teach graduate students how to explore their data and how to handle issues that can arise (missing values, differences in units). For this talk, we’ll run through a simple EDA example (combining two weight loss datasets) with a general data explorer in shiny that can be easily customized to teach specific EDA concepts.
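
    As a rough illustration of the issues the abstract mentions (missing values, differences in units), here is a minimal Python sketch. It is not taken from the talk materials, which use dplyr and shiny. The records, the field names, and the to_kg helper are all hypothetical.

```python
# Hypothetical records: one study reports weight in pounds, the other in kilograms.
study_a = [
    {"id": "a1", "weight": 150.0, "unit": "lb"},
    {"id": "a2", "weight": None, "unit": "lb"},  # missing value
]
study_b = [
    {"id": "b1", "weight": 70.0, "unit": "kg"},
    {"id": "b2", "weight": 82.5, "unit": "kg"},
]

KG_PER_LB = 0.45359237  # exact definition of the international pound


def to_kg(record):
    """Return the record with its weight converted to kilograms (None stays None)."""
    weight = record["weight"]
    if weight is None:
        kg = None
    elif record["unit"] == "lb":
        kg = weight * KG_PER_LB
    elif record["unit"] == "kg":
        kg = weight
    else:
        raise ValueError(f"unknown unit: {record['unit']}")
    return {"id": record["id"], "weight_kg": kg}


# Harmonize units first, then look at missingness before summarizing.
combined = [to_kg(r) for r in study_a + study_b]
missing_ids = [r["id"] for r in combined if r["weight_kg"] is None]
observed = [r["weight_kg"] for r in combined if r["weight_kg"] is not None]
mean_kg = sum(observed) / len(observed)

print("missing:", missing_ids)
print("mean weight (kg):", round(mean_kg, 2))
```

    Skipping the unit conversion would quietly mix pounds and kilograms into one average, which is exactly the kind of problem exploration is meant to catch.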

    Some Lessons We Learned Running Cascadia-R

    Well, the first Cascadia R Conference has come and gone. I have to say that it was super fun, and well attended (over 190 people!). I had a blast meeting and chatting with everyone. Hopefully, we showed newbies that R is learnable and others that there are lots more things to learn about R.

    The following is my attempt to document what we learned from organizing Cascadia-R. It’s not complete; I may add and subtract from it as I think of more things to say about the planning process.

    Decide the tone. Our goals with Cascadia-R were modest. We wanted to get a diverse group of R users together in a safe and encouraging environment. We wanted our workshops to be accessible to even beginners, and encourage them in the use of R.

    Part of meeting these goals is setting the tone. We really wanted to encourage all levels of R users to attend. All of our flyers, emails and promotional tweets encouraged beginners to come. We got help with making a Code of Conduct for the conference. Part of creating a supportive environment is encouraging diversity in both speakers and attendees. We did our best to reach out to current groups that encourage diversity, such as Women in Science Portland, and R-Ladies Global.

    We also offered diversity scholarships to encourage people from diverse backgrounds to attend, and made diversity part of our criteria for selecting talks.

    Start planning early. As junior faculty at OHSU, I’m lucky enough to be able to book facilities here, including the large learning studios where we held the conference. Having the venue secured early on made the remaining logistics of the conference much easier.

    Much like wedding planning, there are plenty of conference planning services out there that would be happy to take over aspects of your conference, for a fee. You can spend however much you want to on these things. However, I believe that such an approach is not financially responsible. I also feel that taking a more DIY/bespoke approach can make a conference more engaging (see csvconf). We tried to do most things ourselves (including design, promotion, talk submission, workshops, and registration/logistics).

    Iterate your budget. Think of a conference as a project with lots of linked dependencies. Your first plan is probably not going to be your final plan. Start a plan, iterate, realize that things are going to shift, and have a backup plan. What if registration is not going to pay for the venue rental fee? Talking to simpatico sponsors can take away much of the financial stress. In our case, the RStudio foundation and rOpenSci stepped up to contribute some money as a cushion.

    Remember, there are fixed costs (such as venue rental, and recording/streaming costs) and variable costs that scale with the number of attendees (food, badges, alcohol). Separate these out. When possible, pay off the fixed costs first, so that it’s easier to manage the variable costs.
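
    Separating the two kinds of costs also makes a quick break-even estimate possible. Here is a small Python sketch of that arithmetic. The function name and all of the dollar figures are hypothetical and are not our actual budget.

```python
import math


def break_even_attendees(fixed_costs, fee_per_attendee, variable_cost_per_attendee):
    """Smallest number of paid registrations that covers the fixed costs.

    Each registration contributes (fee - variable cost) toward the fixed costs.
    """
    margin = fee_per_attendee - variable_cost_per_attendee
    if margin <= 0:
        raise ValueError("Each registration must bring in more than its variable cost.")
    return math.ceil(fixed_costs / margin)


# Hypothetical figures: 3000 in venue and recording costs, a 50 registration fee,
# and 20 per person for food and badges.
needed = break_even_attendees(
    fixed_costs=3000, fee_per_attendee=50, variable_cost_per_attendee=20
)
print("registrations needed to break even:", needed)
```

    Sponsor money can be treated as a reduction in fixed_costs, which shows how directly a cushion lowers the number of registrations you need.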

    Again, who is your desired audience and can they afford your conference? We decided to make our conference as affordable as possible to encourage as many different kinds of people to attend. We initially wanted to make attendance free for students. The problem with free is that it is literally free. It has no value in the mind of a person who accepts free admission. So we decided to charge students a small fee just to emphasize that the conference has value.

    Talk with others who have done it. We were very clueless about much of the logistics side at OHSU. I managed to get through by talking with a number of people here (including Robin Champieux and Shannon McWeeney) who have done conferences here at OHSU. Thank you so much for your invaluable advice.

    Encourage each other and delegate. No one of us could have done all of the conference planning alone. Each of us took on various aspects of conference organization and brought in the others as support as needed. Some of us selected talks, some of us did design, and we all pitched in to get registration working as efficiently and quickly as possible.

    Our Slack channel on pdxdata.slack.com is full of our decisions. Slack was so useful as a planning mechanism that we only met online via Google Hangouts a few times, and only had two in-person planning sessions.

    Be Willing to Make Mistakes. Lord knows I made a bunch of mistakes when I made announcements and hosted the lightning sessions. However, I owned up to these mistakes, shrugged, and moved on. Improvising in the moment can be just as important as planning.

    Think about the future. What should the next Cascadia-R look like? I know it just happened, but we’re trying to envision what it would look like. Based on the feedback we’ve gotten so far, people really want more workshops!

    In a following post, I’m also going to talk about lessons I learned when Chester and I put on our tidyverse workshop.

    On Breadth and Depth in Your Academic Career

    I was talking with a student and they were complaining that when at conferences, they would try to inject other topics of interest (such as cooking) into discussions with colleagues. Unfortunately, one of the after effects of this was that they were looked at as “not a serious scientist”. There’s an expectation that a scientist must be all depth, only talking and thinking about their sub-field.

    As a cross-disciplinarian, I have to say that is hogwash. The genesis of so many creative ideas in science has happened because of cross-pollination across disciplines. For example, microwave technology might never have been invented without the intersection of disciplines. We know that the Arts Foster Scientific Success: a large number of Nobel laureates and National Academy members do art in some form or other. Root-Bernstein et al. theorize that

    “there exist functional connections between scientific talent and arts, crafts, and communications talents so that inheriting or developing one fosters the other.”

    Having breadth and depth enables you to make connections that no one else has. It is the hallmark of a curious and creative person. These kinds of people are desperately needed to push science in new directions.

    I have a parallel career in performance and improvisational music. Music, for me, is endlessly inspiring and has forced me out of my introverted shell. One of the reasons I took up cello is that I can play many roles: accompanist, rhythm, solo. This flexibility in playing music has translated to my flexibility in collaboration. Being able to adjust to new circumstances and improvise new ideas to explore is a critical component of being a responsible scientist. My background in improvisation has helped me pivot ideas. I have become less attached to dogmatic ideas. Many of my good ideas come from idle wondering about data that has captured my imagination. This is part of the reason why I teach students how to explore their data.

    So, the next time another scientist looks down at you for being a polymath, pity them. Their world and their ideas are not as rich as yours.

    Further Reading

    Fostering a Peer Mentoring Culture

    I realize that it has been an embarrassingly long time since I updated this blog. I had all sorts of grandiose plans for it, and I think my problem was that I was thinking too broadly, too pie-in-the-sky. I’m going to try to focus on short and informative blog posts.

    One of the things that I have been thinking about regarding graduate school is the idea of building a Peer Mentoring culture in our department. I believe that students should help and support each other, and we need to provide a forum to do that. Not just assign mentors, but provide a time and a place to do that.

    We try to foster a mentoring culture within our student group, BioData-Club. Students are free to talk about issues that concern them, especially about datasets, and are encouraged to share their experiences of software that they’ve used. I believe that we try to give students a psychologically safe place to talk about their issues with data. We try to make people feel like they’re not alone, and coach beginners so they can get over the hump.

    We’re now embarking on an experiment to reach even more people at OHSU, because we know there are lots of students who struggle with practical skills in data analysis. Our group is growing, and that’s exciting.

    I’m going to try and get everyone in our group to write a paper about Peer Mentoring Culture and how to encourage it in other schools.

    Surrogate Oncogene Paper is Published

    My dissertation paper, A Network-Based Model of Oncogenic Collaboration for Prediction of Drug Sensitivity, is now published! Here’s a lay summary:

    One outstanding issue in analyzing genomics in the context of personalized medicine is the incorporation of rare or infrequent genetic alterations (copy number alterations and somatic mutations) that are observed in individual patients. We hypothesize that these mutations may actually ‘collaborate’ with known oncogenes in the genesis of tumors through their interactions. In order to show this effect, we assess whether these interacting rare mutations cluster around known oncogenes and assess these mutational clusters, which we term surrogate oncogenes. We assess their statistical significance using a simple model of mutation. We show that surrogate oncogenes are predictive of drug sensitivity in breast cancer cell lines. Additionally, they are prevalent in three different cancer cohorts (Breast, Glioblastoma, and Bladder Cancer) from The Cancer Genome Atlas. Within the Breast Cancer and Bladder Cancer populations, surrogate oncogenes are predictive of overall patient survival. The chief strength of the surrogate oncogene approach is that it can be run at a single-patient level in comparison to other methods of assessing mutational significance.

    If you’re interested in learning more, you can check out the Surrogate Oncogene Explorer in order to understand the nature of surrogate oncogenes, and my R/Bioconductor Package on GitHub if you’d like to try out the analysis.

    There’s a follow-up paper that I’m working on that I’m very excited about. More news soon.

    Why Short-Order Bioinformatics Doesn't Work

    Unfortunately, many researchers look at computational biology and bioinformatics as a black-box: you put in data, and you get results out. The bioinformaticians and computational biologists are seen as mostly computer operators who push the button and not as true collaborators. One of my co-workers calls this “short-order” bioinformatics.

    There is great danger in simply pushing a button to get results. One type of analysis, Gene Set Enrichment Analysis (GSEA), is highly dependent on how mutations are incorporated into a gene set. If done carelessly, the results can be spurious. One paper that depended on GSEA was Dixson et al., Identification of gene ontologies linked to prefrontal–hippocampal functional coupling in the human brain, which was retracted. A single SNP was assigned to 8 genes and was thus overcounted. Their GSEA result of “synapse organization and biogenesis” was spurious due to this assignment.
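
    To make the overcounting problem concrete, here is a toy Python sketch. It is not GSEA and not the analysis from the retracted paper. The SNP names, gene names, and both counting functions are hypothetical, and they only show how one SNP annotated to 8 genes can inflate a gene-set tally.

```python
# Hypothetical annotation: one SNP lies in a region annotated to 8 genes.
snp_to_genes = {
    "rsA": ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8"],
    "rsB": ["G9"],
}
gene_set = {"G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8"}
significant_snps = ["rsA", "rsB"]


def naive_gene_hits(snps, mapping, members):
    """Count every (SNP, gene) pair that lands in the gene set."""
    return sum(1 for snp in snps for gene in mapping[snp] if gene in members)


def independent_snp_hits(snps, mapping, members):
    """Count each SNP at most once, however many member genes it touches."""
    return sum(1 for snp in snps if any(gene in members for gene in mapping[snp]))


naive = naive_gene_hits(significant_snps, snp_to_genes, gene_set)
independent = independent_snp_hits(significant_snps, snp_to_genes, gene_set)
print("naive hits:", naive)
print("independent signals:", independent)
```

    The naive tally treats a single association signal as eight separate pieces of evidence for the gene set, which is the sort of thing that can make an enrichment result look far stronger than it is.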

    There is a lot of impatience from collaborators when results are not immediate. Understandably, much of this work is done to support a grant and there are always looming deadlines. However, there is a lot of work between a request and well-executed computational results. Potential collaborators need to be aware of these steps.

    A well-executed workflow is thus essential for the computational results to be valid. This may include the following steps.

    • Mapping of identifiers for entities for each platform to the appropriate gene construct. In the case of the SNP paper, it was appropriate assignment of SNPs to genes. However, with systems biology approaches that integrate multiple omics types, this can include mapping protein isoforms to the mRNA transcripts if one is interested in the impact of alternative splicing. A clear strategy must be decided on and then executed.
    • Data management. Oftentimes, we need to work with the experimentalists who are executing the research in order to understand and identify potential confounders in the data. We do this by collecting and integrating metadata into our analysis, that is, data about how the experiments were executed. We need to identify technical issues such as batch effects, and scheduling time with the experimentalists is our best way of identifying these potential issues.
    • Flagging of potentially spurious samples. This part of the process requires exploratory data analysis on gross measurements used in the high-throughput platforms. For example, we may visualize boxplots of mean expression for each sample to see if the expression levels can be compared.
    • Selection of the appropriate statistical protocol given the experimental design. This may require a couple of back and forths between the computational biologist and the researcher. A good computational biologist never assumes anything about the data or design.
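
    The sample-flagging step in the list above can be sketched in a few lines of Python. This is one possible screening rule under stated assumptions, not a standard protocol. The expression values, sample names, and the threshold of 3 median absolute deviations are all hypothetical, and a flagged sample is a prompt to go look at boxplots and talk to the experimentalists, not an automatic exclusion.

```python
from statistics import mean, median

# Hypothetical expression values (for example, log-scale) for four samples.
expression = {
    "s1": [5.1, 6.0, 5.5, 5.8],
    "s2": [5.0, 6.2, 5.4, 5.9],
    "s3": [5.2, 5.9, 5.6, 5.7],
    "s4": [9.8, 10.5, 10.1, 10.3],
}


def flag_outlier_samples(expr, threshold=3.0):
    """Flag samples whose mean expression sits far from the median of sample means.

    Distance is measured in units of the median absolute deviation (MAD).
    """
    sample_means = {sample: mean(values) for sample, values in expr.items()}
    center = median(sample_means.values())
    mad = median(abs(m - center) for m in sample_means.values())
    if mad == 0:
        return []  # no spread among sample means, so nothing to scale against
    return sorted(
        sample
        for sample, m in sample_means.items()
        if abs(m - center) / mad > threshold
    )


flagged = flag_outlier_samples(expression)
print("samples to inspect:", flagged)
```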

    Without a well-mapped strategy of data cleaning, the results from any bioinformatics analysis may be suspect. A good bioinformatics collaborator will ask these questions and will not take no for an answer. Any information that you withhold from your collaborator will affect their analysis.

    In short, treating computational biology as a black-box is done at the researcher’s peril. Instead, a collaboration should be fostered. The best level of collaboration with computational biologists is to include them from the beginning, as part of the experimental design. This is obviously a greater level of commitment and time than simply considering them as a service core. However, the benefits and rewards are much greater at this level of collaboration.

    Interesting interview with the developer of statcheck

    Due to the usual postdoc busy-ness, I haven’t had the energy to update this blog as much as I would like, but I found this interview on Retraction Watch with Michèle B. Nuijten, the developer of the R package statcheck, fascinating. Her package essentially automates the checking of p-values reported in papers: it converts the papers from PDF to text, extracts the reported statistics, and checks whether the reported p-values are consistent with them. There was a lot of trial and error in parsing known formats for p-values, but now the package is available.
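
    Here is a minimal Python sketch of the general idea, assuming a deliberately simplified reporting format. It is not statcheck’s implementation or API. The regular expression, the function names, the tolerance, and the example sentence are all made up for illustration, and it only handles two-sided z tests.

```python
import math
import re

# Matches text such as "z = 1.96, p = .050" (an assumed, simplified format).
PATTERN = re.compile(r"\bz\s*=\s*(-?\d*\.?\d+)\s*,\s*p\s*=\s*(\d*\.\d+)")


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def check_reported_p_values(text, tolerance=0.005):
    """Recompute p from each reported z and compare it with the reported p."""
    results = []
    for z_str, p_str in PATTERN.findall(text):
        z = float(z_str)
        reported = float(p_str)
        recomputed = two_sided_p_from_z(z)
        results.append(
            {
                "z": z,
                "reported_p": reported,
                "recomputed_p": recomputed,
                "consistent": abs(recomputed - reported) <= tolerance,
            }
        )
    return results


# Hypothetical sentence with one consistent and one inconsistent result.
text = (
    "Group A differed from group B, z = 1.96, p = .050. "
    "A second effect was reported as z = 1.20, p = .010."
)
results = check_reported_p_values(text)
for r in results:
    print(r["z"], r["reported_p"], round(r["recomputed_p"], 3), r["consistent"])
```

    Even this toy version shows why the parsing is the hard part: the arithmetic is one line, while the pattern only recognizes a single way of writing the result.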

    I see a potentially really interesting master’s thesis in forensic bioinformatics in using the package to assess reproducibility of results in a field. Note that the student probably wouldn’t make any friends in high places, but it would be a potentially high-impact thesis.

    Somatic Mutations in Skin Paper

    This paper, High burden and pervasive positive selection of somatic mutations in normal human skin, is fascinating. It suggests that the mutational burden is much higher than we expected in skin cells due to UV exposure. In addition, subclones exist in the skin that are positively selected for oncogenes.

    It also makes me want to stock up on sunscreen.

    High Impact Factor Journals Have Higher Retraction Rates

    Very interesting New York Times article about the rise of fraud and retractions in High Impact Factor journals. The retraction rates for High IF journals (such as Science, Cell, and Nature) are much higher than those for lower IF journals.

    From the article:

    Journals with higher impact factors retract papers more often than those with lower impact factors. It’s not clear why. It could be that these prominent periodicals have more, and more careful, readers, who notice mistakes. But there’s another explanation: Scientists view high-profile journals as the pinnacle of success — and they’ll cut corners, or worse, for a shot at glory.

    I would say that this is sad, but this is a consequence of the currently terrible funding climate and unreasonable expectations of study sections. If study sections dismiss grant writers because of an unreasonable expectation of past productivity, then it shouldn’t be surprising that the drive to make oneself look productive actively encourages fraud to get ahead.
