5  Using JupyterLab on DNAnexus

5.1 Learning Objectives

  1. Explain the relationship between JupyterLab and a DNAnexus project
  2. Request the appropriate JupyterLab instance given your workflow requirements
  3. Download files and Jupyter Notebooks from a project into the JupyterLab environment
  4. Run a Jupyter Notebook in the JupyterLab environment and upload it back into the project storage
  5. Utilize and install software dependencies into the JupyterLab environment

5.2 Why JupyterLab?

If you are reading this, you are probably familiar with literate programming and notebook-based analyses.

Oftentimes, when building a data processing pipeline, training a machine learning model, or exploring datasets, the work only makes sense done interactively: transform data, visualize, repeat.

5.3 Use Cases for JupyterLab

flowchart TD
    A{Large Scale\nAnalysis?\n} -->|no|B[Single JupyterLab]
    A --->|yes|C[Spark JupyterLab]
Figure 5.1: The use cases for JupyterLab.
flowchart TD
    B[Single JupyterLab]
    B --->D[Python/R]
    B --->E[Stata]
    B --->F[Machine Learning\nGPU]
    B --->G[Image Processing\nGPU]
Figure 5.2: The use cases for single-node JupyterLab.
flowchart TD
    C[Spark JupyterLab]
    C --->H[HAIL]
    C --->I[GLOW]
    C --->J[HAIL/VEP]
Figure 5.3: The use cases for Spark JupyterLab.

It should be noted that these configurations are just starting points. When you have access to the worker, you’re able to install most software packages via Ubuntu methods (such as apt install) or by package installers such as pip (Python) or install.packages() (R).

We’ll learn in a little bit about Snapshotting, which lets you install packages once, save the environment as a JupyterLab image file, and launch future instances of JupyterLab with your software environment already installed.

5.4 Launching JupyterLab

From the command line, we can launch JupyterLab using the following command:

dx run dxjupyterlab -y -brief -ifeature="PYTHON_R"

If we have a snapshot in our project (Section 5.11), we can specify it with the -isnapshot argument:

dx run dxjupyterlab -y -brief -ifeature="PYTHON_R" -isnapshot=project-YYYYYY:file-XXXXXX

Take note of the job ID that is returned when you start your job. It forms the basis for the URL you’ll use to access the JupyterLab instance: job-ZZZZZZZ.dnanexus.cloud.

It will take a few minutes for JupyterLab to start up, even after the status of our job is Running. I usually grab a snack and then come back.
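As a small illustration, here is a hypothetical helper that builds the JupyterLab URL from a job ID. The `job-….dnanexus.cloud` hostname pattern comes from the text above; the `https://` scheme and the helper name are my assumptions.

```python
def jupyterlab_url(job_id):
    # Hypothetical helper: the hostname pattern is job-XXXXXX.dnanexus.cloud;
    # the https:// scheme is assumed here.
    return f"https://{job_id}.dnanexus.cloud"

print(jupyterlab_url("job-ZZZZZZZ"))
```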

5.5 Two kinds of Storage

We have talked about the multiple storage systems we need to contend with to be successful on the platform.

Let’s focus on the Project Storage and the temporary worker storage. When we work with JupyterLab, we need to contend with both.

flowchart LR
    A[1. Project Storage\nInput files] -->|dx download\ndxFUSE|C[2. Worker Storage\nOutput files]
    C --->|dx upload|A
  1. Project Storage (permanent). This is where our input files (such as VCF or BAM files) live. They are transferred to the Worker Storage with two methods: dx download or using the dxFUSE file system.
  2. Worker Storage (temporary). We take our input files and process them here to produce output files. Output files can only be transferred back to project storage with dx upload.

5.6 The two filesystems

The two filesystems are accessed with the two tabs on the left sidebar. The first is indicated by the folder icon, which represents the temporary storage.

The other tab represents the project storage for the project in which you are currently running your instance of JupyterLab.

5.7 How to not lose your work

The main reason I bring up these two filesystems is because of this: If you have started a notebook on the temporary worker, you need to upload it back into the project using dx upload.

In the words of Nintendo: Everything not saved will be lost.

The much safer way to work in a project is to use DNAnexus >> New Notebook in the JupyterLab menu which will create a new notebook in project storage. There is an autosave feature for these notebooks, but when in doubt, save often.

You can identify in the JupyterLab interface which notebooks are being accessed from project storage by the [DX] in their title.

Wait, I thought that files were immutable on the platform. What about notebooks?

If you know a little bit about the DNAnexus filesystem, you know that file objects are immutable, so you might be wondering how notebooks can be saved back into project storage.

The secret is that old versions of notebooks (including autosaved ones) get archived into a folder called .Notebook_archive/ in your project. These are all timestamped.

5.8 A Basic JupyterLab Workflow

Let’s integrate this knowledge by showing a basic notebook workflow in the JupyterLab app.

5.8.1 Download files to worker storage

import pandas as pd
dx download data/penguins.csv
You can also use a file ID (such as project-YYYYYYY:file-XXXXXXXX) instead of a file path.

Now the file should be available in our local worker storage.

5.8.2 Load files from worker storage

penguins = pd.read_csv("penguins.csv")
Now that the file is downloaded into our temporary storage, we can load it using pd.read_csv.

5.8.3 Do your Work

We can do any work we need to now that our data is loaded as a Pandas DataFrame. Here we do a .describe() to get some descriptive statistics on our numeric columns.
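As a minimal sketch of this step, here is .describe() run on a small stand-in DataFrame (the column names here are hypothetical; the real ones come from penguins.csv):

```python
import pandas as pd

# A small stand-in for the penguins data; these column names are
# hypothetical and only illustrate the workflow.
penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "bill_length_mm": [39.1, 47.5, 49.0, 36.7],
    "body_mass_g": [3750, 5200, 3800, 3450],
})

# .describe() summarizes the numeric columns by default:
# count, mean, std, min, quartiles, and max.
summary = penguins.describe()
print(summary)
```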

5.8.4 Save any results into project storage

dx upload penguins.csv --destination /users/tladeras/
Say we made a modification to penguins in our work. We can get that result back into project storage using dx upload. Note that with the --destination parameter, directories will be created on the platform if they do not yet exist.

5.9 An alternate way of transferring files: dxFUSE

import pandas as pd
penguins2 = pd.read_csv("/mnt/project/data/penguins.csv")
dx upload penguins2.csv
We can skip the download step by prepending /mnt/project/ to our data file’s path and using this new path (/mnt/project/data/penguins.csv) directly in pd.read_csv().
We then upload the modified file (penguins2.csv) using dx upload as usual.

I talk much more about dxFUSE in Section 6.8.

5.10 Installing Software in JupyterLab

Because we have sudo level access to our JupyterLab instance, we can install software on it with a number of methods:

  1. apt install. Make sure to run apt update before you try to install packages.
  2. Miniconda (conda install) - you’ll have to install Miniconda yourself using its installer script.
  3. pip install (Python)
  4. install.packages() (R)
  5. docker load - we can load Docker images into our JupyterLab instance and then run them with docker run.

Where possible, use tags (Docker) and version numbers to install specific package versions. For example, in R, base install.packages() does not let you pin a version, but the remotes package does:

remotes::install_version("ggplot2", version = "0.9.1")

For Python, we can specify version number with pip install with a double equals (==):

pip install open-cravat==2.4.2
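After installing, it is worth verifying which version actually ended up in the environment. A small sketch using Python’s standard library (installed_version is a hypothetical helper; pip is checked here only because it is almost always present):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(package):
    # Return the installed version string for a package,
    # or None if the package is not installed in this environment.
    try:
        return version(package)
    except PackageNotFoundError:
        return None

print(installed_version("pip"))
```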

When possible, I try to install software with either a shell script or a Jupyter Notebook. I like having a script because it makes very clear what is being installed and which versions.

5.11 Snapshotting: Saving the Software Environment for Reuse

Do we have to reinstall software every time we run JupyterLab?

It’s a relief that we don’t have to. We can save a JupyterLab Snapshot to project storage.

Once we’ve installed software via any of the above processes, we can use DNAnexus >> Create Snapshot to save our snapshot into project storage. This will be created in the .Notebook_Snapshot folder.

When we restart JupyterLab, we can specify this snapshot when we start it up.

Snapshotting dos and don’ts

It’s preferable not to save data in a snapshot, because you will be charged twice for storing the same data: once in the snapshot and once in project storage.

If you need data files to persist in an analysis, I recommend you get them back into project storage using dx upload.

Also, make sure to rename your snapshot so everyone in your group knows what is in it.

5.12 Working with Pheno Data in JupyterLab

There are two main ways to access the phenotype data in JupyterLab:

  1. Use dx extract_dataset using the dx toolkit (available in the JupyterLab terminal) and supply it with a record ID for our dataset/cohort and a list of entity.field_ids. This will extract the dataset to a CSV file in the temporary worker storage. You will need to do further decoding of the categorical data.
  2. Use the table-exporter app, and supply it with the record ID and a list of field titles. This will extract the dataset to permanent project storage. If you have specified -icoding_option to be REPLACE (the default value), you will not have to decode the categorical values.
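For the dx extract_dataset route, the decoding step can be done in pandas. Here is a sketch with made-up field names and codings (in practice, the codings come from the dataset’s metadata, not from a hand-written dictionary like this one):

```python
import pandas as pd

# Hypothetical extract: categorical fields come back as integer codes.
extracted = pd.DataFrame({
    "participant.sex": [1, 2, 1],
    "participant.smoker": [0, 1, 0],
})

# Hypothetical codings; real ones come from the dataset metadata.
codings = {
    "participant.sex": {1: "Male", 2: "Female"},
    "participant.smoker": {0: "No", 1: "Yes"},
}

# Replace each coded column with its decoded labels.
decoded = extracted.copy()
for column, mapping in codings.items():
    decoded[column] = decoded[column].map(mapping)

print(decoded)
```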

Table 5.1 shows a comparison of the two methods:

Table 5.1: Comparison of dx extract_dataset versus table-exporter.

Step               dx extract_dataset              table-exporter app
Run                In the JupyterLab terminal      As a platform app
Output location    Temporary JupyterLab storage    Permanent project storage
Format             CSV or TSV                      CSV (default); delimiter can be specified
Decoding needed?   Yes                             No, if -icoding_option is set to REPLACE

5.13 Extracting a case-control set

When we do