```{mermaid}
flowchart LR
    A["1. Project Storage<br>Input files"] -->|"dx download /<br>dxFUSE"| C["2. Worker Storage<br>Output files"]
    C -->|"dx upload"| A
```
5 Using JupyterLab on DNAnexus
5.1 Learning Objectives
- Explain the relationship between JupyterLab and a DNAnexus project
- Request the appropriate JupyterLab instance given your workflow requirements
- Download files and Jupyter Notebooks from a project into the JupyterLab environment
- Run a Jupyter Notebook in the JupyterLab environment and upload it back into the project storage
- Install and use software dependencies in the JupyterLab environment
5.2 Why JupyterLab?
If you are reading this, you are probably familiar with literate programming and notebook-based analyses.
Oftentimes, when we are building a data processing pipeline, training a machine learning model, or exploring a dataset, the work only makes sense done interactively: transform data, visualize, repeat.
5.3 Use Cases for JupyterLab
It should be noted that these configurations are just starting points. When you have access to the worker, you can install most software packages via Ubuntu methods (such as `apt install`) or with package installers such as `pip` (Python) or `install.packages()` (R).
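For instance, here are a couple of one-liners you might run in the JupyterLab terminal; the package names (`tree`, `pandas`) are just illustrative:

```bash
# Refresh the package index, then install a system tool via apt
sudo apt update && sudo apt install -y tree

# Install a Python package with pip
pip install pandas
```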
We’ll learn in a little bit about snapshotting (Section 5.11), which will let you install packages once, save the environment as a JupyterLab image file, and launch future instances of JupyterLab with your software environment already installed.
5.4 Launching JupyterLab
From the command line, we can launch JupyterLab using the following command:
```bash
dx run dxjupyterlab -y --brief -ifeature="PYTHON_R"
```
If we have a snapshot in our project (Section 5.11), we can specify it with the `-isnapshot` argument:

```bash
dx run dxjupyterlab -y --brief -ifeature="PYTHON_R" -isnapshot=project-YYYYYY:file-XXXXXX
```
Take note of the job-id that is returned when you start your job: it is the basis for the URL you’ll use to access the JupyterLab instance, such as job-ZZZZZZZ.dnanexus.cloud.
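Because `--brief` makes `dx run` print just the job-id, you can capture it in the shell and build the URL yourself; a small sketch:

```bash
# Launch JupyterLab; --brief prints only the job-id
job_id=$(dx run dxjupyterlab -y --brief -ifeature="PYTHON_R")
echo "JupyterLab will be available at: https://${job_id}.dnanexus.cloud"
```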
It will take a few minutes for JupyterLab to start up, even after the status of our job is Running. I usually grab a snack and then come back.
5.5 Two kinds of Storage
We have talked about the multiple storage systems we need to contend with to be successful on the platform.
Let’s focus on the project storage and the temporary worker storage. When we work with JupyterLab, we need to contend with both.
- Project Storage (permanent). This is where our input files (such as VCF or BAM files) live. They are transferred to worker storage by one of two methods: `dx download` or the `dxFUSE` file system.
- Worker Storage (temporary). We take our input files and process them here to produce output files. Because the dxFUSE mount is read-only, we can only use `dx upload` to transfer our output files back to project storage. A minimal round trip is sketched below.
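The file names here are placeholders:

```bash
# Project storage -> worker storage (input)
dx download data/input.vcf.gz

# ... process the file on the worker ...

# Worker storage -> project storage (output)
dx upload output.vcf.gz
```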
5.6 The two filesystems
The two filesystems are accessed with two tabs on the left sidebar. The first, indicated by the folder icon, represents the temporary worker storage.
The other tab represents the storage of the project in which you are currently running your instance of JupyterLab.
5.7 How to not lose your work
The main reason I bring up these two filesystems is this: if you have started a notebook on the temporary worker, you need to upload it back into the project using `dx upload`.
In the words of Nintendo: Everything not saved will be lost.
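For example, to save a notebook that lives on the worker back to the project (the notebook name and folder are hypothetical):

```bash
# Save the notebook in JupyterLab first, then copy it to project storage
dx upload my_analysis.ipynb --destination /notebooks/
```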
The much safer way to work in a project is to use DNAnexus >> New Notebook in the JupyterLab menu, which will create a new notebook directly in project storage. There is an autosave feature for these notebooks, but when in doubt, save often.
You can identify in the JupyterLab interface which notebooks are being accessed from project storage by the `[DX]` in their title.
5.8 A Basic JupyterLab Workflow
Let’s integrate this knowledge by showing a basic notebook workflow in the JupyterLab app.
5.8.1 Download files to worker storage
```python
import pandas as pd
```

```bash
dx download data/penguins.csv
```

You can also use a file-id (such as `project-YYYYYYY:file-XXXXXXXX`) here instead of a file path.
Now the file should be available in our local worker storage.
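In the file-id case, the command looks like this (the IDs are placeholders):

```bash
dx download project-YYYYYYY:file-XXXXXXXX
```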
5.8.2 Load files from worker storage
```python
penguins = pd.read_csv("penguins.csv")
```

Now that the file is downloaded into our temporary storage, we can load it using `pd.read_csv()`.
5.8.3 Do your Work
```python
penguins.describe()
```

We can do any work we need to now that our data is loaded as a Pandas DataFrame. Here we call `.describe()` to get some descriptive statistics on our numeric columns.
5.8.4 Save any results into project storage
```bash
dx upload penguins.csv --destination /users/tladeras/
```

Say we made a modification to `penguins` in our work. We can get that result back into project storage using `dx upload`. Note that with the `--destination` parameter, directories will be created on the platform if they do not yet exist.
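One step that is implicit above: the modified DataFrame has to be written back to the worker's disk before it can be uploaded. A minimal sketch:

```python
# Write the modified DataFrame to disk so dx upload can see it
penguins.to_csv("penguins.csv", index=False)
```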
5.9 An alternate way of transferring files: dxFUSE
```python
import pandas as pd

penguins2 = pd.read_csv("/mnt/project/data/penguins.csv")  # (1)
penguins2.describe()
penguins2.to_csv("penguins2.csv")
```

```bash
dx upload penguins2.csv  # (2)
```

1. We can skip two steps by prepending `/mnt/project/` to our datafile's path and using this new path (`/mnt/project/data/penguins.csv`) directly in `pd.read_csv()`.
2. We upload the modified file (`penguins2.csv`) using `dx upload` as usual.
I talk much more about dxFUSE in Section 6.8.
5.10 Installing Software in JupyterLab
Because we have sudo-level access to our JupyterLab instance, we can install software on it with a number of methods:

- `apt install`. Make sure to run `apt update` before you try to install packages.
- Miniconda (`conda install`). You'll have to install Miniconda from its installer script first.
- `pip install` (Python).
- `install.packages()` (R).
- `docker load`. We can load Docker images into our JupyterLab instance and then run them.
Where possible, use tags (Docker) and version numbers to install specific packages. For R, base `install.packages()` cannot pin a version; one option is `remotes::install_version()`:

```r
# Install a specific version of ggplot2 (requires the remotes package)
remotes::install_version("ggplot2", version = "0.9.1")
```
For Python, we can specify a version number with `pip install` using a double equals (`==`):

```bash
pip install open-cravat==2.4.2
```
When possible, I try to install software with either a shell script or a Jupyter Notebook. I like having a script to do this because it makes very clear what is being installed and at which versions.
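As an illustration, a small install script might look like the following; the system package is a placeholder, and only the pinned `pip install` comes from the example above:

```bash
#!/usr/bin/env bash
# Hypothetical install script: fail fast, pin versions for reproducibility
set -euo pipefail

sudo apt update
sudo apt install -y samtools        # placeholder system package

pip install open-cravat==2.4.2      # pinned version, as above
```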
5.11 Snapshotting: Saving the Software Environment for Reuse
Do we have to reinstall software every time we run JupyterLab?
It’s a relief that we don’t have to. We can save a JupyterLab Snapshot to project storage.
Once we’ve installed software via any of the above processes, we can use DNAnexus >> Create Snapshot to save our snapshot into project storage. This will be created in the `.Notebook_Snapshot` folder.
When we restart JupyterLab, we can specify this snapshot with the `-isnapshot` argument (Section 5.4).
5.12 Working with Pheno Data in JupyterLab
There are two main ways to access the phenotype data in JupyterLab:
- Use `dx extract_dataset` from the dx toolkit (available in the terminal in JupyterLab), supplying it with a record-id for our dataset/cohort and a list of `entity.field_id`s; a sketch follows this list. This will extract the dataset to a CSV file that is available in the temporary storage. You will need to do further decoding of the categorical data.
- Use the `table-exporter` app, supplying it with the record-id and a list of field titles. This will extract the dataset to the permanent project storage. If you have specified `-icoding_option` to be `REPLACE` (the default value), you will not have to decode the categorical values.
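Here is a sketch of the first method, run in the JupyterLab terminal; the record-id and field names are placeholders, and the exact flags may vary with your dx-toolkit version:

```bash
# Extract selected fields from a dataset/cohort record into a CSV
dx extract_dataset record-XXXXXXXX \
  --fields "participant.participant_id,participant.age" \
  --output pheno.csv
```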
Table 5.1 compares the two methods:
| Step | `dx extract_dataset` | `table-exporter` app |
|---|---|---|
| Run | In the JupyterLab terminal | As a platform application |
| Output location | Temporary JupyterLab storage | Permanent project storage |
| Format | CSV or TSV | CSV (default); a different delimiter can be specified |
| Decoding needed? | Yes | No, if `-icoding_option` is set to `REPLACE` |
5.13 Extracting a case-control set
When we do