Figure: snapshot-building workflow. start ttyd → docker pull from registry → docker save to snapshot file → dx upload snapshot to project storage → terminate ttyd
10 Containers and Reproducibility
Make sure you are logged into the platform using dx login and that your course project is selected with dx select.
In your shell (either on your machine or in binder), make sure you’re in the bash_bioinfo_scripts/containers/ folder:
cd containers/
10.1 Learning Objectives
- Explain the benefits of using containers on DNAnexus for reproducibility and for batch processing
- Define the terms image, container, and snapshot in the context of Docker
- Create snapshots on RAP using docker pull and docker save with the ttyd app
- Utilize containers to batch process files on RAP
- Extend a docker image by installing software in interactive mode
- Build a docker image using Dockerfiles
10.2 Why Containers?
There is a replication crisis out there. Even given a script and the raw data, it is often difficult to replicate the results generated by a study.
Why is this difficult? Many others have talked about this, but one simple reason is that the results are tied to software and database versions.
This is the motivation for using containers - they are a way of packaging software that ‘freezes’ the software versions. If you provide the container that you used to generate the results, other people should be able to replicate your results even if they’re on a different operating system.
10.3 Terminology
In order to be unambiguous with our language, we’ll use the following definitions:
- Registry - collection of repositories that you pull docker images from. Example registries include DockerHub and Quay.io.
- Docker Image - what you download from a registry: the “recipe” for building the software environment. Stored in a registry. Use docker pull to get an image, docker commit to create a new image from a changed container, and docker push to send an image to a registry. You can also generate an image from a Dockerfile.
- Docker Container - the executable software environment running on a machine. Generated from an image with docker run.
- Snapshot File - a single archive file (.tar.gz) that contains the Docker image. Generated using docker save on an image. Also known as an image file on the platform.
10.4 Building Docker Snapshot Files on the DNAnexus platform
10.4.1 The Golden Rule of Docker and Batch Analysis
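DockerHub limits how many images each user can pull (200 pulls/day/user at the time of writing). You will run into this limit often if every job pulls the image directly from its registry URL.
So, if you are processing more than 200 files (or running more than 200 jobs), you should save the docker image into platform storage as a snapshot file and have your jobs load it from there.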
Let’s talk about the basic snapshot building process.
10.4.2 Be Secure
Before we get started: security is always a concern when running Docker images. The docker group has elevated status on a system, so we need to be careful that the containers we run aren’t introducing any system vulnerabilities.
These concerns matter most when running containers that are web servers or part of a web stack, but they are also worth thinking about when running jobs on the cloud.
Here are some guidelines to think about when you are working with a container.
- Use vendor-specific Docker Images when possible.
- Use container scanners to spot potential vulnerabilities. DockerHub has a vulnerability scanner that scans your Docker images for potential vulnerabilities.
- Avoid kitchen-sink images. One issue is when an image is built on top of many other images. It makes it really difficult to plug vulnerabilities. When in doubt, use images from trusted people and organizations.
10.4.3 The Basic Snapshot Building Process
10.4.4 Building Snapshot Files in ttyd
Up until now, we have been using our own machine or the binder shell for doing our work.
We’re going to pull up a web-enabled shell on a DNAnexus worker with the ttyd app. ttyd is useful because:
- docker is already installed, so we can docker pull our image and docker save our snapshot to the ttyd instance.
- It’s much faster to transfer our snapshot file back into project storage with dx upload.
To open ttyd, open the Tool Library under Tools and select your project.
10.4.5 Pull your image from a registry
docker pull quay.io/biocontainers/samtools:1.15.1--h1170115_0
On your ttyd instance, do a docker pull to pull your image from the registry. Note that we’re pulling samtools from quay.io here, from the biocontainers user.
We’re also specifying a version tag (1.15.1--h1170115_0) to tie our samtools to a specific version. This is important: if you omit the tag, docker pull defaults to the latest tag, which is not tied to a specific version and can change over time. So make sure to tie your image to a specific version.
When you’re done pulling the docker image, try out the docker images command.
docker images
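The advice above about pinning a version can be turned into a quick check before pulling. The following is a minimal sketch, not part of Docker itself: has_pinned_tag is a hypothetical helper name, and it only inspects the text of the image reference, treating a missing tag or the latest tag as unpinned.

```shell
# has_pinned_tag (hypothetical helper): succeed only if an image reference
# names an explicit tag other than "latest", or an exact sha256 digest.
has_pinned_tag() {
  local ref="$1"
  # A digest (name@sha256:...) identifies one exact image.
  case "$ref" in
    *@sha256:*) return 0 ;;
  esac
  # Keep only the last path component so a registry port such as
  # localhost:5000/samtools is not mistaken for a tag.
  local last="${ref##*/}"
  case "$last" in
    *:latest) return 1 ;;   # "latest" moves over time
    *:?*)     return 0 ;;   # explicit tag such as 1.15.1--h1170115_0
    *)        return 1 ;;   # no tag: docker pull would default to latest
  esac
}

image="quay.io/biocontainers/samtools:1.15.1--h1170115_0"
if has_pinned_tag "$image"; then
  echo "pinned: $image"
else
  echo "not pinned to a version: $image" >&2
fi
```

A check like this could sit at the top of a script that pulls images, so that an unpinned reference is caught before any snapshot is built from it.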
10.4.6 Try your docker image out
Now that we have our docker image downloaded, we can test it out by running samtools --help. This should give us the help message.
docker run quay.io/biocontainers/samtools:1.15.1--h1170115_0 samtools --help
10.4.7 Save your docker image as a snapshot
Now that we’ve pulled the image, we are going to save it as a snapshot file using docker save. We pipe the output of docker save into gzip to save it as samtools_image.tar.gz.
docker save quay.io/biocontainers/samtools:1.15.1--h1170115_0 | gzip > samtools_image.tar.gz
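Before uploading, it can be worth confirming that the snapshot was written completely, since an interrupted save leaves a truncated file. The following is a minimal sketch, assuming gzip and tar are available on the ttyd worker. check_snapshot is a hypothetical helper name, and it only tests that the file is a readable gzip-compressed tar archive, not that Docker can load it.

```shell
# check_snapshot (hypothetical helper): verify that a snapshot file exists,
# is valid gzip data, and contains a readable tar archive.
check_snapshot() {
  local snapshot="$1"
  if [ ! -s "$snapshot" ]; then
    echo "missing or empty: $snapshot" >&2
    return 1
  fi
  if ! gzip -t "$snapshot" 2>/dev/null; then
    echo "not valid gzip data: $snapshot" >&2
    return 1
  fi
  if ! tar -tzf "$snapshot" >/dev/null 2>&1; then
    echo "not a readable tar archive: $snapshot" >&2
    return 1
  fi
  echo "ok: $snapshot"
}

# Only run the check when the snapshot from the step above is present.
if [ -e samtools_image.tar.gz ]; then
  check_snapshot samtools_image.tar.gz
fi
```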
10.4.8 Upload your snapshot
Now we can get our image back into project storage. We’ll create a folder called images/ with dx mkdir and then use dx upload to get our snapshot file into the images/ folder.
dx mkdir images/
dx upload samtools_image.tar.gz --destination images/
10.4.9 Important: make sure to terminate your ttyd instance!
One thing to remember is that there is no timeout associated with ttyd. You will get a reminder email after it has been open for 24 hours, but you will get no warning after that.
So make sure to use dx terminate or terminate the ttyd job under the Manage tab.
10.5 Using Docker with Swiss Army Knife
Now that we’ve built our Docker snapshot, let’s use it in Swiss Army Knife.
Swiss Army Knife has two separate inputs associated with Docker:
- -iimage_file - This is where you put the snapshot file (such as samtools_image.tar.gz)
- -iimage - This is where you’d put the Docker URL (such as quay.io/ucsc_cgl/samtools)
So, let’s run a samtools job using our Docker snapshot.
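# Swiss Army Knife runs cmd inside the container loaded from image_file, so no docker run is needed.
# Single quotes stop the local shell from expanding ${in_prefix} before the job is submitted.
dx run app-swiss-army-knife \
  -iimage_file="images/samtools_image.tar.gz" \
  -iin="data/NA12878.bam" \
  -icmd='samtools stats * > ${in_prefix}.stats.txt'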
The main thing that has been changed here is that we’ve added the -iimage_file input to our dx run statement.
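To batch process several files with the same snapshot, one option is to generate one dx run command per input file and review the commands before submitting them. The following is a rough sketch under several assumptions. print_sak_jobs is a hypothetical helper, and data/NA12891.bam is a made-up second input. Swiss Army Knife is assumed to run the command inside the container loaded from image_file, and the output name is derived locally from each BAM file name.

```shell
# print_sak_jobs (hypothetical helper): print one dx run command per BAM path.
# Usage: print_sak_jobs <snapshot path in project> <bam> [<bam> ...]
# Nothing is submitted here; the commands are only echoed for review.
print_sak_jobs() {
  local snapshot="$1"
  shift
  local bam prefix
  for bam in "$@"; do
    # Name each output after its input, e.g. NA12878.bam -> NA12878.stats.txt
    prefix="$(basename "$bam" .bam)"
    printf 'dx run app-swiss-army-knife -iimage_file="%s" -iin="%s" -icmd="samtools stats * > %s.stats.txt" -y --brief\n' \
      "$snapshot" "$bam" "$prefix"
  done
}

# data/NA12891.bam is an illustrative file name, not one from the course project.
print_sak_jobs "images/samtools_image.tar.gz" "data/NA12878.bam" "data/NA12891.bam"
```

Piping the printed commands to bash would submit the jobs. The -y flag skips the confirmation prompt and --brief prints only the job ID, which keeps the output easy to collect when many jobs are launched.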
10.6 Extending a Docker Image
One thing that you might do is extend a Docker image by adding additional software. You can do this by opening up an interactive mode and installing within the container.
What is interactive mode? When you pull a docker image in your ttyd session (Section 10.4.4), you can issue a docker run command with these options:
docker run -it ubuntu:18.04 /bin/bash
It will open up a bash shell in the container.
10.6.1 Pulling a Base Image
We’ll start out with the official ubuntu 18.04 image in our ttyd session:
docker pull ubuntu:18.04
docker images
10.6.2 Open up interactive mode
In ttyd, now enter an interactive session:
docker run -it ubuntu:18.04 /bin/bash
If it works, you will open up a bash prompt in the container.
You’ll know you’re in the container if you do an ls and your filesystem looks different.
10.6.3 Install Software
Now, let’s install EMBOSS (European Molecular Biology Open Software Suite), which is a suite of string utilities for working with genomic data. If you look at the EMBOSS link, you will see that you can install it via apt install, which is available by default in the ubuntu container.
apt update && apt upgrade
apt install emboss gzip -y
10.6.4 Exit Container
Now exit from your container’s interactive mode:
exit
You’ll be back at the normal ttyd prompt.
10.6.5 docker commit/docker save your new snapshot file
We created a new container when we installed everything. We’ll need to find its ID in ttyd.
docker ps -a
The output lists our new container and its ID. We can use this ID to create a new image from the container with docker commit, and then write that image to a snapshot file with docker save:
docker commit <container_id> emboss:6.6.0
docker save emboss:6.6.0 | gzip > emboss.tar.gz
dx upload emboss.tar.gz --destination images/
10.6.6 Other uses of Interactive Mode
Docker’s interactive mode is really helpful for testing out scripts and making sure they are reproducible.
If I have a one-off analysis, it may be faster for me to just open up ttyd and use docker run to open up interactive mode, and do work with a container.
10.7 Making Dockerfiles
The other way to build image files is to use a Dockerfile. A Dockerfile is a recipe for installing software and its dependencies.
Let’s take a look at a Dockerfile. By convention, it lives in its own folder and is named Dockerfile:
FROM ubuntu:18.04
# Install build tools and wget, then clean the apt cache to keep the image small
RUN apt-get update && \
    apt-get install -y build-essential wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
# Install Miniconda into /opt/conda and put it on the PATH
ENV CONDA_DIR=/opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
    /bin/bash ~/miniconda.sh -b -p $CONDA_DIR
ENV PATH=$CONDA_DIR/bin:$PATH
# Install plink and samtools with conda (-y answers the confirmation prompt, which a non-interactive build cannot)
RUN conda install -y -c "bioconda/label/cf201901" plink
RUN conda install -y -c "bioconda/label/cf201901" samtools
We can build the Docker image in our directory using:
docker build . -t gatk_sam_plink:0.0.1
When it’s done, we can then make sure it’s been built by using
docker images
And we can use it like any other image.
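An image built from a Dockerfile can be turned into a snapshot the same way as a pulled one. The sketch below wraps the docker save and dx upload steps from earlier in the chapter. snapshot_name and save_and_upload are hypothetical helper names, and the naming scheme (image name and tag joined by an underscore) is just one possible convention. It also assumes the destination folder already exists in the project.

```shell
# snapshot_name (hypothetical helper): derive a file name from an image
# reference, e.g. gatk_sam_plink:0.0.1 -> gatk_sam_plink_0.0.1.tar.gz
snapshot_name() {
  local ref="$1"
  local last="${ref##*/}"               # drop any registry and user prefix
  printf '%s.tar.gz\n' "$(printf '%s' "$last" | tr ':' '_')"
}

# save_and_upload (hypothetical helper): save an image as a gzipped snapshot
# and upload it to a project folder (images/ unless another is given).
save_and_upload() {
  local image="$1"
  local dest="${2:-images/}"
  local snapshot
  snapshot="$(snapshot_name "$image")"
  docker save "$image" | gzip > "$snapshot"
  dx upload "$snapshot" --destination "$dest"
}

# Show the file name that would be used for the image built above.
snapshot_name "gatk_sam_plink:0.0.1"
# On the ttyd worker, the full step would be: save_and_upload "gatk_sam_plink:0.0.1"
```

Keeping the tag in the file name means two versions of the same image can sit side by side in project storage without overwriting each other.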
10.8 Going Further with Docker
Now that you know how to build a snapshot file, you’ve also learned another step in building apps: specifying software dependencies. You can use these snapshot files to specify executables in your app.
You can also use these snapshot files in your WDL workflow.
10.9 What you learned in this chapter
- How containers enable reproducibility
- Defined specific container terminology
- Created snapshot files using ttyd
- Used these snapshot files with Swiss Army Knife
- How to extend a docker image by installing new software