10 Containers and Reproducibility
Make sure you are logged into the platform using dx login and that your course project is selected with dx select.
In your shell (either on your machine or in binder), make sure you’re in the bash_bioinfo_scripts/containers/ folder:
cd containers/
10.1 Learning Objectives
- Explain the benefits of using containers on DNAnexus for reproducibility and for batch processing
- Define the terms image, container, and snapshot in the context of Docker
- Create snapshots on RAP using docker pull and docker save with the ttyd app
- Utilize containers to batch process files on RAP
- Extend a docker image by installing software within interactive mode
- Build a docker image using Dockerfiles
10.2 Why Containers?
There is a replication crisis out there. Even given a script and the raw data, it is often difficult to replicate the results generated by a study.
Why is this difficult? Many others have talked about this, but one simple reason is that the results are tied to software and database versions.
This is the motivation for using containers - they are a way of packaging software that ‘freezes’ the software versions. If you provide the container that you used to generate the results, other people should be able to replicate your results even if they’re on a different operating system.
10.3 Terminology
In order to be unambiguous with our language, we’ll use the following definitions:
- Registry - a collection of repositories that you pull docker images from. Example registries include DockerHub and Quay.io.
- Docker Image - what you download from a registry - the “recipe” for building the software environment. Stored in a registry. Use docker pull to get an image, docker commit to create a new image from a modified container, and docker push to send an image to a registry. You can also generate an image from a Dockerfile.
- Docker Container - the executable software environment installed on a machine. Runnable. Generate with docker run from an image.
- Snapshot File - a single archive file (.tar.gz) that contains the Docker image. Generate using docker save on an image. Also known as an image file on the platform.
10.4 Building Docker Snapshot Files on the DNAnexus platform
10.4.1 The Golden Rule of Docker and Batch Analysis
DockerHub has a pull limit of 200 pulls/day/user. You will hit this limit often if every job pulls the image by its URL.
So, if you are processing more than 200 files (or jobs), you should save the docker image into platform storage as a snapshot file.
Let’s talk about the basic snapshot building process.
10.4.2 Be Secure
Before we get started, security is always a concern when running Docker images. The docker group has elevated status on a system, so we need to be careful that the containers we run aren’t introducing any system vulnerabilities.
This is mostly important when running containers that are web servers or part of a web stack, but it is also important to think about when running jobs on the cloud.
Here are some guidelines to think about when you are working with a container.
- Use vendor-specific Docker Images when possible.
- Use container scanners to spot potential vulnerabilities. DockerHub has a vulnerability scanner that scans your Docker images for potential vulnerabilities.
- Avoid kitchen-sink images. One issue is when an image is built on top of many other images. It makes it really difficult to plug vulnerabilities. When in doubt, use images from trusted people and organizations.
10.4.3 The Basic Snapshot Building Process
10.4.4 Building Snapshot Files in ttyd
Up until now, we have been using our own machine or the binder shell for doing our work.
We’re going to pull up a web-enabled shell on a DNAnexus worker with the ttyd app. ttyd is useful because:
- docker is already installed, so we can docker pull our image and docker save our snapshot to the ttyd instance.
- It’s much faster to transfer our snapshot file back into project storage with dx upload.
To open ttyd, open the Tool Library under Tools and select your project.
10.4.5 Pull your image from a registry
docker pull quay.io/biocontainers/samtools:1.15.1--h1170115_0
On your ttyd instance, do a docker pull to pull your image from the registry. Note that we’re pulling samtools from quay.io here, from the biocontainers user.
We’re also specifying a version tag - the 1.15.1--h1170115_0 - to tie our samtools to a specific version. This is important: when no tag is given, docker pull defaults to the latest tag, which is not tied to a specific version. So make sure to tie your image to a specific version.
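To make the parts of an image reference concrete, here is a minimal sketch in plain bash that splits the reference used in this section and checks that it carries an explicit version tag. The function name is_pinned is hypothetical, and the parsing is deliberately simplified: it assumes a registry/repository:tag form and does not handle digests (@sha256:...).

```shell
# Hypothetical helper: succeed only when an image reference has an explicit,
# non-"latest" tag. Simplified parsing; digests (@sha256:...) are not handled.
is_pinned() {
  local ref="$1"
  local name="${ref##*/}"   # last path component, e.g. samtools:1.15.1--h1170115_0
  case "$name" in
    *:latest) return 1 ;;   # explicit but floating tag
    *:*)      return 0 ;;   # explicit version tag
    *)        return 1 ;;   # no tag, so docker pull would default to latest
  esac
}

image_ref="quay.io/biocontainers/samtools:1.15.1--h1170115_0"

registry="${image_ref%%/*}"     # text before the first slash
remainder="${image_ref#*/}"     # text after the first slash
repository="${remainder%%:*}"   # text before the colon
tag="${remainder##*:}"          # text after the colon

echo "registry:   $registry"
echo "repository: $repository"
echo "tag:        $tag"

if is_pinned "$image_ref"; then
  echo "pinned to a specific version"
else
  echo "not pinned: add a version tag before pulling"
fi
```

A check like this could sit at the top of a batch script so that an unpinned image is caught before any jobs are launched.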
When you’re done pulling the docker image, try out the docker images command.
docker images
10.4.6 Try your docker image out
Now that we have our docker image downloaded, we can test it out by running samtools --help. This should give us the help message.
docker run quay.io/biocontainers/samtools:1.15.1--h1170115_0 samtools --help
10.4.7 Save your docker image as a snapshot
Now that we’ve pulled the image, we are going to save it as a snapshot file using docker save. We pipe the output of docker save into gzip to save it as samtools_image.tar.gz.
docker save quay.io/biocontainers/samtools:1.15.1--h1170115_0 | gzip > samtools_image.tar.gz
10.4.8 Upload your snapshot
Now we can get our image back into project storage. We’ll create a folder called images/ with dx mkdir and then use dx upload to get our snapshot file into the images/ folder.
dx mkdir images/
dx upload samtools_image.tar.gz --destination images/
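The pull, save, and upload steps from this section can be collected into one script. The following is a sketch rather than a definitive implementation: the run and build_snapshot functions and the DRY_RUN switch are hypothetical names introduced here, and dx mkdir is called with -p so that an existing images/ folder is not an error. By default the script only prints the commands, so it can be read through safely before anything is executed.

```shell
# Dry-run by default: set DRY_RUN=0 on the ttyd instance to execute for real.
DRY_RUN="${DRY_RUN:-1}"

# Print a command line in dry-run mode, otherwise execute it with bash.
# pipefail makes "docker save ... | gzip" fail if docker save fails.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    bash -o pipefail -c "$*"
  fi
}

# Hypothetical wrapper around the pull / save / upload steps.
# Arguments: image reference, snapshot file name, destination folder.
build_snapshot() {
  local image="$1" snapshot="$2" dest="$3"
  run "docker pull $image" &&
  run "docker save $image | gzip > $snapshot" &&
  run "dx mkdir -p $dest" &&
  run "dx upload $snapshot --destination $dest"
}

build_snapshot "quay.io/biocontainers/samtools:1.15.1--h1170115_0" \
  "samtools_image.tar.gz" "images/"
```

Chaining the steps with && stops the script at the first failure, so a failed pull never produces an empty snapshot in project storage.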
10.4.9 Important: make sure to terminate your ttyd instance!
One thing to remember is that there is no timeout associated with ttyd. You will get a reminder email once it has been open for 24 hours, but you will get no warning after that.
So make sure to use dx terminate or terminate the ttyd job under the Manage tab.
10.5 Using Docker with Swiss Army Knife
Now that we’ve built our Docker snapshot, let’s use it in Swiss Army Knife.
Swiss Army Knife has two separate inputs associated with Docker:
- -iimage_file - This is where you put the snapshot file (such as samtools_image.tar.gz)
- -iimage - This is where you’d put the Docker URL (such as quay.io/ucsc_cgl/samtools)
So, let’s run a samtools job using our Docker snapshot.
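# Swiss Army Knife runs the command inside the image, so no "docker run" is needed.
# Single quotes keep ${in_prefix} from being expanded by the local shell.
dx run app-swiss-army-knife \
  -iimage_file="images/samtools_image.tar.gz" \
  -iin="data/NA12878.bam" \
  -icmd='samtools stats * > ${in_prefix}.stats.txt'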
The main thing that has changed here is that we’ve added the -iimage_file input to our dx run statement.
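For batch processing, the same snapshot file can be reused by every job, which is what avoids repeated pulls from the registry. Below is a hedged sketch of a loop that submits one Swiss Army Knife job per BAM file. It assumes that Swiss Army Knife runs the -icmd string inside the image loaded from -iimage_file, so the command calls samtools directly. The function name submit_stats_job, the DRY_RUN switch, and the file names other than data/NA12878.bam are hypothetical placeholders.

```shell
# Snapshot file uploaded to project storage earlier in this chapter.
SNAPSHOT="images/samtools_image.tar.gz"
# Dry-run by default: set DRY_RUN=0 to actually submit the jobs.
DRY_RUN="${DRY_RUN:-1}"

# Submit (or print) one samtools stats job for a single BAM file.
submit_stats_job() {
  local bam="$1"
  local sample
  sample="$(basename "$bam" .bam)"   # data/NA12878.bam -> NA12878
  # The output name is computed locally, so no worker-side variables are needed.
  local cmd="samtools stats * > ${sample}.stats.txt"
  if [ "$DRY_RUN" = "1" ]; then
    echo "dx run app-swiss-army-knife -iimage_file=\"$SNAPSHOT\" -iin=\"$bam\" -icmd=\"$cmd\" -y"
  else
    dx run app-swiss-army-knife -iimage_file="$SNAPSHOT" -iin="$bam" -icmd="$cmd" -y
  fi
}

# sample2.bam and sample3.bam are placeholders for your own files.
for bam in data/NA12878.bam data/sample2.bam data/sample3.bam; do
  submit_stats_job "$bam"
done
```

Each job receives only its own BAM file as input, so the * in the command matches a single file on the worker.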
10.6 Extending a Docker Image
One thing that you might do is extend a Docker image by adding additional software. You can do this by opening up an interactive mode and installing within the container.
What is interactive mode? When you pull a docker image in your ttyd session (Section 10.4.4), you can issue a docker run command with these options:
docker run -it ubuntu:18.04 /bin/bash
It will open up a bash shell in the container.
10.6.1 Pulling a Base Image
We’ll start out with the official ubuntu 18.04 container in our ttyd session:
docker pull ubuntu:18.04
docker images
10.6.2 Open up interactive mode
In ttyd, now enter an interactive session:
docker run -it ubuntu:18.04 /bin/bash
If it works, you will open up a bash prompt in the container.
You’ll know you’re in the container if you do an ls and your filesystem looks different.
10.6.3 Install Software
Now, let’s install EMBOSS (European Molecular Biology Open Software Suite), which is a suite of string utilities for working with genomic data. If you look at the EMBOSS link, you will see that you can install it via apt install, which is available by default in the ubuntu container.
apt update && apt upgrade
apt install emboss gzip -y
10.6.4 Exit Container
Now exit from your container’s interactive mode:
exit
You’ll be back at the normal ttyd prompt.
10.6.5 docker commit/docker save your new snapshot file
We created a new container when we installed everything. We’ll need to find its ID in ttyd.
docker ps -a
The output lists our new container along with its ID. We can use this ID to create a new image with docker commit. Then we can save the snapshot file by using docker save:
docker commit <container_id> emboss:6.6.0
docker save emboss:6.6.0 | gzip > emboss.tar.gz
dx upload emboss.tar.gz --destination images/
10.6.6 Other uses of Interactive Mode
Docker’s interactive mode is really helpful for testing out scripts and making sure they are reproducible.
If I have a one-off analysis, it may be faster for me to just open up ttyd and use docker run to open up interactive mode, and do work with a container.
10.7 Making Dockerfiles
The other way to build image files is to use a Dockerfile. A Dockerfile is a recipe for installing software and its dependencies.
Let’s take a look at a Dockerfile. By default, it is contained within a folder and is called Dockerfile:
FROM ubuntu:18.04
# Install build tools and wget, then clean the apt cache to keep the image small
RUN apt-get update && \
    apt-get install -y build-essential && \
    apt-get install -y wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
ENV CONDA_DIR=/opt/conda
# Install Miniconda non-interactively (-b) into /opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
    /bin/bash ~/miniconda.sh -b -p /opt/conda
ENV PATH=$CONDA_DIR/bin:$PATH
# Install plink and samtools with conda (-y skips the confirmation prompt)
RUN conda install -y -c "bioconda/label/cf201901" plink
RUN conda install -y -c "bioconda/label/cf201901" samtools
We can build the Docker image in our directory using:
docker build . -t gatk_sam_plink:0.0.1
When it’s done, we can then make sure it’s been built by using
docker images
And we can use it like any other image.
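The EMBOSS image that was built interactively in Section 10.6 could also be expressed as a Dockerfile, which records the installation steps instead of relying on what was typed into the container. The sketch below writes such a Dockerfile into a temporary folder from the shell. The build_dir variable is a name introduced here, and the Dockerfile contents are an assumption based on the apt install step shown earlier, not a tested recipe from the course.

```shell
# Create a scratch build folder; Docker expects a file named Dockerfile in it.
build_dir="$(mktemp -d)"

# Write the recipe: same base image and packages as the interactive session.
cat > "$build_dir/Dockerfile" <<'EOF'
FROM ubuntu:18.04
RUN apt-get update && \
    apt-get install -y emboss gzip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
EOF

echo "Wrote $build_dir/Dockerfile"

# On a machine with docker (such as the ttyd instance), build and save it:
#   docker build "$build_dir" -t emboss:6.6.0
#   docker save emboss:6.6.0 | gzip > emboss.tar.gz
#   dx upload emboss.tar.gz --destination images/
```

Keeping the Dockerfile alongside your analysis scripts means the image can be rebuilt later, which is harder to do with an image produced by docker commit.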
10.8 Going Further with Docker
Now that you know how to build a snapshot file, you’ve also learned another step in building apps: specifying software dependencies. You can use these snapshot files to specify executables in your app.
You can also use these snapshot files in your WDL workflow.
10.9 What you learned in this chapter
- How containers enable reproducibility
- Defined specific container terminology
- Created snapshot files using ttyd
- Used these snapshot files with Swiss Army Knife
- How to extend a docker image by installing new software