10  Containers and Reproducibility

Prep for Exercises

Make sure you are logged into the platform using dx login and that your course project is selected with dx select.

In your shell (either on your machine or in binder), make sure you’re in the bash_bioinfo_scripts/containers/ folder:

cd containers/

10.1 Learning Objectives

  1. Explain the benefits of using containers on DNAnexus for reproducibility and for batch processing
  2. Define the terms image, container, and snapshot in the context of Docker
  3. Create snapshots on RAP using docker pull and docker save with the ttyd app
  4. Utilize containers to batch process files on RAP
  5. Extend a docker image by installing software within interactive mode
  6. Build a docker image using Dockerfiles

10.2 Why Containers?

There is a replication crisis out there. Even given a script and the raw data, it is often difficult to replicate the results generated by a study.

Why is this difficult? Many others have talked about this, but one simple reason is that the results are tied to software and database versions.

This is the motivation for using containers - they are a way of packaging software that ‘freezes’ the software versions. If you provide the container that you used to generate the results, other people should be able to replicate your results even if they’re on a different operating system.

10.3 Terminology

In order to be unambiguous with our language, we’ll use the following definitions:

Figure 10.1: Docker Terms 1
  • Registry - a collection of repositories that you pull Docker images from. Example registries include DockerHub and Quay.io.
  • Docker Image - what you download from a registry: the “recipe” for building the software environment. Stored in a registry. Use docker pull to get an image and docker push to send one to a registry. You can also create an image from a container with docker commit, or build one from a Dockerfile.
  • Docker Container - the executable software environment running on a machine. Runnable. Generated from an image with docker run.
  • Snapshot File - a single archive file (.tar.gz) that contains the Docker image. Generated by running docker save on an image. Also known as an image file on the platform.

10.4 Building Docker Snapshot Files on the DNAnexus platform

10.4.1 The Golden Rule of Docker and Batch Analysis

DockerHub has a pull limit of 200 pulls/day/user. You will hit this limit quickly if every job pulls the image by its URL.

So, if you are processing more than 200 files (or jobs), you should save the Docker image into platform storage as a snapshot file.

Let’s talk about the basic snapshot building process.

10.4.2 Be Secure

Before we get started: security is always a concern when running Docker images. The docker group has elevated status on a system, so we need to be careful that the containers we run don’t introduce any system vulnerabilities.

These concerns matter most when running containers that are web servers or part of a web stack, but they are also worth thinking about when running jobs on the cloud.

Here are some guidelines to think about when you are working with a container.

  • Use vendor-specific Docker Images when possible.
  • Use container scanners to spot potential vulnerabilities. DockerHub has a vulnerability scanner that scans your Docker images for potential vulnerabilities.
  • Avoid kitchen-sink images. One issue is when an image is built on top of many other images. It makes it really difficult to plug vulnerabilities. When in doubt, use images from trusted people and organizations.

10.4.3 The Basic Snapshot Building Process

flowchart TD
  A[start ttyd] --> B[docker pull <br> from registry]
  B --> C[docker save to <br> snapshot file]
  C --> D[dx upload <br> snapshot to <br> project storage]
  D --> E[terminate ttyd]
Figure 10.2: Building a docker snapshot on the DNAnexus platform.

10.4.4 Building Snapshot Files in ttyd

Up until now, we have been using our own machine or the binder shell for doing our work.

We’re going to pull up a web-enabled shell on a DNAnexus worker with the ttyd app. ttyd is useful because:

  1. docker is already installed, so we can docker pull our container and docker save our snapshot to the ttyd instance.
  2. It’s much faster to transfer our snapshot file back into project storage with dx upload.

To open ttyd, open the Tool Library under Tools and select your project.

Opening ttyd

10.4.5 Pull your image from a registry

#| eval: false
docker pull quay.io/biocontainers/samtools:1.15.1--h1170115_0

On your ttyd instance, do a docker pull to pull your image from the registry. Note that we’re pulling samtools from quay.io here, from the biocontainers user.

We’re also specifying a version tag, 1.15.1--h1170115_0, to tie our samtools to a specific version. This is important: if you leave the tag off, docker pull defaults to the latest tag, which is not tied to a specific version and can change over time. So make sure to tie your image to a specific version.

When you’re done pulling the docker image, try out the docker images command.

docker images
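
Because an untagged reference falls back to latest, it can be handy to check an image reference before pulling it. The following is a minimal sketch, not part of Docker itself: the function name is_pinned is hypothetical, and it only inspects the text of the reference without contacting a registry.

```shell
# is_pinned IMAGE_REF
# Succeeds (exit 0) if IMAGE_REF names an explicit, non-"latest" tag or a digest.
# Hypothetical helper: string inspection only, no registry lookup.
is_pinned() {
  ref="$1"
  # Drop the registry/namespace part so a registry port (host:5000/...)
  # is not mistaken for a tag.
  name="${ref##*/}"
  case "$name" in
    *@*)      return 0 ;;  # digest reference such as name@sha256:...
    *:latest) return 1 ;;  # explicit, but still a floating tag
    *:*)      return 0 ;;  # explicit version tag
    *)        return 1 ;;  # no tag: docker pull would use "latest"
  esac
}
```

One possible use is to guard the pull: is_pinned "$image" && docker pull "$image", where image holds a reference such as quay.io/biocontainers/samtools:1.15.1--h1170115_0.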

10.4.6 Try your docker image out

Now that we have our docker image downloaded, we can test it out by running samtools --help. This should give us the help message.

#| eval: false
docker run quay.io/biocontainers/samtools:1.15.1--h1170115_0 samtools --help

10.4.7 Save your docker image as a snapshot

Now that we’ve pulled the image, we are going to save it as a snapshot file using docker save. We pipe the output of docker save into gzip to save it as samtools_image.tar.gz.

#| eval: false

docker save quay.io/biocontainers/samtools:1.15.1--h1170115_0 | gzip > samtools_image.tar.gz
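
One caveat with a pipe is that, by default, the shell reports only the exit status of the last command (gzip), so a failed docker save can go unnoticed. As a hedged alternative, the sketch below writes the archive with docker save -o first, then compresses and verifies it. The function name save_snapshot and its arguments are hypothetical.

```shell
# save_snapshot IMAGE OUT_TAR
# Writes IMAGE to OUT_TAR, compresses it to OUT_TAR.gz, and checks the archive.
# Hypothetical helper; assumes docker and gzip are on the PATH.
save_snapshot() {
  image="$1"
  out_tar="$2"
  # Writing with -o (instead of piping) lets us see docker save's exit status.
  docker save -o "$out_tar" "$image" || return 1
  # gzip replaces OUT_TAR with OUT_TAR.gz; -f overwrites an older archive.
  gzip -f "$out_tar" || return 1
  # gzip -t exits non-zero if the compressed archive is corrupt.
  gzip -t "${out_tar}.gz"
}
```

For example, save_snapshot quay.io/biocontainers/samtools:1.15.1--h1170115_0 samtools_image.tar would leave samtools_image.tar.gz in the current directory.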

10.4.8 Upload your snapshot

Now we can get our image back into project storage. We’ll create a folder called images/ with dx mkdir and then use dx upload to get our snapshot file into the images/ folder.

#| eval: false
dx mkdir images/
dx upload samtools_image.tar.gz --destination images/
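
If you repeat this step often, the two commands can be wrapped so that the upload is followed by a listing of the folder. This is only a sketch under a few assumptions: the function name upload_snapshot is hypothetical, dx mkdir is called with -p so that an existing folder is not an error, and the folder is always passed to dx upload with a trailing slash so it is treated as a folder rather than a file name.

```shell
# upload_snapshot FILE FOLDER
# Creates FOLDER in the current project if needed, uploads FILE into it,
# and lists the folder so you can confirm the snapshot arrived.
# Hypothetical helper; assumes you are logged in and a project is selected.
upload_snapshot() {
  file="$1"
  folder="${2%/}/"          # normalise to exactly one trailing slash
  dx mkdir -p "$folder" || return 1
  dx upload "$file" --destination "$folder" || return 1
  dx ls "$folder"
}
```

For example: upload_snapshot samtools_image.tar.gz images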

10.4.9 Important: make sure to terminate your ttyd instance!

One thing to remember is that there is no timeout associated with ttyd. You will get a reminder email once it has been open for 24 hours, but you will get no warning after that.

So make sure to use dx terminate or terminate the ttyd job under the Manage tab.
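
From the command line, one way to find the job to terminate is to list what is still running first. The sketch below assumes that dx find jobs accepts --state running and --brief (which prints only job IDs). The function names are hypothetical, and which jobs are listed depends on the defaults of dx find jobs in your project.

```shell
# running_jobs
# Prints the IDs of jobs that dx reports as running. Hypothetical helper.
running_jobs() {
  dx find jobs --state running --brief
}

# terminate_jobs JOB_ID...
# Terminates each job ID given as an argument. Hypothetical helper.
terminate_jobs() {
  for job in "$@"; do
    dx terminate "$job"
  done
}
```

A cautious workflow is to run running_jobs, pick out the ID that belongs to your ttyd session, and pass only that ID to terminate_jobs.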

10.5 Using Docker with Swiss Army Knife

Now that we’ve built our Docker snapshot, let’s use it in Swiss Army Knife.

Swiss Army Knife has two separate inputs associated with Docker:

  • -iimage_file - This is where you put the snapshot file (such as the samtools_image.tar.gz we just uploaded)
  • -iimage - This is where you’d put the Docker URL (such as quay.io/ucsc_cgl/samtools)

So, let’s run a samtools job using our Docker snapshot.

#| eval: false
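# The command runs inside the container loaded from image_file, so we call samtools directly.
# Single quotes keep ${in_prefix} from being expanded by your local shell.
dx run app-swiss-army-knife \
  -iimage_file="images/samtools_image.tar.gz" \
  -iin="data/NA12878.bam" \
  -icmd='samtools stats * > ${in_prefix}.stats.txt'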

The main thing that has changed here is that we’ve added the -iimage_file input to our dx run statement.
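
The same statement extends naturally to batch processing: submit one Swiss Army Knife job per input file, all sharing the same snapshot, so no job has to pull from a registry. The following is a sketch rather than a definitive recipe. The function name submit_stats_jobs is hypothetical, the -y and --brief flags are used to skip the confirmation prompt and print only the job ID, and it assumes (as we understand Swiss Army Knife) that cmd runs inside the container loaded from image_file.

```shell
# submit_stats_jobs SNAPSHOT BAM...
# Submits one samtools stats job per BAM file, each using the same snapshot file.
# Hypothetical helper; SNAPSHOT and BAM are paths in the current project.
submit_stats_jobs() {
  snapshot="$1"
  shift
  for bam in "$@"; do
    # Single quotes around the command keep ${in_prefix} from being expanded
    # locally; it is resolved on the worker.
    dx run app-swiss-army-knife \
      -iimage_file="$snapshot" \
      -iin="$bam" \
      -icmd='samtools stats * > ${in_prefix}.stats.txt' \
      -y --brief
  done
}
```

For example: submit_stats_jobs images/samtools_image.tar.gz data/NA12878.bam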

10.6 Extending a Docker Image

One thing that you might do is extend a Docker image by adding additional software. You can do this by opening up an interactive mode and installing within the container.

What is interactive mode? When you pull a docker image in your ttyd session (Section 10.4.4), you can issue a docker run command with these options:

docker run -it ubuntu:18.04 /bin/bash

It will open up a bash shell in the container.

10.6.1 Pulling a Base Image

We’ll start out with the official ubuntu 18.04 container in our ttyd session:

#| eval: false
docker pull ubuntu:18.04
docker images

10.6.2 Open up interactive mode

In ttyd, now enter an interactive session:

docker run -it ubuntu:18.04 /bin/bash

If it works, you will open up a bash prompt in the container.

You’ll know you’re in the container if you do an ls and your filesystem looks different.

10.6.3 Install Software

Now, let’s install EMBOSS (European Molecular Biology Open Software Suite), which is a suite of string utilities for working with genomic data. If you look at the EMBOSS link, you will see that you can install it via apt install, which is available by default in the ubuntu container.

#| eval: false
apt update && apt upgrade
apt install emboss gzip -y

10.6.4 Exit Container

Now exit from your container’s interactive mode:

#| eval: false
exit

You’ll be back at the normal ttyd prompt.

10.6.5 docker commit/docker save your new snapshot file

We created a new container when we installed everything. We’ll need to find its ID in ttyd.

#| eval: false
docker ps -a

The output lists our new container along with its ID. We can use this ID to save the container as a new image with docker commit. Then we can save the snapshot file by using docker save:

#| eval: false
docker commit <container_id> emboss:6.6.0
docker save emboss:6.6.0 | gzip > emboss.tar.gz
dx upload emboss.tar.gz --destination images/
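
If the container you just exited is the most recently created one, you can avoid copying the ID by hand. The sketch below relies on docker ps -l -q, which prints only the ID of the latest created container. The function name commit_latest is hypothetical, and the approach assumes no other container was started in between.

```shell
# commit_latest NEW_IMAGE
# Commits the most recently created container as NEW_IMAGE (e.g. emboss:6.6.0).
# Hypothetical helper; assumes the latest container is the one you modified.
commit_latest() {
  new_image="$1"
  # -l: only the latest created container; -q: print just its ID
  container_id="$(docker ps -l -q)"
  if [ -z "$container_id" ]; then
    echo "no container found" >&2
    return 1
  fi
  docker commit "$container_id" "$new_image"
}
```

For example: commit_latest emboss:6.6.0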

10.6.6 Other uses of Interactive Mode

Docker’s interactive mode is really helpful for testing out scripts and making sure they are reproducible.

If I have a one-off analysis, it may be faster for me to just open up ttyd and use docker run to open up interactive mode, and do work with a container.

10.7 Making Dockerfiles

The other way to build image files is to use a Dockerfile. A Dockerfile is a recipe for installing software and its dependencies.

Let’s take a look at a Dockerfile. By default, it is contained within a folder and is called Dockerfile:

FROM ubuntu:18.04

RUN apt-get update && \
    apt-get install -y build-essential  && \
    apt-get install -y wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

ENV CONDA_DIR=/opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
    /bin/bash ~/miniconda.sh -b -p $CONDA_DIR

ENV PATH=$CONDA_DIR/bin:$PATH

# install plink and samtools with conda (-y answers the confirmation prompt during the build)
RUN conda install -y -c "bioconda/label/cf201901" plink
RUN conda install -y -c "bioconda/label/cf201901" samtools

We can build the Docker image in our directory using:

docker build . -t gatk_sam_plink:0.0.1

When it’s done, we can then make sure it’s been built by using

docker images

And we can use it like any other image.
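
Before saving a freshly built image as a snapshot, it is worth confirming that the tools you installed are actually on the PATH inside it. Here is a minimal sketch of such a check. The function name check_tools is hypothetical, and it assumes the image provides sh.

```shell
# check_tools IMAGE TOOL...
# Reports whether each TOOL can be found on the PATH inside IMAGE.
# Stops with a non-zero exit status at the first missing tool.
# Hypothetical helper; assumes the image contains sh.
check_tools() {
  image="$1"
  shift
  for tool in "$@"; do
    # --rm removes the temporary container once the check finishes
    if docker run --rm "$image" sh -c "command -v $tool" > /dev/null; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      return 1
    fi
  done
}
```

For example: check_tools gatk_sam_plink:0.0.1 plink samtools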

10.8 Going Further with Docker

Now that you know how to build a snapshot file, you’ve also learned another step in building apps: specifying software dependencies. You can use these snapshot files to specify executables in your app.

You can also use these snapshot files in your WDL workflow.

10.9 What you learned in this chapter

  • How containers enable reproducibility
  • Specific container terminology
  • How to create snapshot files using ttyd
  • How to use these snapshot files with Swiss Army Knife
  • How to extend a docker image by installing new software