9 Container Basics

9.1 Learning Objectives

Explain the benefits of using containers for reproducibility and for batch processing
Define the terms image, container, and snapshot in the context of Docker

9.2 Why Containers?

Research is facing a replication crisis. Even given a script and the raw data, it is often difficult to replicate the results generated by a study.

Why is this difficult? One reason is that the results are tied to software and database versions.

This is one motivation for using containers - they are a way of packaging software that ‘freezes’ the software versions. Containers allow you to recreate the software environment associated with running your software.

Additionally, containers are portable - they function as mini-operating systems that can run on any machine that runs your container software. Other people should be able to replicate your results even if they’re on a different operating system.

9.3 Terminology

In order to be unambiguous with our language, we’ll use the following definitions:

[Diagram: Docker Image -> (apptainer pull / apptainer build) -> Docker Container -> (apptainer run) -> Snapshot]

Registry - a collection of repositories that you pull Docker images from. Example registries include DockerHub and Quay.io.
Image - what you download from a registry - the “recipe” for building the software environment. Images are stored in a registry. Use apptainer pull to get an image from a registry. You can also generate an image from a Dockerfile.
Container - the executable software environment actually installed and running on a machine. Containers are runnable. A container is created when you run an image, for example with docker run.
Snapshot File - a single archive file (.tar.gz) that contains the Docker image. Generate using docker save on an image (docker save writes a tar archive, which can then be compressed). Also known as an image file. When we do
Dockerfile - a file that specifies how to install software and its dependencies. You often base a Dockerfile on an existing Docker image.
Tag - a bit of metadata that is used to version an image.
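
To make the registry, repository, and tag terms concrete, here is a minimal Python sketch that splits an image reference into those three parts. It is a simplified approximation of Docker's naming convention, not Docker's own parser: it ignores digests (the @sha256:... form), and the function name parse_image_ref and the example tags are assumptions made for illustration.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    registry: str
    repository: str
    tag: str


def parse_image_ref(ref: str) -> ImageRef:
    """Split 'registry/repository:tag' into its parts (simplified; no digests)."""
    first, slash, rest = ref.partition("/")
    # The first component counts as a registry host only if it looks like a
    # hostname (contains '.' or ':') or is 'localhost'; otherwise the
    # reference is assumed to live on the default registry, docker.io.
    if slash and ("." in first or ":" in first or first == "localhost"):
        registry, path = first, rest
    else:
        registry, path = "docker.io", ref
    # The tag, if present, follows the ':' in the final path component.
    head, _, last = path.rpartition("/")
    name, colon, tag = last.partition(":")
    repository = f"{head}/{name}" if head else name
    # Single-name images on docker.io belong to the 'library' namespace.
    if registry == "docker.io" and "/" not in repository:
        repository = f"library/{repository}"
    # An omitted tag defaults to 'latest'.
    return ImageRef(registry, repository, tag if colon else "latest")


# Illustrative references; the tags are hypothetical.
for ref in ("quay.io/biocontainers/samtools:1.19", "biocontainers/bwa:0.7.17", "ubuntu"):
    print(ref, "->", parse_image_ref(ref))
```

Note that a reference without a tag falls back to latest, which is one reason to write tags out explicitly when the goal is reproducibility.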

9.4 What is a Container?

A container is a self-contained unit of software. It contains everything needed to run the software on a variety of machines. If you have the container software installed on your machine, it doesn’t matter whether it is macOS, Linux, or Windows - the container will behave consistently across different operating systems and, when the image is built for them, across architectures.

The container has the following contents:

Software - the software we want to run in a container. For bioinformatics work, this is usually an aligner such as bwa, or a utility such as samtools.
Software Dependencies - various software packages needed to run the software. For example, if we want to run tidyverse in a container, we need to have R installed in the container as well.
Filesystem - containers have their own isolated filesystem that can be connected to the “outside world” - everything outside of the container. We’ll learn more about customizing these with bind paths (Section 10.4).

In short, the container has everything needed to run the software. It is not a full operating system, but a smaller mini-version that cuts out a lot of cruft.

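Containers are ephemeral: files written inside a container do not persist once the container is removed. To keep files, containers leverage the file system of their host. The connections between the host file system and the container are called Volumes (the Docker term) or Bind Paths (the Apptainer term).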

What about Virtual Machines?

Virtual Machines (such as those managed with Vagrant) contain a full operating system, such as Linux or Windows, in addition to the software and dependencies. They are made to be persistent and resumable, which means that they tend to be much larger than containers.

9.5 Docker vs. Apptainer

There are two basic ways to run Docker containers:

Using the Docker software
Using the Apptainer software (for HPC systems)

In general, Docker is used on systems where you have a high level of access to the system. This is because Docker uses a special user group called docker that has essentially root-level privileges.

This is not the case for HPC systems, which are shared. There we use Apptainer (which used to be called Singularity), which requires a much lower level of user privileges to execute tasks. For more info, see Section 10.2.
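
As a rough illustration of how the two tools differ at the command line, the Python sketch below builds (but does not run) the arguments for the same task under each runtime, including a host directory made visible inside the container. The helper name container_command, the image name, and the paths are hypothetical; the flags used (docker run --rm -v, apptainer exec --bind, and the docker:// prefix) are the standard ones, but check your site's documentation for local conventions.

```python
def container_command(runtime, image, host_dir, container_dir, command):
    """Build (but do not run) the argument list for one containerized command."""
    mount = f"{host_dir}:{container_dir}"
    if runtime == "docker":
        # --rm removes the container once the command exits;
        # -v makes the host directory available inside the container.
        return ["docker", "run", "--rm", "-v", mount, image, *command]
    if runtime == "apptainer":
        # --bind sets a bind path; the docker:// prefix tells Apptainer
        # to fetch the image from a Docker registry.
        return ["apptainer", "exec", "--bind", mount, f"docker://{image}", *command]
    raise ValueError(f"unknown runtime: {runtime!r}")


# Hypothetical image name and paths, for illustration only.
image = "biocontainers/samtools:1.19"
for runtime in ("docker", "apptainer"):
    cmd = container_command(runtime, image, "/home/me/data", "/data",
                            ["samtools", "--version"])
    print(" ".join(cmd))
```

On a machine where the runtime is installed, the resulting list could be passed to subprocess.run(cmd, check=True) to actually launch the container.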

9.6 Be Secure

Security is always a concern when running containers, so it is worth considering before we get started. The docker group has elevated status on a system, so we need to be careful that the containers we run aren’t introducing any system vulnerabilities. Note that on HPC systems, the main mechanism for running containers is Apptainer, which is designed to be more secure.

These concerns matter most when running containers that are web servers or part of a web stack, but they are also important to think about when running jobs on HPC.

Here are some guidelines to think about when you are working with a container.

Use vendor-specific Docker Images when possible.
Use container scanners to spot potential vulnerabilities. DockerHub has a vulnerability scanner that scans your Docker images for potential vulnerabilities.
Avoid kitchen-sink images. One issue is when an image is built on top of many other images. It makes it really difficult to plug vulnerabilities. When in doubt, use images from trusted people and organizations.

Composing Containers

We can connect multiple containers together using a technology called Docker Compose. This means that we can run software pipelines that involve multiple containers.

For the most part, we don’t leverage Docker Compose, because it is usually used to put together web application stacks, though it is sometimes helpful. In those cases, we can use Singularity Compose to do it.

9.7 What you learned in this chapter

How containers enable reproducibility
The definitions of specific container terminology