6  Working with files using dx run

Prep for Exercises

Make sure you are logged into the platform using dx login and that your course project is selected with dx select.

In your shell, make sure you’re in the bash_bioinfo_scripts/dx-run/ folder:

cd dx-run/

6.1 Learning Objectives

  1. Utilize an all purpose app (swiss-army-knife) to run bioinformatics jobs using dx run on files in a DNAnexus project
  2. Utilize multiple inputs in a Swiss Army Knife job.
  3. Explain and utilize multiple options in a Swiss Army Knife job.
  4. Wrap R and Python scripts using a bash script to run it on the cloud using dx run
  5. Tag output files on the platform for downstream use.

6.2 Doing Bioinformatics with the Swiss Army Knife (SAK) App

Let’s focus on key bioinformatics tasks we need to accomplish. To make things easier, we’ll be using Swiss Army Knife, which is an app on the platform that contains a number of commonly used bioinformatics utilities, including:

  • bcftools
  • bedtools
  • BGEN
  • bgzip
  • BOLT-LMM
  • Picard
  • Plato
  • plink
  • plink2
  • QCTool
  • REGENIE
  • sambamba
  • samtools
  • seqtk
  • tabix
  • vcflib
  • vcftools

Swiss Army Knife also contains R and Python 3.

Everything you’ll learn in this section will also be applicable to building apps on the platform as well. You’ll learn the foundations of running a script on the platform, which is halfway to building your own apps on the platform.

6.3 Running Jobs on the DNAnexus platform using dx run

Our main command for running jobs is dx run. dx run lets us submit jobs to be run on the platform. These jobs can be to process files (such as aligning FASTQ files to a genome), or they can be for web apps, such as LocusZoom (for visualizing).

If you have used SLURM on an on-premise HPC system, the equivalent command would be srun, and if you have used PBS, the equivalent command would be sbatch.

6.4 Try out your first job

In the sample project, you will see a .bam file in data/ called NA12878.bam with no index. Let’s create an index for this file with sambamba by running it with dx run app-swiss-army-knife.

#| eval: false
#| filename: dx-run/run-sambamba.sh

dx run app-swiss-army-knife \
  -iin="/data/NA12878.bam" \
  -icmd="sambamba index *"

Let’s take this code apart. We start our command with dx run and the name of the app on the platform we want to use: app-swiss-army-knife.

The second line contains the file input we want to process, which is a file in our current project. Note that the input is specified as -iin, not --in or --iin. Using a single hyphen - here instead of a double hyphen -- for our inputs is different than we might expect for other linux/UNIX parameters.

The third line is the actual command we want to run in Swiss Army Knife. We want to run sambamba index on our input file.

When you run the above code, either by pasting it into your shell, or using bash run-sambamba.sh you’ll see the following response.

Using input JSON:
{
    "cmd": "sambamba index *",
    "in": [
        {
            "$dnanexus_link": {
                "project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q",
                "id": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"
            }
        }
    ]
}

Confirm running the executable with this input [Y/n]: Y
Calling app-GFxJgVj9Q0qQFykQ8X27768Y with output destination
  project-GGyyqvj0yp6B82ZZ9y23Zf6q:/

Job ID: job-GJ0G58Q0yp65gzJzKjYv04bZ
Watch launched job now? [Y/n] Y

Respond with Y when it asks

Confirm executable with this input [Y/n]:

And when it asks you

Watch launched job now? [Y/n]

Try it out!

Run the above code (also in dx-run/run-sambamba.sh) and look at the log that is generated. Take a look at it carefully, especially the output log.

What file does it create? Where does it create that file?

If you use dx tree to take a look at the project, you’ll see that there was a file NA12878.bam.bai that was created in the root of our project.

dx tree

── data
│   ├── NA12878.bai
│   ├── NA12878.bam
│   ├── NC_000868.fasta
│   ├── NC_001422.fasta
│   ├── small-celegans-sample.fastq
│   ├── SRR100022_chrom20_mapped_to_b37.bam
│   ├── SRR100022_chrom21_mapped_to_b37.bam
│   └── SRR100022_chrom22_mapped_to_b37.bam
└── NA12878.bam.bai

If we wanted to control where our index file ended up, we can use the --destination option to specify our destination, such as

--destination data/

6.4.1 What about the outputs?

One of the nice things about Swiss Army Knife is that it automatically transfers files that we generate on the worker. In our above job, we generated the corresponding index file NA12878.bam.bai from our -icmd input. It was automatically transferred over from the worker to the root of our project.

If we wanted to change the output folder, we can use the --destination flag, like below:

#| eval: false

dx run app-swiss-army-knife \
  -iin="/data/NA12878.bam" \
  -icmd="sambamba index *" \
  --destination "data/"

Another nice thing about dx run is that it will automatically create the folder we specify with --destination if it doesn’t exist in project storage.

What if we run our statement multiple times?

We can actually run our dx run statement multiple times. The output will be multiple files with the same name in the destination directory.

What’s going on here? Well, files are considered as objects on the platform. Each copy generated by our dx run are considered unique objects on the platform, with separate file identifiers.

6.5 Building on our first dx run job

We’ve seen how to specify file inputs to Swiss Army Knife. Let’s dive into some of the intricacies of using the -icmd input.

6.5.1 Submitting a script to the worker

The -icmd input is helpful for small scripting jobs, but you often want to do a bunch of things in a Swiss Army Knife job.

Bash scripting to the rescue! Say we have a script called run-on-worker.sh that takes 1 positional input, which is a file path. We’ll run this script on the worker:

#| eval: false
#| filename: worker-scripts/run-on-worker.sh
sambamba $1 > $1.bai
samtools stats $1 > $1.stats.txt

When we set up a dx run app-swiss-army-knife, we’ll use this script as one of the inputs to dx run:

#| eval: false
#| filename: dx-run/dx-run-script.sh
dx run app-swiss-army-knife \
  -iin="worker_scripts/run-on-worker.sh" \
  -iin="data/NA12878.bam" \
  -icmd="bash run-on-worker.sh $in_path"

The magic here is that we’re using bash in our cmd input to run the script run-on-worker.sh. Notice this script is already in our project in the worker_scripts/ folder, so we can submit it as a file input:

-iin="worker_scripts/run-on-worker.sh"
Why this is important on the platform

This is a common pattern that we’ll use when we need to do more complicated things with Swiss Army Knife. It doesn’t seem that helpful right now, but it will once we get to batch scripting.

You’ll notice that we are using a special helper variable here called ${in_path} - this gives the current path of the file on the worker. It is tremendously helpful in scripting. Let’s learn about these built-in helper variables next.

6.5.2 Bash Helper Variables

There are some helper variables based on the file inputs specified in -iin.

These variables are really helpful in scripting in our -icmd input. Say we ran our command:

dx run app-swiss-army-knife \
  -iin="/data/NA12878.bam" \ 
  -icmd="sambamba index ${in_path}"

You’ll see that we are using a special variable called $in_path here to specify the file name. This is called a helper variable - they are available based on the different file inputs we submit to Swiss Army Knife.

For our one input -iin="data/NA12878.bam", this is what these helper variables contain:

Bash Variable Value Notes
$in data/NA12878.bam Location in project storage
$in_path ~/NA12878.bam Location in worker storage
$in_name NA12878.bam File name (without folder path)
$in_prefix NA12878 File name without the suffix

$in_prefix is really helpful when you want to create files that have the same prefix as our file. Say we wanted to run samtools view -c on our file. We want a file called NA12878.count.txt that contains the read counts.

dx run app-swiss-army-knife \
  -iin="data/NA12878.bam" \
  -icmd="samtools view -c ${in_path} > ${in_prefix}.counts.txt"

What happens when the command is run on the worker is that the variables are expanded like this:

samtools view -c ~/NA12878.bam > NA12878.counts.txt
What is the difference between $in and $in_path?

You may have looked at the above table and been slightly confused about these two variables.

$in is the file location in project storage, while $in_path is the file location in the temporary worker storage.

When writing scripts that run on the worker with the -icmd input, $in_path is much more useful.

Why are we learning about helper variables?

Writing scripts that can be run on the worker is part of the app building process, which lets us specify our own executables.

If you can get a handle on working with helper variables in -icmd, then you are most of the way to building an app.

6.6 More Swiss Army Knife

There is a table in the Swiss Army Knife documentation that specifies the inputs and how to configure the -icmd parameter for each operation.

One thing you’ll notice is that some of them require multiple file inputs. How do you specify them?

6.6.1 Multiple file inputs to Swiss Army Knife

One thing that may not be apparent to you is that you can specify multiple file inputs with multiple -iin options.

The Swiss Army Knife Documentation notes that -iin is actually an array file input. That means that we can submit multiple files like this:

#| eval: false
#| filename: dx-run/dx-run-multiple-inputs.sh
dx run app-swiss-army-knife \
  -iin="data/SRR100022_chrom20_mapped_to_b37.bam" \
  -iin="data/SRR100022_chrom21_mapped_to_b37.bam" \
  -iin="data/SRR100022_chrom22_mapped_to_b37.bam" \
  -icmd="ls *.bam | xargs -I% sh -c 'samtools view -c % > %.count.txt'"

If we want to process each of these files, we’ll have to use xargs in our icmd statement:

-icmd="ls *.bam | xargs -I% sh -c 'samtools view -c % > %.count.txt'"

We’ll cover more of this in Chapter 7.

6.7 dx run runs deep

If you’ve looked at the dx run help page, you’ll notice it is quite long. There are a lot of options. Let’s look at a few of them that are really helpful when you’re getting started.

  • --tag - please use these. Your jobs will be findable with a tag.
  • --destination - destination folder in project. If the path doesn’t exist, it will be created.
  • --instance-type - Instance types used in the job.
  • --watch - run dx watch for the job id that is generated
  • -y - non-interactive mode. Replies yes to the questions and starts up dx watch for that job id.
  • --allow-ssh - very helpful in debugging jobs. Allows you to dx ssh into that worker and check out the status.
  • --batch-tsv - works with the tab-delimited files that you generate using dx generate_batch_inputs.
  • --detach - if a subjob of a job, it detaches the job. This is really important in batch operations where a top-level job is generating a subjob. Running the subjobs as detached means that the entire top-level job will not fail if a subjob fails.

I’ll outline some examples where we leverage the options below.

6.7.1 Save our output files into a different folder

As we mentioned above, by default our output files are generated in the root folder of the project. If we want to control this, we can use the --destination option.

Remember, if the folder our --destination option doesn’t exist, dx run will create that folder and send the outputs into the newly created folder.

#| eval: false
#| filename: dx-run/run-destination.sh
dx run app-swiss-army-knife \
  -iin="data/chr1.bed" \
  -iin="data/chr1.bim" \
  -iin="data/chr1.fam" \
  -icmd="plink --bfile chr1 --geno 0.01 --make-bed --out chr1_qc" \
  --destination "results/"

6.7.2 Run our job on a different instance type

We can specify a different instance type (Section 4.7) with the --instance-type parameter.

#| eval: false
#| filename: dx-run/run-instance.sh
dx run app-swiss-army-knife \
  -iin="data/chr1.bed" \
  -iin="data/chr1.bim" \
  -iin="data/chr1.fam" \
  -icmd="plink --bfile chr1 --geno 0.01 --make-bed --out chr1_qc" \
  --instance-type "mem1_ssd1_v2_x8"

6.7.3 Tag and rename our job

We can change the name of our job using --name - this can be very helpful when running a bunch of jobs. Here we’re adding $in_prefix so we can see what Chromosome we’re running on. Very helpful in batch submissions.

Adding a tag to our job, such as the run number, can be very helpful in terminating a large set of jobs (Section 7.5.2).

#| eval: false
#| filename: dx-run/run-job-tag-name.sh
dx run app-swiss-army-knife \
  -iin="data/chr1.bed" \
  -iin="data/chr1.bim" \
  -iin="data/chr1.fam" \
  -icmd="plink --bfile chr1 --geno 0.01 --make-bed --out chr1_qc" \
  --tag "plink_job" --name "Run PLINK on fileset: $in_prefix"

6.7.4 Clone an failed job

If you have run a job on a lower priority, your job has a chance of getting bumped (stopped). If that’s the case, your job will fail.

That’s when using the --clone parameter can come in handy.

#| eval: false
dx run app-swiss-army-knife --clone <job-id>

When we learn about JSON (#sec-json), we’ll learn a recipe for restarting a list of these failed jobs.

Keep in Mind: dx run --tag

Keep in mind when you are using the --tag option with dx run, it does not tag the outputs of your jobs with that tag. It only tags the jobs.

6.8 dxFUSE: Simplify your scripts with multiple inputs

Ok, we learned about submitting files as inputs. Is there a way to run scripts with less typing?

There is a general method for working in Swiss Army Knife and other apps that lets you bypass specifying file inputs explicitly in your dx run statement: using the dxFUSE file system.

The main thing you need to know as a developer is that you can prepend a /mnt/project/ to your file path to use in your -icmd input directly. For the dx run statement in Section 6.6.2, we can rewrite it as the following:

#| eval: false
#| filename: dx-run/run-job_dxfuse.sh

dx run app-swiss-army-knife \
  -icmd="plink --bfile /mnt/project/data/plink/chr1 --geno 0.01 --make-bed --out chr1_qc"

In our -icmd input, we are referring to the file location of the triplet by pre-pending a /mnt/project/ to the project path of the triplet: data/chr1 to make a new path:

/mnt/project/data/chr1

This new path lets us specify files from project storage without directly specifying them as an input.

We could rewrite this command

#| eval: false

dx run app-swiss-army-knife \
  -iin="data/NA12878.bam" \
  -icmd="samtools view -c NA12878.bam > NA12878.counts.txt"

As this:

#| eval: false

dx run app-swiss-army-knife \
  -icmd="samtools view -c /mnt/project/data/NA12878.bam > NA12878.counts.txt"

We can also use an ls on a dxFUSE path in our -icmd. This is a pattern that will become especially helpful when we do process multiple files on a single worker (Section 7.6)

#| eval: false

dx run app-swiss-army-knife \
  -icmd="ls /mnt/project/data/*.bam | xargs -I% sh -c 'samtools stats % > \$(basename %).stats.txt'" \
  --destination "results/"

Our cmd input is kind of complicated here. Within the subshell, we’re taking advantage of the basename command, which returns the bare filename of our object. We need this because we can’t currently write using dxfuse.

sh -c 'samtools stats % > \$(basename %).stats.txt'

What’s going on with the \$(basename %)? We need to use the parentheses (()) because it’s in a subshell command, and we escape the $ in front of it, so that our base shell doesn’t expand it, since we want the shell on the worker to expand it instead.

So if our first file is /mnt/project/data/NA12878.bam, then the substitution on the worker goes like this:

samtools stats /mnt/project/data/NA12878.bam > NA12878.stats.txt

Most of these issues are because we need to specify subshells in our cmd and use dxfuse. Honestly, I would probably put the xargs code into a script, and submit that script into the worker, rather than specify it as a single line in our cmd input. (Section 6.5.1)

6.8.1 Advantages of dxFUSE

Most of my scripts leverage dxFUSE. Let’s talk about the advantages of using dxFUSE.

  • You don’t have to specify files as inputs. This avoids a lot of typing in your scripts. This can be very helpful if there are a number of files you need as inputs per job.
  • Allows you to stream files. dxFUSE is a file streaming method. That is, it will stream files bit by bit to the worker. This can make the overall job run somewhat faster, as it can stream files as needed, and not have to explicitly transfer each file to the worker.

6.8.2 Disadvantages of dxFUSE

There are some disadvantages to using dxFUSE you should be aware of:

  • It obscures your audit trail. Because we are not specifying the files as inputs, it is less obvious as to which files were processed with each job.
  • It works best as read-only. dxFUSE has limited write capabilities at this point, so it is best to treat it as a read-only file system.
  • You have to be aware of how it resolves duplicate file names in a folder. dxFUSE works differently than what you might expect when it encounters multiple file names. It separates out duplicate files by prefacing them with a number. For example, if you generated three files called chr1.results.txt in a folder called data, this is how you would refer to each of them:
data/chr1_results.txt
data/1/chr1_results.txt
data/2/chr1_results.txt

This name resolution can be tricky to work with.

  • It works best with good file hygiene practices. One of the confusing things is that you can create files with the same filename in the same folder (see previous point). This can easily happen if you clone a job. Because these two files have unique identifiers, they are allowed to have identical file names on the platform. This is one reason to tag each set of jobs with a run number (such as run001), so you can make sure that all of the updated files are the correct version.

6.9 Using a Docker Image with dx run/Swiss Army Knife

See the containers chapter (Chapter 10) for more information.

6.10 What you learned in this chapter

Kudos and congrats for reaching this far. We built upon our shell scripting skills (Chapter 3) and our knowledge of cloud-computing (Chapter 4) to start running jobs effectively in the cloud.