7 Batch Processing on the Cloud

Now we’re prepared for the big one: batch processing on the DNAnexus platform. All of the shell and DNAnexus skills we’ve learned will be leveraged in this chapter.

Prep for Exercises

Make sure you are logged into the platform using dx login and that your course project is selected with dx select.

In your shell (either on your machine or in binder), make sure you’re in the bash_bioinfo_scripts/batch-processing/ folder:

cd batch-processing/

7.1 Learning Objectives

Utilize dx find data to find data files on the platform to batch process.
Iterate over files using Bash scripting and xargs on the platform to batch process them within a DNAnexus project.
Leverage dxFUSE to simplify your bash scripts.
Utilize dx generate_batch_inputs/dx run --batch-tsv to batch process files.
Utilize Python to batch process multiple files per worker.

7.2 Two Ways of Batching

graph LR;
  A[List files </br> using `dx find data`] --> F{"|"}
  F --> E[`xargs` sh -c]
  E --> B[`dx run` </br> on file1];
  E --> C[`dx run` </br> on file2];
  E --> D[`dx run` </br> on file3];

Figure 7.1: Batch method 1. We list files and then pipe them into xargs, which generates individual dx run statements.

graph LR;
  A[Submit array </br> of files </br> in `dx run`] --> B[Loop over array </br> of files </br> in worker];

Figure 7.2: Batch method 2. We first get our files onto the worker through a single dx run command, and then use xargs on the worker to cycle through them.

We actually have two methods of batching jobs using Swiss Army Knife:

Use xargs on our home system to run dx run statements for each file (Figure 7.1).
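Submit an array of files as an input to Swiss Army Knife, then process each file using the icmd input (Figure 7.2).

Both methods are useful. The first starts one job per file, so the files are processed in parallel on separate workers; the second processes many files on a single worker.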

7.3 Finding files using `dx find data`

dx find data is a command that is extremely helpful on the DNAnexus platform. Based on metadata and folder paths, dx find data will return a list of files that meet the criteria.

dx find data lets you search on the following types of metadata:

tags --tag
properties --property
name --name
type --type

It can output in a number of different formats, including:

--brief - return only the file-ids
--json - return file information in JSON format
--verbose - this is the default setting
--delim - return fields separated by a delimiter (tab by default)

Of all of these, --brief and --json are the most useful for automation. --delim is also helpful, but there is also a utility called dx generate_batch_inputs that will let us specify multiple inputs to process line by line.
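
When scripting against --brief output, it helps to know its shape: each line pairs a project ID with a file ID, separated by a colon. Here is a minimal sketch of splitting such a line with plain shell parameter expansion. The sample line is assembled from the IDs shown in the job output later in this chapter; it is not fetched live.

```shell
# Sample line in the project-ID:file-ID shape that `dx find data --brief` prints.
brief_line="project-GGyyqvj0yp6B82ZZ9y23Zf6q:file-BZ9YGpj0x05xKxZ42QPqZkJY"

# Everything before the first colon is the project ID.
project_id="${brief_line%%:*}"
# Everything after the last colon is the file ID.
file_id="${brief_line##*:}"

echo "project: ${project_id}"
echo "file:    ${file_id}"
```

Either the full line or the bare file ID can be used as an -iin value, as long as the right project is selected.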

7.4 Helpful `dx find data` examples

As we’re starting off in our batch processing journey, I wanted to provide some helpful recipes for selecting files.

7.4.1 Find all *.bam files in a project

You can use wildcard characters with the --name flag. Here, we’re looking for any file whose name ends in “.bam”.

#| eval: false
#| filename: batch-processing/dx-find-data-name.sh
dx find data --name "*.bam" --brief
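
The quotes around the pattern matter. Here is a small local demonstration of why, using echo in place of dx so that nothing touches the platform. The throwaway directory and file names are made up for illustration.

```shell
# Work in a throwaway directory that contains two local .bam files.
demo_dir=$(mktemp -d)
touch "${demo_dir}/local1.bam" "${demo_dir}/local2.bam"

# Unquoted: the local shell expands *.bam before dx would ever see it.
unquoted=$(cd "$demo_dir" && echo --name *.bam)
# Quoted: the pattern is passed through untouched for the platform to match.
quoted=$(cd "$demo_dir" && echo --name "*.bam")

echo "unquoted: ${unquoted}"
echo "quoted:   ${quoted}"
rm -r "$demo_dir"
```

If your current folder happens to contain .bam files, the unquoted form would search for those local names instead of the wildcard.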

7.4.2 Searching within a folder

You can add the --path flag to search in a specific folder.

#| eval: false
#| filename: batch-processing/dx-find-path.sh
dx find data --name "*.bam" --path "data/"

7.4.3 Find all files with a field id

Take advantage of metadata associated with files when you can. If you are on UKB RAP, one of the most helpful properties to search is field_id.

Note: be careful with this one, especially if you are working on UK Biobank RAP. You don’t want to return 500,000 file ids. I would concentrate on the field ids that are aggregated on the population level, such as the pVCF files.

#| eval: false
#| filename: batch-processing/dx-find-data-field.sh
dx find data --property field_id="23148" --brief

7.4.4 Find all files that are of class `file`

There are a number of different object classes on the platform, such as file or applet.

Search for all objects in your project that have the file class.

#| eval: false
#| filename: batch-processing/dx-find-data-class.sh
dx find data --class file --brief

7.4.5 In General: Think about leveraging metadata

In general, think about leveraging metadata that is attached to your files.

For example, for the UKB Research Analysis Platform, data files in the Bulk/ folder in your project have multiple properties: field_id (the data field as specified by UK Biobank) and eid.

7.5 Using `xargs` to Batch Multiple Files

Ok, now we have a list of files from dx find data that meet our criteria. How can we process them one by one?

Remember our discussion of xargs? (Section 3.5) This is where xargs shines, when you provide it a list of files.

Remember, a really useful pattern for xargs is using it for variable expansion and starting a subshell to process individual files.

#| eval: false
#| filename: batch-processing/dx-find-xargs.sh
dx find data --name "*.bam" --brief | \
  xargs -I % sh -c 'dx run app-swiss-army-knife -y -iin="%" \
  -icmd="samtools view -c \${in_name} > \${in_prefix}-counts.txt" \
  --tag samjob --destination results/'

The key piece of code we’re doing the variable expansion in is here:

#| eval: false
sh -c 'dx run app-swiss-army-knife -iin="%" \
  -icmd="samtools view -c \${in_name} > \${in_prefix}-counts.txt" \
  --tag samjob --destination results/'

We’re using sh -c to run a script as a subshell to execute the dx run statement.

Note that we’re specifying the helper variables here with a \:

\${in_name}

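This escaping (\$) of the dollar sign keeps the variable from being expanded on our own machine. The outer single quotes already protect it from the top-level shell, but the subshell started by sh -c sees the -icmd value inside double quotes and would expand ${in_name} itself, to an empty string, before dx run is ever called. The helper variable names need to reach the worker unexpanded. Figuring this out took time and made my brain hurt.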

This escaping is only necessary because we’re using xargs and passing our -icmd input into the worker. For the most part, you won’t need to escape the $. This is also a reason to write shell scripts that run on the worker.
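
One way to convince yourself of this is a dry run on your own machine. The sketch below swaps the real call for echo, so no jobs are submitted, and feeds xargs two made-up file IDs (file-AAAA and file-BBBB are placeholders, not real objects).

```shell
# Hypothetical file IDs stand in for the output of `dx find data --brief`.
# `echo` replaces the real call, so nothing is submitted to the platform.
escaped=$(printf '%s\n' file-AAAA file-BBBB | \
  xargs -I % sh -c 'echo dx run app-swiss-army-knife -iin="%" -icmd="samtools view -c \${in_name}"')

# Same command without the backslash: the subshell expands ${in_name} itself.
unescaped=$(printf '%s\n' file-AAAA file-BBBB | \
  xargs -I % sh -c 'echo dx run app-swiss-army-knife -iin="%" -icmd="samtools view -c ${in_name}"')

echo "With escaping:"
echo "$escaped"
echo "Without escaping:"
echo "$unescaped"
```

In the escaped version the text ${in_name} survives into the printed command. In the unescaped version it has already been replaced by an empty string, which is what the worker would receive.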

When we run this command, we get the following screen output:

Using input JSON:
{
    "cmd": "samtools view -c $in_name > $in_prefix-counts.txt",
    "in": [
        {
            "$dnanexus_link": {
                "project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q",
                "id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
            }
        }
    ]
}

Calling app-GFxJgVj9Q0qQFykQ8X27768Y with output destination
  project-GGyyqvj0yp6B82ZZ9y23Zf6q:/results

Job ID: job-GJ2xVZ80yp62X5Z51qp191Y8

[more job info]

If we run dx find jobs, we’ll see our jobs listed. Hopefully they are running:

dx find jobs --tag samjob
* Swiss Army Knife (swiss-army-knife:main) (running) job-GJ2xVf00yp62kx9Z8VK10vpQ
  tladeras 2022-10-11 13:57:59 (runtime 0:01:49)
* Swiss Army Knife (swiss-army-knife:main) (running) job-GJ2xVb80yp6KjQpxFJJBzv5k
  tladeras 2022-10-11 13:57:57 (runtime 0:00:52)
* Swiss Army Knife (swiss-army-knife:main) (runnable) job-GJ2xVZj0yp6FFFXG11j6YJ9V
  tladeras 2022-10-11 13:57:55 (runtime 0:01:15)
* Swiss Army Knife (swiss-army-knife:main) (runnable) job-GJ2xVZ80yp62X5Z51qp191Y8
  tladeras 2022-10-11 13:57:53 (runtime 0:00:56)

7.5.1 When batching, tag your jobs

It is critical that you tag your jobs in your dx run code with the --tag argument.

Why? You will at some point start up a bunch of batch jobs that might have some settings/parameters that were set wrong. That’s when you need the tag.

#| eval: false
dx find jobs --tag "samjob"

7.5.2 Using tags to `dx terminate` jobs

dx terminate <jobid> will terminate a running job with that job id. It doesn’t take a tag as input.

But again, xargs to the rescue. We can find our job ids with the tag samjob using dx find jobs and then pipe the --brief output into xargs to terminate each job id.

#| eval: false
dx find jobs --tag samjob --brief | xargs -I% sh -c "dx terminate %"
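
Before terminating anything, it can be worth previewing which commands xargs would run. This is a minimal dry-run sketch: echo prints each command instead of executing it, and the job IDs are placeholders standing in for real dx find jobs output.

```shell
# Hypothetical job IDs stand in for `dx find jobs --tag samjob --brief`.
job_ids=$(printf '%s\n' job-AAAA job-BBBB)

# Dry run: print one `dx terminate` command per job ID instead of running it.
preview=$(echo "$job_ids" | xargs -I% echo dx terminate %)
echo "$preview"
```

Once the list looks right, removing the echo turns the preview into the real thing.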

7.6 Submitting Multiple Files to a Single Worker

We talked about another method to batch process files on a worker (Figure 7.2). We can submit an array of files to a worker, and then process them one at a time on the worker.

The key is that we’re running xargs on the worker, not on our own machine, to process each file.

#| eval: false
#| filename: batch-processing/batch-on-worker.sh
cmd_to_run="ls *.vcf.gz | xargs -I% sh -c 'bcftools stats % > \$(basename %).stats.txt'"

dx run swiss-army-knife \
  -iin="data/chr1.vcf.gz" \
  -iin="data/chr2.vcf.gz" \
  -iin="data/chr3.vcf.gz" \
  -icmd="${cmd_to_run}"

In the variable $cmd_to_run, we’re putting a command that we’ll run on the worker. That command is:

#| eval: false
ls *.vcf.gz | xargs -I% sh -c 'bcftools stats % > \$(basename %).stats.txt'

We submitted an array of files in our dx run statement, so they are transferred into our working directory on the worker. That means we can list the files using ls *.vcf.gz and pipe that list into xargs.

Note that we lose the ability to use helper variables in our script when we process a list of files on the worker. So here we have to use \$(basename %) to build each output name. The $( ) syntax is command substitution: it runs basename in a subshell and substitutes its output. We escape the $ so that the substitution is not performed on our own machine when cmd_to_run is assigned; instead, bash performs it on the worker.

Again, this is possible, but it may be easier to have a separate script that contains our commands, transfer that as an input to Swiss Army Knife, and run that script by specifying bash myscript.sh in our command.
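
Here is a sketch of what such a script might look like. The file name myscript.sh comes from the paragraph above; the function name stats_each is made up for this example. Because the script runs directly on the worker, nothing needs to be escaped.

```shell
#!/bin/bash
# myscript.sh: meant to run on the worker, in the directory holding the inputs.

# Run the given command on every .vcf.gz file in the current directory,
# writing each result to <file>.stats.txt.
stats_each() {
  for vcf in *.vcf.gz; do
    [ -e "$vcf" ] || continue   # the glob matched nothing
    "$@" "$vcf" > "${vcf}.stats.txt"
  done
}

# On the worker this calls bcftools; with no .vcf.gz files present it does nothing.
stats_each bcftools stats
```

Assuming the script has been uploaded to your project, it could be passed as one more -iin input alongside the VCF files and run with -icmd="bash myscript.sh".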

7.7 Batching multiple inputs: `dx generate_batch_inputs`

What if you have multiple inputs that you need to batch with? This is where the dx generate_batch_inputs comes in.

For each input for an app, we can specify the files to match using a regular expression.

#| eval: false
dx generate_batch_inputs \
  --path "data/" \
  -iin="(.*)\.bam$"

Here we’re specifying a single input, in, and we’ve supplied a regular expression for it. It’s going to look in data/ for files whose names match this pattern (here, anything ending in .bam).

If we do this, we’ll get the following response:

Found 4 valid batch IDs matching desired pattern.
Created batch file dx_batch.0000.tsv

So, there is 1 .tsv file that was generated by dx generate_batch_inputs on our machine.

If we have many more input files, say 3000 files, it would generate 3 .tsv files. Each of these .tsv files contains up to 1000 rows, one row per job. We can run the jobs for one .tsv file with:

#| eval: false
dx run swiss-army-knife --batch-tsv dx_batch.0000.tsv \
   -icmd='samtools stats ${in_name} > ${in_prefix}.stats.txt ' \
   --destination "/Results/" \
   --detach --allow-ssh \
   --tag bigjob

This will generate 4 jobs from the dx_batch.0000.tsv file, one per input file. Each .tsv file will generate up to 1000 jobs.
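
Before submitting, it can be reassuring to count how many jobs a batch file will start. The sketch below builds a small stand-in file rather than reading a real one. Its layout (one header row, then one tab-separated row per batch ID) and its contents are assumptions for illustration, so check them against your own dx_batch.0000.tsv.

```shell
# Stand-in for dx_batch.0000.tsv. The layout (a header row, then one
# tab-separated row per batch ID) and the IDs are assumptions for illustration.
batch_tsv=$(mktemp)
printf 'batch ID\tin\tin ID\n' > "$batch_tsv"
printf 'sample1\tsample1.bam\tfile-AAAA\n' >> "$batch_tsv"
printf 'sample2\tsample2.bam\tfile-BBBB\n' >> "$batch_tsv"

# Skip the header row and count what is left: one job per remaining row.
n_jobs=$(tail -n +2 "$batch_tsv" | wc -l | tr -d ' ')
echo "rows (jobs) in batch file: ${n_jobs}"
rm "$batch_tsv"
```

Pointing the same tail and wc commands at a real batch file gives the number of jobs that dx run --batch-tsv would launch for it.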

7.7.1 Drawbacks to `dx generate_batch_inputs`/`dx run --batch-tsv`

The largest drawback to using dx generate_batch_inputs is that each column must correspond to an individual input name - you can’t submit an array of files to a job this way.

7.7.2 For More Information

The Batch Jobs documentation page has some good code examples for dx generate_batch_inputs here: https://documentation.dnanexus.com/user/running-apps-and-workflows/running-batch-jobs/

7.8 Programmatically Submitting Arrays of Files for a job

You can also use Python to build dx run statements, which is especially helpful when you want to submit arrays of 100+ files to a worker.

See https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/guide-to-analyzing-large-sample-sets for more info.
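
The linked guide uses Python. The same idea can be roughly sketched in the shell used throughout this chapter: read a list of file IDs and assemble a single dx run command with one -iin argument per file. The function name build_run_preview and the file IDs are made up for this example, and the command is only printed, not run.

```shell
# Read file IDs on stdin and print a single dx run command that passes
# all of them to one job, one -iin= argument per ID.
build_run_preview() {
  set --                      # start with an empty argument list
  while IFS= read -r file_id; do
    [ -n "$file_id" ] || continue
    set -- "$@" "-iin=${file_id}"
  done
  # echo drops the quotes around the -icmd value, so this is for inspection only.
  echo dx run app-swiss-army-knife "$@" -icmd="bash myscript.sh"
}

# Hypothetical IDs stand in for `dx find data --brief` output.
preview=$(printf '%s\n' file-AAAA file-BBBB file-CCCC | build_run_preview)
echo "$preview"
```

For very long lists a Python script is usually easier to maintain, which is why the guide above takes that route.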

7.9 What you learned in this chapter

This was a big chapter, and built on everything you’ve learned in the previous chapters.

We put together the output of dx find data --brief (Section 7.3) with a pipe (|), and used xargs (Section 7.5) to spawn one job per file.

Another way to process files is to submit them as an array to a single worker and process them there (Section 7.6).

We also learned of alternative approaches using dx generate_batch_inputs/dx run --batch-tsv and using Python to build the dx run statements.