7 Batch Processing on the Cloud
Now we’re prepared for the big one: batch processing on the DNAnexus platform. All of the shell and DNAnexus skills we’ve learned will be leveraged in this chapter.
Make sure you are logged into the platform using dx login and that your course project is selected with dx select.
In your shell (either on your machine or in binder), make sure you’re in the bash_bioinfo_scripts/batch-processing/ folder:
cd batch-processing/
7.1 Learning Objectives
- Utilize dx find data to find data files on the platform to batch process.
- Iterate over files using Bash scripting and xargs on the platform to batch process them within a DNAnexus project.
- Leverage dxFUSE to simplify your bash scripts.
- Utilize dx generate_batch_inputs / dx run --batch-tsv to batch process files.
- Utilize Python to batch process multiple files per worker.
7.2 Two Ways of Batching
We actually have two methods of batching jobs using Swiss Army Knife:
- Use xargs on our home system to run dx run statements for each file (Figure 7.1).
- Submit an array of files as an input to Swiss Army Knife, then process each file using the icmd input (Figure 7.2).
Both of these methods can potentially be useful.
7.3 Finding files using dx find data
dx find data is a command that is extremely helpful on the DNAnexus platform. Based on metadata and folder paths, dx find data will return a list of files that meet the criteria.
dx find data lets you search on the following types of metadata:
- tags (--tag)
- properties (--property)
- name (--name)
- type (--type)
It can output in a number of different formats, including:
- --brief: return only the file-ids
- --json: return file information in JSON format
- --verbose: this is the default setting
- --delimited: return as a delimited text file
Of all of these, --brief and --json are the most useful for automation. --delimited is also helpful, but there is also a utility called dx generate_batch_inputs that will let us specify multiple inputs to process line by line.
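As a small illustration of why --brief suits automation, each line of its output can be split with plain shell parameter expansion. The sketch below assumes that a --brief line has the form project-ID:file-ID, and it uses the IDs from the example job output shown later in this chapter as a stand-in for real dx find data output.

```shell
# Stand-in for one line of `dx find data --brief` output.
# Assumption: each line looks like <project-id>:<file-id>.
brief_line="project-GGyyqvj0yp6B82ZZ9y23Zf6q:file-BZ9YGpj0x05xKxZ42QPqZkJY"

# Strip everything from the first ":" onward to keep the project id.
project_id="${brief_line%%:*}"
# Strip everything up to the last ":" to keep the file id.
file_id="${brief_line##*:}"

echo "project: ${project_id}"
echo "file:    ${file_id}"
```

No extra tools are needed for this, which keeps one-liners built on --brief short.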
7.4 Helpful dx find data examples
As we’re starting off in our batch processing journey, I wanted to provide some helpful recipes for selecting files.
7.4.1 Find all *.bam files in a project
You can use wildcard characters with the --name flag. Here, we’re looking for any file whose name ends in “.bam”.
#| eval: false
#| filename: batch-processing/dx-find-data-name.sh
dx find data --name "*.bam" --brief
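The quotes around the pattern matter. Unquoted, your local shell would try to expand *.bam against files in your current directory before dx ever sees it. The sketch below shows the difference using a temporary directory and two made-up file names; it does not call dx at all.

```shell
# Create a scratch directory with two hypothetical local BAM files.
workdir=$(mktemp -d)
touch "$workdir/a.bam" "$workdir/b.bam"

# Unquoted: the local shell expands the glob into matching file names.
unquoted=$(cd "$workdir" && echo *.bam)
# Quoted: the pattern is passed through untouched, which is what dx needs.
quoted=$(cd "$workdir" && echo "*.bam")

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

# Clean up the scratch directory.
rm -r "$workdir"
```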
7.4.2 Searching within a folder
You can add the --path option to search in a specific folder.
#| eval: false
#| filename: batch-processing/dx-find-path.sh
dx find data --name "*.bam" --path "data/"
7.4.3 Find all files with a field id
Take advantage of metadata associated with files when you can. If you are on UKB RAP, one of the most helpful properties to search is field_id.
Note: be careful with this one, especially if you are working on UK Biobank RAP. You don’t want to return 500,000 file ids. I would concentrate on the field ids that are aggregated on the population level, such as the pVCF files.
#| eval: false
#| filename: batch-processing/dx-find-data-field.sh
dx find data --property field_id="23148" --brief
7.4.4 Find all files that are of class file
There are a number of different object classes on the platform, such as file or applet. Search for all objects in your project that have a file class.
#| eval: false
#| filename: batch-processing/dx-find-data-class.sh
dx find data --class file --brief
7.4.5 In General: Think about leveraging metadata
In general, think about leveraging metadata that is attached to your files.
For example, for the UKB Research Analysis Platform, data files in the Bulk/ folder in your project have multiple properties: field_id (the data field as specified by UK Biobank) and eid.
7.5 Using xargs to Batch Multiple Files
Ok, now we have a list of files from dx find data that meet our criteria. How can we process them one by one?
Remember our discussion of xargs (Section 3.5)? This is where xargs shines: when you provide it a list of files.
Remember, a really useful pattern for xargs is using it for variable expansion and starting a subshell to process individual files.
#| eval: false
#| filename: batch-processing/dx-find-xargs.sh
dx find data --name "*.bam" --brief | \
  xargs -I % sh -c 'dx run app-swiss-army-knife -y -iin="%" \
  -icmd="samtools view -c \${in_name} > \${in_prefix}-counts.txt" \
  --tag samjob --destination results/'
The key piece of code, where the variable expansion happens, is here:
#| eval: false
sh -c 'dx run app-swiss-army-knife -iin="%" \
-icmd="samtools view -c \${in_name} > \${in_prefix}-counts.txt" \
--tag samjob --destination results/'
We’re using sh -c to run a script as a subshell to execute the dx run statement.
Note that we’re specifying the helper variables here with a \:
\${in_name}
This escaping (\$) of the dollar sign prevents the variable expansion from happening on our own machine. Neither the top-level shell nor the sh -c subshell should expand the helper variable; its name needs to be passed through the subshell, which passes it on to the worker. Figuring this out took time and made my brain hurt.
This escaping is only necessary because we’re using xargs and passing our -icmd input into the worker. For the most part, you won’t need to escape the $. This is also a reason to write shell scripts that run on the worker.
When we run this command, we get the following screen output:
Using input JSON:
{
"cmd": "samtools view -c $in_name > $in_prefix-counts.txt",
"in": [
{
"$dnanexus_link": {
"project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q",
"id": "file-BZ9YGpj0x05xKxZ42QPqZkJY"
}
}
]
}
Calling app-GFxJgVj9Q0qQFykQ8X27768Y with output destination
project-GGyyqvj0yp6B82ZZ9y23Zf6q:/results
Job ID: job-GJ2xVZ80yp62X5Z51qp191Y8
[more job info]
If we run dx find jobs, we’ll see our jobs listed. Hopefully they are running:
dx find jobs --tag samjob
* Swiss Army Knife (swiss-army-knife:main) (running) job-GJ2xVf00yp62kx9Z8VK10vpQ
tladeras 2022-10-11 13:57:59 (runtime 0:01:49)
* Swiss Army Knife (swiss-army-knife:main) (running) job-GJ2xVb80yp6KjQpxFJJBzv5k
tladeras 2022-10-11 13:57:57 (runtime 0:00:52)
* Swiss Army Knife (swiss-army-knife:main) (runnable) job-GJ2xVZj0yp6FFFXG11j6YJ9V
tladeras 2022-10-11 13:57:55 (runtime 0:01:15)
* Swiss Army Knife (swiss-army-knife:main) (runnable) job-GJ2xVZ80yp62X5Z51qp191Y8
tladeras 2022-10-11 13:57:53 (runtime 0:00:56)
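Because the quoting is easy to get wrong, it can help to preview the commands before submitting anything. One way to sketch this is to put echo in front of dx run, so that xargs prints each command instead of running it. The two project:file IDs below are made up and stand in for real dx find data --brief output.

```shell
# Hypothetical IDs standing in for `dx find data --name "*.bam" --brief`.
preview=$(printf '%s\n' "project-AAAA:file-0001" "project-AAAA:file-0002" | \
  xargs -I % sh -c 'echo dx run app-swiss-army-knife -y -iin="%" \
  -icmd="samtools view -c \${in_name} > \${in_prefix}-counts.txt" \
  --tag samjob --destination results/')

# One line per job that would be submitted. echo drops the quotes, but
# ${in_name} and ${in_prefix} should still appear literally (unexpanded).
echo "$preview"
```

Once the printed commands look right, removing echo submits the jobs for real.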
7.5.1 When batching, tag your jobs
It is critical that you tag your jobs in your dx run code with the --tag argument.
Why? You will at some point start up a bunch of batch jobs that might have some settings/parameters that were set wrong. That’s when you need the tag: it lets you find every job from that batch at once.
#| eval: false
dx find jobs --tag "samjob"
7.5.2 Using tags to dx terminate jobs
dx terminate <jobid> will terminate a running job with that job id. It doesn’t take a tag as input.
But again, xargs to the rescue. We can find our job ids with the tag samjob using dx find jobs and then pipe the --brief output into xargs to terminate each job id.
#| eval: false
dx find jobs --tag samjob --brief | xargs -I% sh -c "dx terminate %"
7.6 Submitting Multiple Files to a Single Worker
We talked about another method to batch process files on a worker (Figure 7.2). We can submit an array of files to a worker, and then process them one at a time on the worker.
The key is that we’re running xargs on the worker, not on our own machine, to process each file.
#| eval: false
#| filename: batch-processing/batch-on-worker.sh
# Command to run on the worker; \$ keeps the local shell from expanding $(...)
cmd_to_run="ls *.vcf.gz | xargs -I% sh -c 'bcftools stats % > \$(basename %).stats.txt'"
dx run swiss-army-knife \
  -iin="data/chr1.vcf.gz" \
  -iin="data/chr2.vcf.gz" \
  -iin="data/chr3.vcf.gz" \
  -icmd="${cmd_to_run}"
In the variable $cmd_to_run, we’re putting a command that we’ll run on the worker. That command is:
#| eval: false
ls *.vcf.gz | xargs -I% sh -c "bcftools stats % > \$(basename %).stats.txt"
We submitted an array of files in our dx run statement, so they are transferred into our working directory on the worker. We can then list the files using ls *.vcf.gz and pipe that list into xargs.
Note that we lose the ability to use helper variables in our script when we process a list of files on the worker. So here we have to use \$(basename %), because $( ) performs command substitution: it runs basename in a subshell and inserts its output. We escape the $ so that bash performs the substitution on the worker rather than on our own machine.
Again, this is possible, but it may be easier to have a separate script that contains our commands, transfer that as an input to Swiss Army Knife, and run that script by specifying bash myscript.sh in our command.
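A minimal sketch of that separate-script approach is below. The script name myscript.sh comes from the text above; the loop inside it is an assumption about what such a script might contain, mirroring the bcftools stats command from this section. The dx commands in the trailing comments are only an outline of how the script could be passed along as one more input, and they are not run here.

```shell
# Write a hypothetical worker-side script. The quoted 'EOF' keeps the local
# shell from expanding anything inside the here-document.
cat > myscript.sh <<'EOF'
#!/bin/bash
# Runs on the worker: process every VCF that was passed in as an input.
for vcf in *.vcf.gz; do
  # Skip the literal pattern if no files matched.
  [ -e "$vcf" ] || continue
  bcftools stats "$vcf" > "${vcf}.stats.txt"
done
EOF

# Check the script for syntax errors locally without running it.
bash -n myscript.sh && echo "myscript.sh: syntax ok"

# Outline only (not run here): upload the script, then pass it as an
# input next to the data files and call it from -icmd.
#   dx upload myscript.sh
#   dx run swiss-army-knife -iin="myscript.sh" -iin="data/chr1.vcf.gz" \
#     -icmd="bash myscript.sh"
```

Because the script is only interpreted on the worker, none of the $ signs inside it need escaping.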
7.7 Batching multiple inputs: dx generate_batch_inputs
What if you have multiple inputs that you need to batch with? This is where dx generate_batch_inputs comes in.
For each input for an app, we can specify a regular expression that matches the names of the files to use.
#| eval: false
dx generate_batch_inputs \
  --path "data/" \
  -iin='(.*)\.bam$'
Here we’re specifying a single input, in, and we’ve supplied a pattern to search with. It’s going to look in data/ for file names that match this pattern (we’re looking for bam files).
If we do this, we’ll get the following response:
Found 4 valid batch IDs matching desired pattern.
Created batch file dx_batch.0000.tsv
So, there is 1 .tsv file that was generated by dx generate_batch_inputs on our machine.
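To get a feel for the pattern, you can try it locally against some file names. The sketch below uses sed with the same regular expression on three made-up names. It assumes that the captured group (the part in parentheses) is what dx generate_batch_inputs uses as the batch ID for each match.

```shell
# Hypothetical file names; only those ending in ".bam" should match.
batch_ids=$(printf '%s\n' "NA12878.bam" "NA12878.bam.bai" "sample2.bam" | \
  sed -n -E 's/^(.*)\.bam$/\1/p')

# Prints the captured prefix of each matching name, one per line.
echo "$batch_ids"
```

The index file NA12878.bam.bai is skipped because the $ anchors the match to the end of the name.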
If we have many more input files, say 3000 files, it would generate 3 .tsv files. Each of these .tsv files contains about 1000 lines, one per file. We can run these individual jobs with:
#| eval: false
dx run swiss-army-knife --batch-tsv dx_batch.0000.tsv \
-icmd='samtools stats ${in_name} > ${in_prefix}.stats.txt ' \
--destination "/Results/" \
--detach --allow-ssh \
--tag bigjob
This will generate 4 jobs from the dx_batch.0000.tsv file to process the individual files. Each .tsv file will generate up to 1000 jobs.
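Since each row of a batch file becomes one job, counting rows tells you how many jobs a --batch-tsv run would launch. The sketch below builds a small made-up file (sample_batch.tsv, with invented IDs and illustrative column names) rather than reading a real one, and it assumes the first row is a header.

```shell
# Build a made-up batch file. Assumed layout: a header row, then one
# tab-separated row per batch ID. Real files come from dx generate_batch_inputs.
printf 'batch ID\tin\tin ID\n' > sample_batch.tsv
printf 'sample1\tsample1.bam\tfile-0001\n' >> sample_batch.tsv
printf 'sample2\tsample2.bam\tfile-0002\n' >> sample_batch.tsv

# Skip the header and count the remaining rows: one job per row.
n_jobs=$(tail -n +2 sample_batch.tsv | wc -l | tr -d ' ')
echo "jobs that would be launched: ${n_jobs}"
```

Running the same count on a real dx_batch file before dx run is a cheap sanity check against launching far more jobs than intended.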
7.7.1 Drawbacks to dx generate_batch_inputs / dx run --batch-tsv
The largest drawback to using dx generate_batch_inputs is that each column must correspond to an individual input name, so you can’t submit an array of files to a job this way.
7.7.2 For More Information
The Batch Jobs documentation page has some good code examples for dx generate_batch_inputs: https://documentation.dnanexus.com/user/running-apps-and-workflows/running-batch-jobs/
7.8 Programmatically Submitting Arrays of Files for a Job
You can also use Python to build dx run statements, which is especially helpful when you want to submit arrays of 100+ files to a worker.
See https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/guide-to-analyzing-large-sample-sets for more info.
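The linked guide uses Python. The example here sketches the same idea in the shell instead: it builds one -iin argument per file and prints the resulting dx run command rather than submitting it. The file IDs, the -icmd value, and the arrayjob tag are placeholders.

```shell
# Placeholder file IDs; in practice these could come from dx find data --brief.
file_ids="file-0001 file-0002 file-0003"

# Collect one -iin=<id> argument per file in the positional parameters.
# $file_ids is left unquoted on purpose so it splits on whitespace.
set --
for id in $file_ids; do
  set -- "$@" "-iin=${id}"
done
n_inputs=$#

# Preview the single job that would receive all files as an array input.
echo dx run app-swiss-army-knife "$@" -icmd="bash myscript.sh" --tag arrayjob
```

With hundreds of files, generating the arguments in a loop like this avoids typing each -iin by hand; removing echo would submit the job.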
7.9 What you learned in this chapter
This was a big chapter, and built on everything you’ve learned in the previous chapters.
We put together the output of dx find data --brief (Section 7.3) with a pipe (|), and used xargs (Section 7.5) to spawn one job per file.
Another way to process files is to submit them as an array to a single worker and process them there (Section 7.6).
We also learned of alternative approaches using dx generate_batch_inputs / dx run --batch-tsv and using Python to build the dx run statements.