7  Shell Scripting Basics

7.1 Learning Objectives

  1. Utilize positional arguments to generalize our scripts
  2. Articulate the three streams of a command line utility
  3. Define variables for use in a bash script
  4. Iterate a script over a set of files using xargs loops
  5. Wrap executables and scripts in R/Python into a Bash script

7.2 Review of Bash scripting

Bash scripting is often referred to as a useful “glue language” on the internet. Although a lot of functionality can be covered by both JavaScript and Python, bash scripting is still very helpful to know.

We are going to cover Bash scripting because it is the main shell that is available to us on HPC machines, which are Ubuntu-based.

We will be using Bash scripts as “glue” for multiple applications in HPC computing, including:

  1. Wrapping scripts from other languages such as R or Python so we can run them using dx run on a app such as Swiss Army Knife
  2. Naming outputs according to file input names
  3. Specifying inputs and outputs in a workflow built by Workflow Description Language (WDL).

As you can see, knowing Bash is extremely helpful when running jobs on HPC.

7.3 Our first script with positional arguments

Say we have samtools installed on our own machine. Let’s start with a basic script and build from there. We’ll call it sam_run.sh. With nano, a text editor, we’ll start a very basic bash script and build its capabilities out.

#!/bin/bash/
samtools stats $1 > $2

Let’s take a look at the command that we’re running first. We’re going to run samtools stats, which will give us statistics on an incoming bam or sam file and save it in a file. We want to be able to run our script like this:

bash sam_run my_file.bam out_stats.txt

When we run it like that, sam_run.sh will run samtools stat like this:

samtools stats my_file.bam > out_stats.txt

So what’s going on here is that there is some substitution using common arguments. Let’s look at these.

7.3.1 Positional Arguments such as $1

How did the script know where to substitute each of our arguments? It has to do with the argument variables. Arguments (terms that follow our command) are indexed starting with the number 1. We can access the value at the first position using the special variable $1.

Note that this works even in quotes.

So, to unpack our script, we are substituting our first argument for the $1, and our second argument for the $2 in our script.

Test yourself

How would we rewrite sam_run.sh if we wanted to specify the output file as the first argument and the bam file as the second argument?

#!/bin/bash/
samtools stats $1 > $2

For this script, we would switch the positions of $1 and $2.

#!/bin/bash/
samtools stats $2 > $1

And we would run sam_run.sh like this:

bash sam_run.sh my_file.bam out_stats.txt

7.4 Using pipes: STDIN, STDOUT, STDERR

We will need to use pipes to chain our commands together. Specifically, we need to take a command that generates a list of files on the cluster shared filesystem, and then spawns individual jobs to process each file. For this reason, understanding a little bit more about how pipes (|) work in Bash is helpful.

If we want to understand how to chain our scripts together into a pipeline, it is helpful to know about the different streams that are available to the utilities.

graph LR
  A(STDIN) --> E[run_samtools.sh]
  E --> B(STDOUT)
  E --> C(STDERR)

Figure 7.1: Inputs/outputs to a script

Every script has three streams available to it: Standard In (STDIN), Standard Out (STDOUT), and Standard Error (STDERR) (Figure 7.1).

STDIN contains information that is directed to the input of a script (usually text output via STDOUT from another script).

Why do these matter? To work in a Unix pipeline, a script must be able to utilize STDIN, and generate STDOUT, and STDERR.

Specifically, in pipelines, STDOUT of a script (here it’s run_samtools) is directed into STDIN of another command (here wc, or word count)

graph LR
  E[run_samtools.sh] --> B(STDOUT)
  B --> F{"|"}
  E --> C(STDERR)
  F --> D("STDIN (wc)")
  D --> G[wc]

Figure 7.2: Piping a script run_samtools.sh into another command (wc)

We will mostly use STDOUT in our bash scripts, but STDERR can be really helpful in debugging what’s going wrong.

Why this is important on the Cluster

We’ll use pipes and pipelines not only in starting a bunch of jobs using batch scripting on our home computer, but also when we are processing files within a job.

7.4.1 For more info about pipes and pipelines

https://swcarpentry.github.io/shell-novice/04-pipefilter/index.html https://datascienceatthecommandline.com/2e/chapter-2-getting-started.html?q=stdin#combining-command-line-tools

7.5 Batch Processing Basics: Iterating using xargs

A really common pattern is taking a delimited list of files and doing something with them. We can do some useful things such as seeing the first few lines of a set of files, or doing some sort of processing with the set of jobs.

Don’t xargs for HPC jobs

You might be tempted to use xargs with srun to work on a bunch of files. It’s worth trying once so you can see the mechanics of how jobs are processed.

In general, I don’t recommend it in practice because if you spawn 1000 jobs using xargs, there’s no real mechanism to terminate that 1000 jobs, except one by one. With sbatch, all your jobs in batch mode run as subjobs, which means you can terminate the parent job to terminate all of the subjobs.

Again, this is a good reason to use a workflow runner in your day to day work. You don’t have to worry about jobs and subjobs. It takes a little setup, but it will make your life easier in general.

Let’s start out with a list of files:

source ~/.bashrc #| hide_line
ls data/*.sh
data/batch-on-worker.sh

Now we have a list of files, let’s look at the first few lines of each of them, and print a separator --- for each.

#| filename: scripting-basics/xargs_example.sh
source ~/.bashrc #| hide_line
ls data/*.sh | xargs -I% sh -c 'head %; echo "\n---\n"'
#!/bash/bin

cmd_to_run="ls *.vcf.gz | xargs -I% sh -c "bcftools stats % > %.stats.txt"

dx run swiss-army-knife \
  -iin="data/chr1.vcf.gz" \
  -iin="data/chr2.vcf.gz" \
  -iin="data/chr3.vcf.gz" \
  -icmd=${cmd_to_run}
---
dx find data --name "*.bam" --brief
---

Let’s take this apart piece by piece.

xargs takes an -I argument that specifies a placeholder. In our case, we are using % as our placeholder in this statement.

We’re passing on each filename from ls into the following code:

sh -c 'head %; echo "---\n"'

The sh -c opens a subshell so that we can execute our command for each of the files in our list. We’re using sh -c to run:

'head %; echo "---\n"'

So for our first file, 01-scripting-basics.qmd, we are substituting that for % in our command:

'head hpc-basics.qmd; echo "---\n"'

For our second file, hpc-basics.qmd, we would substitute that for the %:

'head hpc-basics.qmd; echo "---\n"'

Until we cycle through all of the files in our list.

7.5.1 The Basic xargs pattern

graph LR
  A["ls *.bam"] --> B{"|"} 
  B --> C["xargs -I% sh -c"] 
  C --> D["command_to_run %"]

Figure 7.3: Basics of using xargs to iterate on a list of files

As you cycle through lists of files, keep in mind this basic pattern (Figure 7.3):

ls <wildcard> | xargs -I% sh -c "<command to run> %"
Test Yourself

How would we modify the below code to do the following?

  1. List only .json files in our data/ folder using ls
  2. Use tail instead of head
ls *.txt | xargs -I% sh -c "head %; echo '---\n'"
ls data/*.json | xargs -I% sh -c "tail %; echo '---\n'"
Why this is important on HPC

We can use xargs to execute small batch jobs on a small number of files. This especially becomes powerful on the cluster when we use ls to list files in our HPC project.

Note that as we graduate to workflow tools like WDL/Nextflow, there are other mechanisms for running jobs on multiple files (such as WDL/Cromwell) that we should move to.

Trust me; you don’t want to have to handle iterating through a huge directory and handling when routines give an error, or your jobs get interrupted. Rerunning and resuming failed jobs are what workflow runner tools excel at.

7.5.2 For more information

https://www.baeldung.com/linux/xargs-multiple-arguments

7.6 Variables in Bash Scripts

We’ve already encountered a placeholder variable, %, that we used in running xargs. Let’s talk about declaring variables in bash scripts and using them using variable expansion.

In Bash, we can declare a variable by using <variable_name>=<value>. Note there are no spaces between the variable (my_variable), equals sign, and the value ("ggplot2").

my_variable="ggplot2"

echo "My favorite R package is ${my_variable}"
My favorite R package is ggplot2

Take a look at line 3 above. We expand the variable (that is, we substitute the actual variable) by using ${my_variable} in our echo statement.

In general, when expanding a variable in a quoted string, it is better to use ${my_variable} (the variable name in curly brackets). This is especially important when using the variable name as part of a string:

my_var="chr1"
echo "${my_var}_1.vcf.gz"
chr1_1.vcf.gz

If we didn’t use the braces here, like this:

echo "$my_var_1.vcf.gz"

Bash would look for the variable $my_var_1, which doesn’t exist. So use the curly braces {} when you expand variables. It’s safer overall.

There is an alternate method for variable expansion which we will use when we call a sub-shell - a shell within a shell, much like in our xargs command above. We need to use parentheses () to expand them within the sub-shell, but not the top-shell. We’ll use this when we process multiple files within a single worker.

7.6.1 basename can be very handy when on workers

If we are processing a bunch of files on a worker, we need a way to get the bare filename from a dxfuse path. We will take advantage of this when we run process multiple files on the worker.

For example:

basename /mnt/project/worker_scripts/srun-script.sh

This will return:

srun-script.sh

Which can be really handy when we name our outputs. This command is so handy it is used in WDL.

7.7 Quoting and Escaping Filenames in Bash

One point of confusion is when do you quote things in Bash? When do you use single quotes (') versus double-quotes (")? When do you use \ to escape characters?

Let’s talk about some quoting rules in Bash. I’ve tried to make things as simplified and generalized as possible, rather than stating all of the rules for each quote.

  1. If you have spaces in a filename, use double quotes ("chr 1.bam")
  2. If you have a single quote in the filename, use double quotes to wrap it ("ted's file.bam")
  3. Only escape characters when necessary - if you can solve a problem with quotes, use them
  4. If you need to preserve an escaped character, use single quotes

Let’s go over each of these with an example.

7.7.1 If you have spaces in a filename, use double quotes (Most common)

For example, if your filename is chr 1 file.bam, then use double quotes in your argument

samtools view -c "chr 1 file.bam"

7.7.2 If you have a single quote in the name, use double quotes to wrap it (less common)

Say you have a file called ted's new file.bam. This can be a problem when you are calling it, especially because of the single quote.

In this case, you can do this:

samtools view -c "ted's new file.bam"

7.7.3 Only escape characters when necessary (less common)

There are a number of special characters (such as Tab, and Newline) that can be specified as escape characters. In double quotes, characters such as $ are signals to Bash to expand or evaluate code.

Say that someone had a $ in their file name such as Thi$file is money.bam

How do we refer to it? We can escape the character with a backslash \:

samtools view -c "Thi\$file is money.bam"

The backslash is a clue to Bash that we don’t want variable expansion in this case. Without it, bash would look for a variable called $file.

7.7.4 If you need to preserve an escaped character, use single quotes (least common)

This is rarely used, but if you need to keep an escaped character in your filename, you can use single quotes. Say we have a filename called Thi\$file.bam and you need that backslash in the file name (btw, please don’t do this), you can use single quotes to preserve that backslash:

samtools view -c 'Thi\$file.bam'

Again, hopefully you won’t need this.

7.7.5 For More Info

https://www.grymoire.com/Unix/Quote.html#uh-3

What about backticks?

Backticks (`) are an old way to do command evaluation in Bash. For example, if we run the following on the command-line:

echo "there are `ls -l | wc -l` files in this directory"

Will produce:

there are       36 files in this directory

Their use is deprecated, so you should be using $() in your command evaluations instead:

echo "there are $(ls -l | wc -l) files in this directory"
What about X use case?

There are a lot of rules for Bash variable expansion and quoting that I don’t cover here. I try to show you a way to do things that work in multiple situations on the cluster.

That’s why I focus on double quotes for filenames and ${} for variable expansion in general. They will work whether your Bash script is on the command line or in an App, or in WDL.

7.8 What you learned in this chapter

Whew, this was a whirlwind tour. Keep this chapter in mind when you’re working on a cluster - the bash programming patterns will serve you well. We’ll refer to these patterns a lot when we get to doing more bioinformatics tasks on the cluster.

  • Setting up bash scripts with positional arguments
  • Iterating over a list of files using xargs
  • How to use bash variables and variable expansions