Bash for Bioinformatics

Author

Ted Laderas

Published

October 17, 2022

1 Introduction

Bash scripting is an essential skill in bioinformatics that we often expect bioinformaticians to have automatically learned. I think that this underestimates the difficulty of learning and applying Bash scripting.

This book is meant to take you (a budding bioinformatician) beyond the foundational shell scripting skills taught in a course such as the Software Carpentry Shell Course.

You might also be savvy with an on-premises High Performance Computing (HPC) cluster and be wondering how to transition to working in the cloud. We have an abbreviated path for you that gets you running jobs in the cloud as quickly as possible.

Specifically, this book shows you a path to get started with reproducible cloud computing on the DNAnexus platform.

Our goal is to showcase the “glue” skills that help you do bioinformatics reproducibly in the cloud.

1.1 Learning Objectives for this Book

After reading and doing the exercises in this book, you should be able to:

  • Apply bash scripting to your own work
  • Articulate basic Cloud Computing concepts that apply to the DNAnexus platform
  • Leverage bash scripting and the dx-toolkit to execute jobs on the DNAnexus platform
  • Execute batch processing of multiple files in a project on the DNAnexus platform
  • Monitor, profile, terminate and retry jobs to optimize costs
  • Manage software dependencies reproducibly using container-based technologies such as Docker

1.2 Four Levels of Using DNAnexus

One way to approach learning DNAnexus is to think about the skills you need to process a number of files. Ben Busby has noted there are 4 main skill levels in processing files on the DNAnexus platform:

Level   # of Files              Skill
1       1                       Interactive Analysis (Cloud Workstation, JupyterLab)
2       1-50                    dx run, Swiss Army Knife
3       50-1000                 Building your own apps
4       1000+, multiple steps   Using WDL (Workflow Description Language)

We’ll mostly be covering Level 2, but you will come away with the skills to move on to Level 3.

The key is to gradually build on your skills.
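
To give a rough sense of what Level 2 looks like in practice, here is a minimal dry-run sketch that builds one dx run command per file for the Swiss Army Knife app. The file names are hypothetical placeholders, and the samtools command is only an illustrative choice. The script prints the commands rather than submitting them, so it can be tried without a DNAnexus login.

```bash
#!/usr/bin/env bash
# Dry-run sketch of Level 2: one Swiss Army Knife job per input file.
# Assumptions (not from the book): the file names are hypothetical, and
# "samtools index" is just an example command to run on each file.
set -euo pipefail

# Build the dx run command for a single file and print it.
# -iin and -icmd set the app's "in" and "cmd" inputs; -y skips the
# interactive confirmation prompt.
build_cmd() {
  local file="$1"
  printf 'dx run app-swiss-army-knife -iin="%s" -icmd="samtools index %s" -y\n' \
    "${file}" "${file}"
}

# Hypothetical list of files already present in the current project.
files=("sample1.bam" "sample2.bam" "sample3.bam")

# Print one command per file; nothing is submitted.
for file in "${files[@]}"; do
  build_cmd "${file}"
done
```

Printing the commands first is a deliberate choice: it lets you review exactly what would be submitted (and what you would be billed for) before running anything. Once the output looks right, the printed lines could be executed in a terminal where the dx-toolkit is installed and you are logged in.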

1.3 What is not covered

  • Using Bash scripting in DNAnexus Apps and Workflows
  • Using Bash Scripting in Workflow Description Language (WDL)

As mentioned, these are advanced topics. However, this book will provide an excellent foundation for building apps and workflows effectively on the DNAnexus platform.

This book is not meant to be a substitute for excellent books such as Data Science on the Command Line. This book focuses on the essential Bash shell skills that will help you on the DNAnexus platform.

1.4 Notes

This is a very opinionated journey through Bash shell scripting, workflow languages, and reproducibility. It is written from the perspective of a user and should not be considered official DNAnexus documentation.

The book is designed to build on each concept gradually. Where possible, we link to the official DNAnexus documentation, which this book is not meant to replace.

At each step, you’ll be able to do useful things with your data. We will focus on skills and programming patterns that are useful.

1.5 Prerequisites

Before you tackle this book, you should be able to accomplish the following:

We recommend reviewing a course such as the Software Carpentry course for Shell Scripting before getting started with this book. The Missing Semester of Your CS Education is another great introductory resource.

1.6 About Me

My name is Ted Laderas. I have been an instructor and trainer for over 10 years, making complex technologies understandable to my students and to a variety of other audiences.

I am available for consulting for your training needs, whether it be for software you’ve developed or upskilling people at your organization. I specialize in the following topics:

  • Reproducible Bioinformatics Workflows
  • Data Science for Executives/Clinicians
  • Shiny and Interactive Dashboards
  • R Data Science Programming
  • Data Visualization/Storytelling
  • Python Data Science Programming

For more information, please see https://laderast.github.io/

1.7 Contributors

No one writes a book alone. This book comes from a lot of conversations with everyone at DNAnexus, including:

  • Allison Regier
  • Ben Busby
  • Anastazie Sedlakova
  • Scott Funkhouser
  • Stanley Lan
  • Ondrej Klempir
  • Branislav Slavik
  • David Stanek
  • Chai Fungtammasan

Thanks to the following readers for their corrections:

  • Joshua Shapiro (found errors in variable expansion text)
  • Alexander Moersburg (found error in cloud computing section)

1.8 Want to be a Contributor?

This is the first draft of this book. It’s not going to be perfect, and we need help. Specifically, we need help with testing the setup and the exercises.

If you have a problem, you can file it as an issue using this link.

In your issue, please note the following:

  • Your Name
  • What your issue was
  • Which section and line you found problematic or that wouldn’t run

If you’re Quarto/GitHub savvy, you can fork the repository and file a pull request for typos and edits. If you’re not, you can file an issue.

Just be aware that this is not my primary job - I’ll try to be as responsive as I can.

1.9 License

Bash for Bioinformatics by Ted Laderas is licensed under a Creative Commons Attribution 4.0 International License.
Based on a work at https://github.com/laderast/bash_for_bioinformatics.