8 Working with JSON on the DNAnexus Platform
You will not need to log in to the platform for this chapter.
You’ll want to cd into the JSON folder in your project:
cd JSON/
You’ll also need to install jq if it’s not yet on your system. If you’re on Ubuntu/WSL, I recommend installing via apt install. If you’re on Mac, I recommend installing via brew install.
You can check if jq is already installed by typing:
which jq
8.1 Learning Objectives
By the end of this chapter, you should be able to:
- Define and explain what JSON is, including its elements and structures
- Explain how JSON is used on the DNAnexus platform
- Explain the basic structure of a JSON file
- Generate JSON output from dx find data and dx find jobs
- Execute simple jq commands to extract information from a JSON file
- Execute advanced jq filters using conditionals to process output from dx find data or dx find jobs
8.2 What is JSON?
JSON is short for JavaScript Object Notation. It is a format used for storing information on the web and for interacting with APIs.
8.3 How is JSON used on the DNAnexus Platform?
JSON is used in multiple ways on the DNAnexus Platform, including:
- Submitting jobs with complex parameters/inputs
- Specifying parameters of an app or workflow (dxapp.json and dxworkflow.json)
- Output of commands such as dx find data or dx find jobs with the --json flag
- Extracting environment variables from dx env
Underneath it all, all interactions with the DNAnexus API server are JSON submissions.
You can see that JSON is used in many places on the DNAnexus platform, and for many purposes, so having basic knowledge of JSON can be really helpful.
8.4 Elements of a JSON file
Here are the main elements of a JSON file:
- Key:Value Pair. Example: "name": "Ted Laderas". In this example, our key is “name” and our value is “Ted Laderas”.
- List [] - an ordered collection of values, enclosed in square brackets. The values do not have to be the same data type, although in practice they usually are. Example: ["mom", "dad"]
- Object {} - a collection of key/value pairs, enclosed in curly brackets ({})
Here’s the example we’re going to use. We’ll do most of our processing of JSON on our own machine.
#| eval: false
#| filename: "json_data/example.json"
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
},
"users": ["laderast", "ted", "tladeras"]
}
What does the names value contain in the following JSON? Is it a list, object or key:value pair?
{
"names": ["Ted", "Lisa", "George"]
}
It is a list. We know this because the value is enclosed in square brackets ([]).
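If you would rather check than guess, jq can report the type of a value. The following is a minimal sketch, assuming jq is installed; it pipes the same JSON in as a string and uses a filter syntax that is explained later in this chapter. Note that jq calls a list an array.

```shell
# Ask jq what type the value stored under "names" is.
names_type=$(echo '{"names": ["Ted", "Lisa", "George"]}' | jq '.names | type')

# jq prints the type as a JSON string, so the quotes are part of the output.
echo "$names_type"   # "array"
```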
8.5 Nestedness
JSON wouldn’t be very helpful if it were limited to a single level of key:value pairs. Values can be lists or objects as well. For example, in our example JSON, we can see that the value of report_html is a JSON object:
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
}
The object is:
{
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
}
When we extract information, we’ll have to keep this nested structure in mind.
8.6 Outputting JSON with dx find commands
We already encountered the dx find data command, which we used in the batch processing chapter.
If we use the --json option, then the file information will be output in JSON format. This command will return a list of JSON file objects.
For example:
#| eval: false
#| filename: 05-JSON/dx-find-data-json.sh
dx find data --path ted_demo:data/ --json
The output will look like this:
[
{
"project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q",
"id": "file-FvQGZb00bvyQXzG3250XGbgz",
"describe": {
"id": "file-FvQGZb00bvyQXzG3250XGbgz",
"project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q",
"class": "file",
"name": "small-celegans-sample.fastq",
"state": "closed",
"folder": "/json_data",
"modified": 1665003035646,
"size": 16801690
}
},
{
"project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q",
"id": "file-B5Q8z8V5g3bX5qQ9y9YQ006k",
"describe": {
"id": "file-B5Q8z8V5g3bX5qQ9y9YQ006k",
"project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q",
"class": "file",
"name": "NC_001422.fasta",
"state": "closed",
"folder": "/json_data",
"modified": 1665003035645,
"size": 5539
}
}
]
What is returned when we run this code? Is it a JSON object, or a list of JSON objects?
#| eval: false
#| filename: 05-JSON/dx-find-jobs-json.sh
dx find jobs --json
It’s hard to tell at first, but we are returning a list of JSON objects, each of which corresponds to a single job run within our project.
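One way to answer this kind of question without reading the whole output is to ask jq for the top-level type and the number of elements. This is a small sketch that uses a trimmed, hand-written stand-in for the dx find output (the listing variable is an assumption for illustration), so it runs without logging in to the platform.

```shell
# A trimmed, hand-written stand-in for the output of dx find data --json.
listing='[
  {"id": "file-FvQGZb00bvyQXzG3250XGbgz", "describe": {"name": "small-celegans-sample.fastq"}},
  {"id": "file-B5Q8z8V5g3bX5qQ9y9YQ006k", "describe": {"name": "NC_001422.fasta"}}
]'

# type reports whether the top level is an array (list) or an object.
top_type=$(echo "$listing" | jq -r 'type')

# length counts the elements in the list.
n_items=$(echo "$listing" | jq 'length')

echo "$top_type $n_items"   # array 2
```

With real output, you could pipe dx find jobs --json into the same two filters.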
8.7 Learning jq gradually
As you can see, JSON can be complicated to process and extract information from, depending on how many levels deep you go in a JSON document. That’s why jq exists.
jq is a utility made to process JSON. All jq commands have this format:
#| eval: false
jq '<filter>' <JSON file>
Filters are the heart of processing data using jq. They let you extract JSON values or keys and process them with conditionals to filter data down. For example, you can do something like the following:
- Select all elements where the job status is failed
- For each of these elements, output the job ID
You can see how jq can be extremely powerful.
You can also pipe JSON from standard output into jq. This will be really helpful when we start piping the output of dx find data into jq.
8.8 Our simplest filter: .
One of the biggest uses for jq is making JSON more readable. Oftentimes, the JSON returned by an API call comes back as a single line of text, which makes it really hard for humans to see the actual structure of the response.
If we run jq '.' on a JSON file, we’ll see that it makes it much more readable.
#| eval: false
#| filename: JSON/jq-simple.sh
jq '.' json_data/example.json
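To see the effect on single-line input, here is a short sketch that pipes a one-line JSON string through the same filter. The string is made up for illustration (it is a shortened version of the example file), and it is piped in from standard output instead of being read from a file.

```shell
# A one-line JSON string, as an API might return it (made-up content).
compact='{"report_html":{"dnanexus_link":"file-G4x7GX80VBzQy64k4jzgjqgY"},"users":["laderast","ted"]}'

# Pipe it through the identity filter to get indented output,
# with each key and each list element on its own line.
pretty=$(echo "$compact" | jq '.')
echo "$pretty"

# Going the other way, -c (compact) puts everything back on one line.
roundtrip=$(echo "$pretty" | jq -c '.')
```

The content is unchanged in both directions; only the whitespace differs.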
8.9 Getting the keys
We can extract the keys from the top level of the JSON by using 'keys' as our filter.
#| eval: false
#| filename: JSON/jq-keys.sh
jq 'keys' json_data/example.json
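If you don’t have the example file at hand, the sketch below recreates it in a temporary directory (the workdir variable is just for illustration) and runs the same filter. jq returns the keys as a list, sorted alphabetically.

```shell
# Recreate the chapter's example file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/example.json" <<'EOF'
{
  "report_html": {"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"},
  "stats_txt": {"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"},
  "users": ["laderast", "ted", "tladeras"]
}
EOF

# -c prints the resulting list on one line instead of one key per line.
top_keys=$(jq -c 'keys' "$workdir/example.json")
echo "$top_keys"   # ["report_html","stats_txt","users"]
```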
8.10 Extracting a value from a container: jq .report_html
Say we want to extract the value of the report_html key in the example above.
We can specify the key we’re interested in, prefixed with a dot, to extract its value.
#| eval: false
#| filename: JSON/jq-report.sh
jq '.report_html' json_data/example.json
This is the JSON file we’re going to be working with, in json_data/example.json.
#| eval: false
#| filename: "json_data/example.json"
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
},
"users": ["laderast", "ted", "tladeras"]
}
In your terminal, try out:
jq '.stats_txt' json_data/example.json
What is returned?
#| eval: false
#| filename: JSON/jq-stats-txt.sh
jq '.stats_txt' json_data/example.json
This returns the following JSON object, which contains a single key-value pair.
{
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
}
8.10.1 Going one level deeper
We can extract the actual value associated with the dnanexus_link key within report_html by chaining onto our filter:
#| eval: false
#| filename: JSON/jq-nested.sh
jq '.report_html.dnanexus_link' json_data/example.json
What is returned when you run this code?
#| eval: false
#| filename: JSON/jq-nested.sh
jq '.report_html.dnanexus_link' json_data/example.json
Running this command should return the value of dnanexus_link within report_html:
"file-G4x7GX80VBzQy64k4jzgjqgY"
8.11 Conditional Filters using jq
One natural use case for jq on the DNAnexus platform is rerunning failed jobs.
Jobs can fail when run at normal priority, which relies on spot instances that can be interrupted. So, if we ran a series of jobs, we would want to restart the ones that failed.
This bit of code selects the jobs that have failed and reruns them.
#| eval: false
#| filename: JSON/dx-find-jobs-jq-clone.sh
dx find jobs --json |\
jq '.[] | select (.state | contains("failed")) | .id' |\
xargs -I% sh -c "dx run --clone %"
The second line contains the jq filter that does the magic. Remember, the filter is contained within the single quotes ('').
The last line uses xargs to run "dx run --clone %" once for each job ID, with % replaced by that ID.
Let’s take apart the different parts of the jq filter (Figure 8.1):
Note that the pipes in this filter apply only to the jq filter, so don’t mix them up with the other pipes in our overall Bash statement.
The first part of the filter, .[], says that we want to process each element of the list (remember, dx find jobs returns a list of objects).
The second part of the filter, select (.state | contains("failed")), selects the objects in the list that have a state of failed. These objects are then passed on to the next part of the filter.
The last part of the filter, .id, returns the job IDs for our failed jobs.
This is a basic pattern for selecting objects that meet a criterion, and it can be really helpful when you want more control over your batch processing.
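To try the pattern without touching real jobs, here is a sketch that runs the filter on a few hand-written job records. The job IDs and the state value partially_failed are made up for illustration. It also shows two variations that may be useful: contains() matches substrings, so an exact comparison with == can be stricter, and the -r flag prints the IDs without the surrounding quotes.

```shell
# Made-up job records standing in for dx find jobs --json output.
jobs='[
  {"id": "job-aaaa", "state": "failed"},
  {"id": "job-bbbb", "state": "done"},
  {"id": "job-cccc", "state": "partially_failed"},
  {"id": "job-dddd", "state": "failed"}
]'

# contains() does substring matching, so "partially_failed" also matches.
by_contains=$(echo "$jobs" | jq -r '.[] | select(.state | contains("failed")) | .id')

# == keeps only the records whose state is exactly "failed".
by_equals=$(echo "$jobs" | jq -r '.[] | select(.state == "failed") | .id')

echo "$by_contains"   # job-aaaa, job-cccc, job-dddd (one per line)
echo "$by_equals"     # job-aaaa, job-dddd (one per line)
```

Which of the two you want depends on the states you need to catch, so it is worth printing the selected IDs before piping them into xargs.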
How would you modify the code below to terminate all jobs that have a state of running using dx terminate?
#| eval: false
dx find jobs --json |\
jq '.[] | select (.state | contains("failed")) | .id' |\
xargs -I% sh -c "dx run --clone %"
#| eval: false
dx find jobs --json | \
jq '.[] | select (.state | contains("running")) | .id' | \
xargs -I% sh -c "dx terminate %"
8.12 Using JSON as an Input
This section is meant to help you write JSON files. If you build an app or a workflow, you will need to edit the dxapp.json or dxworkflow.json files to make your executables runnable.
8.12.1 Writing and modifying JSON
I know that JSON is supposed to be human readable. However, it has a lot of little quirks that make it hard for humans to write.
I highly recommend using an editor such as VS Code, with the appropriate JSON plugin. A JSON Visualizer such as the JSON Crack Extension will be extremely helpful as well.
Using the visualizer plugin and this tutorial will help you write well-formed JSON and point out any issues you might have, such as a misplaced comma or bracket.
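If you prefer the command line, jq itself can serve as a quick syntax check. The sketch below is one possible approach, not a DNAnexus-specific tool: it writes two tiny made-up files (the names good.json and bad.json and the check_json helper are assumptions for illustration) and relies on jq exiting with a non-zero status when it cannot parse its input. It only checks that the JSON is well formed, not that the fields are valid for dxapp.json.

```shell
workdir=$(mktemp -d)

# A well-formed file and one with a trailing comma (not allowed in JSON).
printf '{"name": "my_app", "version": "0.0.1"}\n' > "$workdir/good.json"
printf '{"name": "my_app", "version": "0.0.1",}\n' > "$workdir/bad.json"

# The empty filter parses the input and prints nothing;
# jq's exit status is non-zero when the file cannot be parsed.
check_json() {
  if jq empty "$1" 2>/dev/null; then
    echo "valid"
  else
    echo "invalid"
  fi
}

good_result=$(check_json "$workdir/good.json")
bad_result=$(check_json "$workdir/bad.json")
echo "$good_result $bad_result"   # valid invalid
```

Dropping the 2>/dev/null redirect shows jq’s parse error, which includes the line and column where parsing failed.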