1  Object Based File Systems

Project storage on the DNAnexus platform is object-based. This kind of filesystem is probably very different from other file systems you’ve experienced.

1.1 Who This Chapter is For

  • You are a bioinformatician used to working on path-based file systems and you find object-based file systems confusing.
  • You are a data manager or data engineer wanting to adapt your data policies to the DNAnexus platform.

1.2 Learning Objectives

The goal of this chapter is to show how the DNAnexus filesystem is different from your previous experiences, and to highlight ways to work successfully on the platform given these differences. Specifically you will be able to:

  1. Compare and contrast differences between object-based file systems and POSIX filesystems
  2. Explain the role of metadata and folders in organizing data and outputs
  3. Tag files and use tags to select files and to separate outputs from multiple jobs.

1.3 Terminology

There is a lot of terminology in talking about file systems. Check back here if you’re not sure of the definition of something.

  • POSIX filesystem - What we think of when we think of Linux filesystems. A set of standards that define how files are created and managed. POSIX filesystems have paths and files.
  • Data - The actual file contents. For example, for a CSV file, the actual header and rows correspond to the data.
  • Metadata - information that is not part of the data, but is associated with the data. For our CSV file, some examples of metadata are the file permissions (who can access the file), the creator, and the creation date.
  • Object - A contiguous set of memory that contains both the data and metadata. Has a unique identifier.
  • Unique Identifier - the actual “pointer” for accessing the file. Unique to the file object when it is created. Does not change for the entire lifecycle of an object.
  • Database Engine - software that allows for rapid searching and retrieval of objects.

1.4 Review: POSIX Filesystems

You may be familiar with POSIX (Portable Operating System Interface) file systems as a way to organize your data in folders. The POSIX standard defines a common filesystem API, so that software written against it works the same way across compliant file systems.

In POSIX filesystems, the main way to find/refer to a file is through its path. What does this mean?

For example, if my file is called chr1.vcf.gz and it is located in the directory /Geno_Data, we’d refer to it using the full path:

/Geno_Data/chr1.vcf.gz

Paths are the main way we organize and group files for POSIX filesystems (even devices are mapped to paths, such as /dev/hdd1 for an external hard drive). This information is external to a file.

Most importantly, we use directories to organize and group files logically. For example, we might have our R script and the data it analyzes in the same directory.

For a file in a POSIX-based filesystem, the path needs to be unique. If two files would end up with the same path, rules in the POSIX specification determine what happens, typically that the new file replaces the existing one. For example, we can’t have two files named:

/Geno_Data/chr1.vcf.gz

in the same folder, because then the path would no longer tell us which file we mean.
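
We can check this behavior with a minimal Python sketch. The temporary directory below is a made-up stand-in for /Geno_Data, and the file contents are placeholders; the point is only that writing to the same path twice leaves a single file.

```python
import os
import tempfile

# Throwaway directory standing in for /Geno_Data (hypothetical).
geno_dir = tempfile.mkdtemp()
path = os.path.join(geno_dir, "chr1.vcf.gz")

# Write to the same path twice; the second write replaces the first.
with open(path, "w") as handle:
    handle.write("first version")
with open(path, "w") as handle:
    handle.write("second version")

files_in_dir = os.listdir(geno_dir)
with open(path) as handle:
    contents = handle.read()

print(files_in_dir)  # a single directory entry
print(contents)      # only the second write survives
```

Because the path is the identifier, there is no way to keep both versions under the same name in the same directory.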

https://www.computerweekly.com/feature/Posix-vs-object-storage-How-much-longer-for-Posix
https://grimoire.carcano.ch/blog/posix-compliant-filesystems/

1.5 Object Based Filesystems are different

In contrast, object-based filesystems do not organize data like a POSIX filesystem. Each file object (such as a CSV file or a BAM file) has a unique identifier that identifies the data object. This unique identifier (like file-aagejFHEJSEI), rather than a path, serves as the main pointer for locating the data object. We’ll see that we can define paths and folders for DNAnexus file objects, but they are stored as part of the metadata for file objects.

File objects also have metadata that can be attached to them. This metadata can be:

Metadata Type    Example                      Code Example
ID               project-XXXXX:file-YYYYYY    dx mv file-YYYY raw_data/
Name             chr1.tar.gz                  dx find data --name chr1.tar.gz
Path             /raw_data/chr1.tar.gz        dx find data --path /raw_data/
Creation Date    2021-12-01                   dx find data --created-after 2021-12-01
Tags             report, run1                 dx tag file-xxxxxx report
Properties       eid = 1153xxx                dx find data --property eid=1153xxx

On the DNAnexus platform, both the filename and its path are considered metadata. This metadata is considered as part of the object, along with the data portion of the object.

Importantly, folders are not considered objects on the platform - folders only exist within the metadata for file objects.

The other issue is that, unlike paths, the metadata for an object is not required to be unique. This means you can have multiple objects with the same file name in the same folder.

I know, this can be very distressing for most people. You can have two objects with the same file name, but they are considered distinct objects because they have unique identifiers.
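
To make this concrete, here is a toy in-memory model of an object store. It is only a sketch: the FileObject class, the upload function, and the random IDs are invented for illustration and are not how the platform is implemented.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FileObject:
    name: str     # metadata
    folder: str   # metadata
    data: bytes   # the actual file contents
    # Toy identifier; real platform IDs are generated differently.
    object_id: str = field(default_factory=lambda: "file-" + uuid.uuid4().hex)


store = {}  # object ID -> FileObject


def upload(name, folder, data):
    """Create a new object; nothing checks whether the name is already taken."""
    obj = FileObject(name, folder, data)
    store[obj.object_id] = obj
    return obj.object_id


first = upload("chr1.vcf.gz", "/Geno_Data", b"run 1")
second = upload("chr1.vcf.gz", "/Geno_Data", b"run 2")

# Searching by folder and name (metadata) finds both objects.
matches = [o for o in store.values()
           if o.folder == "/Geno_Data" and o.name == "chr1.vcf.gz"]
print(len(matches))
```

The lookup key is the identifier, so the two uploads never collide even though their name and folder are identical.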

https://www.ibm.com/cloud/blog/object-vs-file-vs-block-storage

1.6 Comparing POSIX and Object Based File Systems

Concept          POSIX File System             Object-Based System
File ID          Represented by Full Path      Represented by Object ID
Storage          Data with limited metadata    Metadata+Data
Path/Filename    Must be Unique                Can be duplicated
Metadata         Limited                       Rich, can be freely modified after creation

1.7 Tracing the journey of a file object onto the platform

When a file is uploaded, the file object goes through three stages before it is available. These stages are:

  1. Open - Files are still being transferred via dx upload or the Upload Agent ua.
  2. Closing - File objects stay in this state for no longer than 8-10 seconds.
  3. Closed - Files are now available on the platform to be used by an app or workflow, or to be downloaded.

We can check which state a file object is in with dx describe (see the State field):

% dx describe file-FpQKQpQ0Fgk3gQZz3gPXQj7x
Result 1:
ID                    file-FpQKQpQ0Fgk3gQZz3gPXQj7x
Class                 file
Project               project-GJ496B00FZfxPV5VG36FybvY
Folder                /data
Name                  NA12878.bai
State                 closed
Visibility            visible
Types                 -
Properties            -
Tags                  -
Outgoing links        -
Created               Wed Apr 22 17:59:22 2020
Created by            emiai
 via the job          job-FpQGX780FgkG4bGz86zZk04V
Last modified         Thu Oct 13 15:38:04 2022
Media type            application/octet-stream
archivalState         "live"
Size                  2.09 MB, sponsored by DNAnexus
cloudAccount          "cloudaccount-dnanexus"
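
If we want a script to wait until a file is usable, we can check the state programmatically. The sketch below assumes the JSON form of the description (for example from dx describe --json) exposes a "state" key; the sample strings are trimmed, hand-written stand-ins rather than real output, and is_closed is a hypothetical helper name.

```python
import json


def is_closed(describe_json: str) -> bool:
    """Return True when the described object reports the state 'closed'."""
    description = json.loads(describe_json)
    # Assumption: the JSON description carries a lowercase "state" key.
    return description.get("state") == "closed"


# Hand-written stand-ins for a JSON description, not real platform output.
sample = '{"id": "file-FpQKQpQ0Fgk3gQZz3gPXQj7x", "class": "file", "state": "closed"}'
still_uploading = '{"id": "file-XXXX", "class": "file", "state": "open"}'

print(is_closed(sample))
print(is_closed(still_uploading))
```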

1.8 Copying Files from One Project to Another

Copying has an important definition on the platform: it means copying a file from one project to another project. It doesn’t refer to duplicating a file within a project.

When we copy a file from one project to another, the process doesn’t make a new physical copy. The reference to the original file is copied. The data is identical and points to the same location on the platform, and the metadata is copied to the new project.

This is quite nice in that you are not doubly charged for storage on the platform.

One important point: once the metadata is copied into the new project, there is no syncing of the metadata between the two projects. User 1 is free to modify the metadata in Project A (the original project) and changes are not made to the metadata in Project B (the derived project).
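
The toy model below sketches this behavior. The dictionaries and the copy_file function are invented for illustration; they only mirror the idea that the data is stored once while each project keeps its own metadata record.

```python
import copy

# One physical copy of the data per file ID.
data_store = {"file-YYYY": b"ACGT"}

# Per-project metadata records, keyed by project and then by file ID.
projects = {
    "project-A": {"file-YYYY": {"name": "reads.txt", "folder": "/", "tags": []}},
    "project-B": {},
}


def copy_file(file_id, source, destination):
    """Copy only the metadata record; the data itself is not duplicated."""
    projects[destination][file_id] = copy.deepcopy(projects[source][file_id])


copy_file("file-YYYY", "project-A", "project-B")

# User 1 tags the file in Project A after the copy.
projects["project-A"]["file-YYYY"]["tags"].append("reviewed")

print(projects["project-A"]["file-YYYY"]["tags"])  # changed
print(projects["project-B"]["file-YYYY"]["tags"])  # unchanged
print(len(data_store))                             # still one copy of the data
```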

What about cloning?

The documentation talks about cloning files into a different project - if you clone a file, it makes a new copy of the metadata. It does not create a new physical copy of the data.

Copying files actually uses the dx api file-XXXXX/clone functionality underneath.

Regions on the DNAnexus Platform

All projects live within a region. Because DNAnexus is built on both AWS and Azure, DNAnexus regions map to regions within these service providers. See the regions documentation linked below for a table of how each DNAnexus region maps to the service providers.

For example, the UKB RAP region (eu-west-2) is unique and limited to projects within the scope of UK Biobank.

https://documentation.dnanexus.com/developer/api/regions

1.8.1 Advantages of Object Based Filesystems

The DNAnexus platform is only one example of an object-based filesystem. Other examples include Amazon S3 (Simple Storage Service), Microsoft Azure Blob Storage, and Google Cloud Storage.

Why does the world run on object based filesystems? There are a lot of advantages.

  • Highly scalable. This is the main reason given for using an object-based system. Given that unique identifier, the data part of the object can be very large.
  • Cost Efficient. Object based storage is designed to be efficient with very large data objects.
  • Fast Retrieval. Object-based filesystems let us work with arbitrarily large file sizes, and we can actually stream files to and from workers. File reading is very fast from an object-based filesystem. This is especially helpful when we are working with large genomic files (such as pVCFs).
  • Improved search speed. You can attach various database engines to objects in storage and rapidly search through them. An example of such an engine is Snowflake.
  • File operations are simplified. Compared to POSIX filesystems, there are only a few object filesystem commands: PUT, GET, DELETE, POST, HEAD.

1.8.2 Disadvantages of Object Based Filesystems

Coming from folder-based/POSIX filesystems, it can be a bit of a mind-bender getting used to object-based filesystems. Some of the disadvantages of Object Based Filesystems include:

  • Objects are immutable. Once an object is created, you are not able to modify it or edit it in place. If you modify a file on a worker, you can’t overwrite the original file object. A new file object must be created. (If you work with JupyterLab and save a notebook, a new file object is created in the place of the old file object. The old file object persists in a folder called .Notebook_archive.)
  • You have to be careful when generating outputs. This can be difficult if you have to resume a set of failed jobs in a new run. You can end up with two different objects with the same filename, and it can be some work to disambiguate these objects.
  • It’s confusing. You can actually have two files with the same filename in the same folder, because it is part of the changeable metadata. Disambiguating these two files without using file-ids can be difficult. There are rules that govern this.
  • Metadata is much more important with file management. Seriously, use tags for everything, including jobs and files. It will make working with multiple files much easier. And if you are on UKB RAP, leverage the file property metadata (eid and field_id) to help you select the files you want to process.
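
The first point, immutability, can be sketched with another toy model. The create and save_modified functions and the counter-based IDs are made up for illustration; the idea is simply that "editing" produces a new object and leaves the original untouched.

```python
import itertools

_counter = itertools.count(1)
objects = {}  # object ID -> (name, data); entries are never edited in place


def create(name, data):
    """Store a new object under a fresh (toy) identifier."""
    object_id = f"file-{next(_counter):04d}"
    objects[object_id] = (name, data)
    return object_id


def save_modified(original_id, new_data):
    """'Editing' makes a new object with the same name; the original persists."""
    name, _ = objects[original_id]
    return create(name, new_data)


v1 = create("analysis.ipynb", b"draft")
v2 = save_modified(v1, b"final")

print(v1, objects[v1])
print(v2, objects[v2])
```

Both objects now share a file name, which is exactly the situation where tags or file IDs are needed to tell them apart.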

In short, getting used to how object-based filesystems work requires a bit of an adjustment on our part. In many cases, the benefits outweigh the drawbacks.

1.9 Best Practice: Always add project IDs to your File IDs

In Bash for Bioinformatics, we already discovered one way of working with files on the platform: dx find data. This is our main tool for finding and selecting files with metadata.

When we work with files outside of our current project, we might reference them by file ID. Using a file ID by itself triggers a global lookup, so we need to be careful when we do this!

Why is this so? When we use a file ID without a project context, the metadata search has only the file ID to go on, and it proceeds in this order:

  1. Look for the file in the current project
  2. Look at all files across all other projects

If you want to use the platform effectively, you want to avoid #2 at all costs, especially when working with a lot of files. The metadata server will take much longer to find your files because it needs to scan more projects.

The lesson here is when using file-ids, it is safer to put the project-id in front of your file id such as below:

project-XXXXX:file-YYYYYYYYY
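
In scripts, a small helper can make sure every file ID carries its project. This is a sketch only; qualify is a hypothetical name, and the prefix checks assume the ID formats shown in this chapter.

```python
def qualify(project_id: str, file_id: str) -> str:
    """Return a project-qualified ID, leaving already-qualified IDs alone."""
    if ":" in file_id:
        return file_id
    if not project_id.startswith("project-") or not file_id.startswith("file-"):
        raise ValueError(f"unexpected ID format: {project_id!r}, {file_id!r}")
    return f"{project_id}:{file_id}"


print(qualify("project-XXXXX", "file-YYYYYYYYY"))
```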

1.10 Best Practice: Batch Tagging

So, we need to tag files. Remember, we can leverage xargs for tagging multiple files. In this example, we pipe the output of dx find data into xargs to batch tag them.

dx find data --name "*.bam" --brief | xargs -I% sh -c "dx tag % 'bam'"

After we do this, we can check whether our operation was successful. We can run:

dx find data --tag bam --brief

And here is our response:

project-GJ496B00FZfxPV5VG36FybvY:file-BZ9YGzj0x05b66kqQv51011q
project-GJ496B00FZfxPV5VG36FybvY:file-BZ9YGpj0x05xKxZ42QPqZkJY
project-GJ496B00FZfxPV5VG36FybvY:file-BQbXVY0093Jk1KVY1J082y7v
project-GJ496B00FZfxPV5VG36FybvY:file-FpQKQk00FgkGV3Vb3jJ8xqGV

Using the dx find data/xargs/dx tag combination with various input parameters to dx find data such as --name, --created-before, --created-after, --class, will help us to batch tag files and other objects on the platform.
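
Before tagging a large number of files, it can help to preview exactly which commands will run. The sketch below builds one dx tag command per line of dx find data --brief output without executing anything; build_tag_commands is a hypothetical helper, and the sample output reuses two of the IDs shown above.

```python
import shlex


def build_tag_commands(find_output, tag):
    """Turn `dx find data --brief` output into one `dx tag` command per file."""
    commands = []
    for line in find_output.splitlines():
        file_ref = line.strip()
        if file_ref:  # skip blank lines
            commands.append(shlex.join(["dx", "tag", file_ref, tag]))
    return commands


sample_output = """project-GJ496B00FZfxPV5VG36FybvY:file-BZ9YGzj0x05b66kqQv51011q
project-GJ496B00FZfxPV5VG36FybvY:file-BZ9YGpj0x05xKxZ42QPqZkJY
"""

commands = build_tag_commands(sample_output, "bam")
for command in commands:
    print(command)
```

Once the preview looks right, the same lines can be run from the shell or passed to subprocess.run.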

1.11 Best Practice: Use tags for archiving

Let’s do something concrete and useful in our project: tag files we no longer need for archiving.

Say there are files we want to archive. We can use dx tag or the DNAnexus UI to tag these files with a specific tag, such as to_archive. This can be done by users.

An administrator can then run a monthly job that archives the files with this tag, first selecting them with dx find data --tag to_archive and then passing those file IDs as JSON input to a dx api <project-id> archive command.

dx find data --tag "to_archive" --brief | cut -d: -f2 | xargs -I% dx api <projectid> archive '{"files": ["%"]}'
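
When many files carry the tag, it may be more efficient to send file IDs in batches rather than one call per file. The sketch below only builds the JSON inputs. It assumes the project archive API method accepts an input of the form {"files": [...]}, which should be checked against the API documentation; archive_payloads is a hypothetical helper and the batch size of 1000 is an arbitrary placeholder.

```python
import json


def archive_payloads(file_refs, batch_size=1000):
    """Yield one JSON input string per batch of file IDs."""
    # Drop any "project-...:" prefix, as printed by `dx find data --brief`.
    file_ids = [ref.strip().split(":")[-1] for ref in file_refs if ref.strip()]
    for start in range(0, len(file_ids), batch_size):
        # Assumption: the archive method takes {"files": [<file IDs>]}.
        yield json.dumps({"files": file_ids[start:start + batch_size]})


refs = ["project-A:file-1", "project-A:file-2", "file-3"]
payloads = list(archive_payloads(refs, batch_size=2))
for payload in payloads:
    print(payload)
```

Each payload could then be passed as the JSON input of one archive call for the project.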

1.12 What about dxFUSE?

(See Bash for Bioinformatics for an introduction to dxFUSE.)

You might ask about the role of dxFUSE with the Object Based Filesystem.

In short, dxFUSE makes the Object Based Filesystem of DNAnexus act like a POSIX filesystem. Specifically, if there are multiple objects with the same name within a folder, it provides a way to specify these objects using file paths.

One important thing to remember: dxFUSE is only mounted once when you start the JupyterLab App. You will have to use the dxfuse command to remount it to see any files you’ve added to project storage after opening JupyterLab.

1.13 Best Practice: Use dxFUSE file paths in your scripts

Given the ability to stream data from project storage, you might ask how to use this in your scripts.

You can specify file paths in a project using /mnt/project/ in your scripts.

Here’s a python example where we’re loading a file called penguins.csv that is in the /data/ folder in our project.

import pandas as pd
# Read the CSV through the dxFUSE mount and keep the resulting DataFrame
penguins = pd.read_csv("/mnt/project/data/penguins.csv")
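
If a script handles many project files, a small helper keeps the mount point in one place. This is a sketch: mounted_path is a hypothetical name, and /mnt/project is the mount point described above.

```python
import posixpath

MOUNT_POINT = "/mnt/project"  # dxFUSE mount point used in this chapter


def mounted_path(project_path: str) -> str:
    """Map a project path such as /data/penguins.csv to its dxFUSE path."""
    return posixpath.join(MOUNT_POINT, project_path.lstrip("/"))


print(mounted_path("/data/penguins.csv"))
```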

1.13.1 dxFUSE: -limitedWrite

For the most part, dxFUSE provides read-only access to project storage. However, it has a -limitedWrite option that is mostly used for writing from the Hadoop File System (HDFS) on a Spark Cluster into project storage.

dxfuse -limitedWrite

Once you have remounted with -limitedWrite, you’ll be able to specify file:///mnt/project URLs in your .write() command. This is especially useful when calculating and saving BGEN Index files.

We don’t recommend using this with files that are outside of HDFS. It is mostly helpful when you are working with Hail and Spark and want to get results out of the Spark Cluster.