dx find data --name "*.bam" --brief | xargs -I% sh -c "dx tag % 'bam'"
1 Object Based File Systems
Project storage on the DNAnexus platform is object-based. This kind of filesystem is probably very different from other file systems you’ve experienced.
1.1 Who This Chapter is For
- You are a bioinformatician used to working on path-based file systems and you find object based file systems confusing
- You are a data manager or data engineer wanting to adapt your data policies to the DNAnexus platform.
1.2 Learning Objectives
The goal of this chapter is to show how the DNAnexus filesystem is different from your previous experiences, and to highlight ways to work successfully on the platform given these differences. Specifically you will be able to:
- Compare and contrast differences between object-based file systems and POSIX filesystems
- Explain the role of metadata and folders in organizing data and outputs
- Tag and utilize tags for selecting files and separating output from multiple jobs.
1.3 Terminology
There is a lot of terminology in talking about file systems. Check back here if you’re not sure of the definition of something.
- POSIX filesystem - What we think of when we think of Linux filesystems. A set of standards that define how files are created and managed. POSIX filesystems have paths and files.
- Data - The actual file contents. For example, for a CSV file, the actual header and rows correspond to the data.
- Metadata - information that is not part of the data, but is associated with the data. For our CSV file, some examples of metadata are the file permissions (who can access the file), the creator, and the creation date.
- Object - A contiguous set of memory that contains both the data and metadata. Has a unique identifier.
- Unique Identifier - the actual “pointer” for accessing the file. Unique to the file object when it is created. Does not change for the entire lifecycle of an object.
- Database Engine - software that allows for rapid searching and retrieving objects.
1.4 Review: POSIX Filesystems
You may be familiar with POSIX (Portable Operating System Interface) file systems as a way to organize your data in folders. The POSIX standard basically defines the filesystem API so that they will be compatible with other file systems.
In POSIX filesystems, the main way to find/refer to a file is through its path. What does this mean?
For example, if my file is called chr1.vcf.gz
and it is located in the directory /Geno_Data
, we’d refer to it using the full path:
/Geno_Data/chr1.vcf.gz
Paths are the main way we organize and group files for POSIX filesystems (even devices are mapped to paths, such as /dev/hdd1
for an external hard drive). This information is external to a file.
Most importantly, we use directories to organize and group files logically. For example, we might have our R script and the data it analyzes in the same directory.
For a file in a POSIX-based filesystem, the path needs to be unique. If they are not, there are transactional rules that are part of the POSIX specification that define when to replace that file with a new one, or to save both versions. For example, we can’t have two files named:
/Geno_Data/chr1.vcf.gz
In the same folder. That violates our ability to find a file in the system.
https://www.computerweekly.com/feature/Posix-vs-object-storage-How-much-longer-for-Posix https://grimoire.carcano.ch/blog/posix-compliant-filesystems/
1.5 Object Based Filesystems are different
In contrast, object-based filesystems do not organize data like a POSIX filesystem. Each file object (such as a csv file, or a BAM file) has a unique identifier that identifies the data object. This unique identifier (like file-aagejFHEJSEI
) serves as the main pointer to locate the data object, rather than a path. We’ll see that we can define paths and folders for DNAnexus file objects, but they are stored as part of the metadata for file objects.
File objects also have metadata that can be attached to them. This metadata can be:
Metadata Type | Example | Code Example |
---|---|---|
ID | project-XXXXX:file-YYYYYY |
dx mv file-YYYY raw_data/ |
name | chr1.tar.gz |
dx find data --name chr1.tar.gz |
Path | /raw_data/chr1.tar.gz |
dx find data --path /raw_data/ |
Creation Date | 2021-12-01 |
dx find data --created-after 2021-12-01 |
Tags | report , run1 |
dx tag file-xxxxxx report |
Properties | eid = 1153xxx |
dx find data --property eid=1153xxx |
On the DNAnexus platform, both the filename and its path are considered metadata. This metadata is considered as part of the object, along with the data portion of the object.
Importantly, folders are not considered objects on the platform - folders only exist within the metadata for file objects.
The other issue is that the metadata for an object has no requirements to be unique, unlike paths. Which means you can have duplicates with the same file name in the same folder.
I know, this can be very distressing for most people. You can have two objects with the same file name, but they are considered distinct objects because they have unique identifiers.
https://www.ibm.com/cloud/blog/object-vs-file-vs-block-storage
1.6 Comparing POSIX and Object Based File Systems
Concept | POSIX File System | Object-Based System |
---|---|---|
File ID | Represented by Full Path | Represented by Object ID |
Storage | Data with limited metadata | Metadata+Data |
Path/Filename | Must be Unique | Can be duplicated |
Metadata | Limited | Rich, can be freely modified after creation |
1.7 Tracing the journey of a file object onto the platform
When a file uploaded, file objects go through three stages before they are available. These stages are:
- Open - Files are still being transferred via
dx upload
or the Upload Agentua
. - Closing - File objects stay in this state for no longer than 8-10 seconds.
- Closed - Files are now available to be utilized on the platform, either with an app, workflow, or downloaded.
% dx describe file-FpQKQpQ0Fgk3gQZz3gPXQj7x
Result 1:
ID file-FpQKQpQ0Fgk3gQZz3gPXQj7x
Class file
Project project-GJ496B00FZfxPV5VG36FybvY
Folder /data
Name NA12878.bai
State closed
Visibility visible
Types -
Properties -
Tags -
Outgoing links -
Created Wed Apr 22 17:59:22 2020
Created by emiai
via the job job-FpQGX780FgkG4bGz86zZk04V
Last modified Thu Oct 13 15:38:04 2022
Media type application/octet-stream
archivalState "live"
Size 2.09 MB, sponsored by DNAnexus
cloudAccount "cloudaccount-dnanexus"
1.8 Copying Files from One Project to Another
Copying has an important definition on the platform: it means copying a file from one project to another project. It doesn’t refer to duplicating a file within a project.
When we copy a file from one project to another, the process doesn’t make a new physical copy. The reference to the original file is copied. The data is identical and points to the same location on the platform, and the metadata is copied to the new project.
This is quite nice in that you are not doubly charged for storage on the platform.
One important point: once the metadata is copied into the new project, there is no syncing of the metadata between the two projects. User 1 is free to modify the metadata in Project A (the original project) and changes are not made to the metadata in Project B (the derived project).
The documentation talks about cloning files into a different project - if you clone a file, it makes a new copy of the metadata. It does not create a new physical copy of the data.
Copying files actually uses the dx api file-XXXXX/clone
functionality underneath.
All projects live within a region. As DNAnexus is built on both AWS and Azure, they map to regions within these service providers. Check here for a table of how the DNAnexus region maps to the service providers.
For example, the UKB RAP region (eu-west-2
) is unique and only limited to projects within the scope of UK Biobank.
1.8.1 Advantages of Object Based Filesystems
The DNAnexus platform is only one example of an object-based filesystem. Other examples include Amazon S3 (Simple Storage Service), Microsoft Azure Blob Storage, and Google Cloud Storage.
Why does the world run on object based filesystems? There are a lot of advantages.
- Highly scalable. This is the main reason given for using an object-based system. Given that unique identifier, the data part of the object can be very large.
- Cost Efficient. Object based storage is designed to be efficient with very large data objects.
- Fast Retrieval. Object-based filesystems let us work with arbitrarily large file sizes, and we can actually stream files to and from workers. File reading is very fast from an object-based filesystem. This is especially helpful when we are working with large genomic files (such as pVCFs).
- Improved search speed. You can attach various database engines to objects in storage and rapidly search through them. An example of such an engine is Snowflake.
- File operations are simplified. Compared to POSIX filesystems, there are only a few object filesystem commands: PUT, GET, DELETE, POST, HEAD.
1.8.2 Disadvantages of Object Based Filesystems
Coming from folder-based/POSIX filesystems, it can be a bit of a mind-bender getting used to object-based filesystems. Some of the disadvantages of Object Based Filesystems include:
- Objects are immutable. Once an object is created, you are not able to modify it or edit it in place. If you modify a file on a worker, you can’t overwrite the original file object. A new file object must be created. (If you work with JupyterLab and save a notebook, a new file object is created in the place of the old file object. The old file object persists in a folder called
.Notebook_archive
.) - You have to be careful when generating outputs. This can be difficult if you have to resume a set of failed jobs in a new run. You can end up with two different objects with the same filename, and it can be some work to disambiguate these objects.
- It’s confusing. You can actually have two files with the same filename in the same folder, because it is part of the changeable metadata. Disambiguating these two files without using file-ids can be difficult. There are rules that govern this.
- Metadata is much more important with file management. Seriously, use tags for everything, including jobs and files. It will make working with multiple files much easier. And if you are on UKB RAP, leverage the file property metadata (
eid
andfield_id
) to help you select the files you want to process.
In short, getting used to how object-based filesystems requires a bit of an adjustment on our parts. In many cases, the benefits outweigh the drawbacks.
1.9 Best Practice: Always add project IDs to your File IDs
In Bash for Bioinformatics, we already discovered one way of working with files on the platform: dx find data
. This is our main tool for finding and selecting files with metadata.
When we work with files outside of our current project, we might reference it by a file-id. Using file IDs by themselves are a global operation and we need to be careful when we use this!
Why is this so? There is a search tree when we use a file ID that is not in our project and without a project context. The search over metadata is looking for a file object based on just file ID.
- Look for the file in the current project
- Look at all files across all other projects
If you want to use the platform effectively, you want to avoid #2 at all costs, especially when working with a lot of files. The metadata server will take much longer to find your files because it needs to scan more projects.
The lesson here is when using file-ids, it is safer to put the project-id in front of your file id such as below:
project-XXXXX:file-YYYYYYYYY
1.10 Best Practice: Batch Tagging
So, we need to tag files. Remember, we can leverage xargs
for tagging multiple files. In this example, we pipe the output of dx find data
into xargs
to batch tag them.
After we do this, we can check whether our operation was successful. We can run:
dx find data --tag bam --brief
And here is our response:
project-GJ496B00FZfxPV5VG36FybvY:file-BZ9YGzj0x05b66kqQv51011q
project-GJ496B00FZfxPV5VG36FybvY:file-BZ9YGpj0x05xKxZ42QPqZkJY
project-GJ496B00FZfxPV5VG36FybvY:file-BQbXVY0093Jk1KVY1J082y7v
project-GJ496B00FZfxPV5VG36FybvY:file-FpQKQk00FgkGV3Vb3jJ8xqGV
Using the dx find data/xargs/dx tag
combination with various input parameters to dx find data
such as --name
, --created-before
, --created-after
, --class
, will help us to batch tag files and other objects on the platform.
1.12 What about dxFUSE?
(See Bash for Bioinformatics for a introduction to dxFUSE.)
You might ask about the role of dxFUSE with the Object Based Filesystem.
In short, dxFUSE makes the Object Based Filesystem of DNAnexus act like a POSIX filesystem. Specifically, if there are multiple objects with the same name within a folder, it provides a way to specify these objects using file paths.
One important thing to remember: dxFUSE is only mounted once when you start the JupyterLab App. You will have to use the dxfuse
command to remount it to see any files you’ve added to project storage after opening JupyterLab.
1.13 Best Practice: Use dxFUSE file paths in your scripts
Given the ability to stream data from project storage, you might ask how to use this in your scripts.
You can specify file paths in a project using /mnt/project/
in your scripts.
Here’s a python example where we’re loading a file called penguins.csv
that is in the /data/
folder in our project.
import pandas as pd
pd.read_csv("/mnt/project/data/penguins.csv")
1.13.1 dxFUSE: -limitedWrite
For the most part, dxFUSE is read-only from project storage. However, it has a -limitedWrite
option that is mostly used from writing from the Hadoop File System (HDFS) on a Spark Cluster into project storage.
dxfuse -limitedWrite
Once you have remounted with -limitedWrite
, you’ll be able to specify file:///mnt/project
urls in your .write()
command. This is especially useful when calculating and saving BGEN Index files.
We don’t recommend using this with files that are outside of HDFS. It is mostly helpful when you are working with Hail and Spark and want to get results out of the Spark Cluster.