6  Hail Skills and Concepts

6.1 Learning Objectives

After reading this chapter, you should be able to:

  • Explain basic Hail terms and concepts, including Hail Tables and Hail MatrixTables
  • Monitor progress using the Spark UI
  • Execute Hail operations on Tables and MatrixTables, including annotating, filtering, and joining
  • Calculate Sample QC metrics and save as a Hail Table in DNAX

This chapter shows how to perform sample QC on a Hail MatrixTable as a pre-GWAS step and save it as a Table to an Apollo database (dnax://) on the DNAnexus platform. See documentation for guidance on launch specs for the JupyterLab with Spark Cluster app for different data sizes: https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data

Note: For population scale data, samples may be referred to as individuals. In this chapter, the word “sample” will be used.

6.2 What is Hail?

Hail is a genomics framework built on top of Spark that enables scalable genomic queries. It uses many of the same concepts as Spark (partitions, executors, query plans), but it is genomics-focused.

For example, a Hail MatrixTable can be QC’ed and then filtered on the resulting QC metrics.

6.2.1 Vocabulary

DNAnexus Specific:

  • DNAX - an S3 bucket for storing Apollo databases and tables - accessed with the dnax:// protocol
  • Database Object - a data object on the platform that represents a Parquet database - it has a unique identifier

Hail and Spark Terminology:

  • Spark Driver - the “boss” of the Spark cluster - hosts the Spark UI
  • Spark Worker - an individual node that houses executors
  • Executor - an individual core/CPU within a worker - executors are assigned tasks, just as in plain Spark
  • Data Partition - the portion of the data accessed by a worker, stored in Parquet columnar format
  • Key - a unique identifier for rows or columns
    • row keys: locus + alleles
    • column keys: sample IDs

In the rest of this chapter, we’ll discuss some of the operations that can be done on Hail Tables and MatrixTables, the two fundamental data structures of Hail.

6.3 Initialize Spark and Hail

Make sure to open the Spark UI at https://job-url:8081/jobs/ so you can monitor the progress of the jobs launched by the cells below.

# Running this cell will output a red-colored message- this is expected.
# The 'Welcome to Hail' message in the output will indicate that Hail is ready to use in the notebook.

from pyspark.sql import SparkSession
import hail as hl

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)
pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 2.4.4
SparkUI available at http://ip-172-31-34-162.ec2.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.78-b17627756568
LOGGING: writing to /opt/notebooks/hail-20230418-1956-0.2.78-b17627756568.log

Now we are ready to connect and do work with Hail.

Note that you should only initialize Hail once per notebook. If you try to initialize it multiple times, you will get an error.

If that happens, restart the notebook kernel (Notebook > Restart Kernel in the menu) and execute the notebook again from the top.
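
If you would rather not restart, one defensive pattern (our suggestion, not part of the original notebook) is to wrap the initialization in a try/except so that re-running the cell is harmless:

# Attempt initialization; if Hail is already running in this kernel,
# keep the existing session instead of failing the cell.
try:
    hl.init(sc=spark.sparkContext)
except Exception as err:
    print(f"Hail appears to be initialized already: {err}")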

6.4 Important Data Structures in Hail: Tables and MatrixTables

Before we get started with doing a GWAS in Hail, let’s talk about the data structures we’ll work with: Hail Tables and Hail MatrixTables.

One of the hardest things to understand about Hail data structures is that they are actually linked data structures. It’s easiest to think of them as data tables with attached metadata.

6.5 Hail Table

The fundamental data structure in Hail is the Hail Table.

I think of a Hail Table as a regular data table with a set of metadata fields attached. These metadata fields are known as global fields. You can think of them as annotations for the entire table, such as the instrument used to generate the data, the software version used to process it, etc.

The tabular data itself is stored in row fields.
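
To make the difference between global fields and row fields concrete, here is a minimal sketch (the table and the annotation values are invented for illustration):

# A tiny 3-row table; range_table gives a single row field, 'idx'
ht = hl.utils.range_table(3)

# Attach table-wide metadata as global fields
ht = ht.annotate_globals(instrument="NovaSeq", pipeline_version="1.0")

ht.describe()           # lists global fields and row fields separately
hl.eval(ht.instrument)  # a global field can be read without scanning rows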

6.6 Load in a Table from Project Storage

The first thing we’ll do in Hail is load a phenotype (pheno) file from project storage.

We’ll use hl.import_table() for this. We have to specify our key for the file. In our table, we have both fid (family ID) and iid (individual ID). We will use iid (individual ID) as our key.

# Import the pheno CSV file as a Hail Table

pheno_table = hl.import_table("file:///mnt/project/data/ukbb_100k_bmi_casecontrol.csv",
                              delimiter=',',
                              impute=True,
                              key='iid') # specify the column that will be the key (values must match what is in the MT 's' column)
2023-04-18 19:57:56 Hail: INFO: Reading table to impute column types
2023-04-18 19:58:01 Hail: INFO: Finished type imputation
  Loading field 'fid' as type str (imputed)
  Loading field 'iid' as type str (imputed)
  Loading field 'father_iid' as type int32 (imputed)
  Loading field 'mother_iid' as type int32 (imputed)
  Loading field 'sex_code' as type int32 (imputed)
  Loading field 'case_control_status' as type int32 (imputed)

6.6.1 Quick Hail Table EDA

We want to confirm that we loaded the data correctly. There are four really helpful methods for this:

  • .describe() - gives the fields and their data types for our Hail Table
  • .count() - counts the number of rows in our Hail Table
  • .show() - shows the first few rows of our dataset
  • .summarize() - gives us information about missing values and data types

Let’s start with .describe() on our loaded Pheno Table. This method will show the different fields associated with our Hail Table. Note that the actual table is referred to as Row fields, and you can see the names of the individual columns in the dataset.

pheno_table.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'fid': str 
    'iid': str 
    'father_iid': int32 
    'mother_iid': int32 
    'sex_code': int32 
    'case_control_status': int32 
----------------------------------------
Key: ['iid']
----------------------------------------

If we do a count, we can see that it counts the number of rows in our table. This can be really helpful in confirming the data was loaded correctly.

Keep in mind that with larger tables, the .count() operation can take some time to execute. That’s because it is a command that needs to be done on all of the partitions of the dataset.

pheno_table.count()
19884

.show() will show us the first few rows of our Hail Table. Similar to Spark, if we pass an integer to .show() (like .show(10)) it will show that number of rows instead.

pheno_table.show()
fid               iid               father_iid  mother_iid  sex_code  case_control_status
str               str               int32       int32       int32     int32
"sample_1_10"     "sample_1_10"     0           0           1         2
"sample_1_10003"  "sample_1_10003"  0           0           2         2
"sample_1_10014"  "sample_1_10014"  0           0           2         1
"sample_1_10020"  "sample_1_10020"  0           0           1         2
"sample_1_10021"  "sample_1_10021"  0           0           2         1
"sample_1_10023"  "sample_1_10023"  0           0           1         1
"sample_1_10036"  "sample_1_10036"  0           0           2         1
"sample_1_10037"  "sample_1_10037"  0           0           1         1
"sample_1_10047"  "sample_1_10047"  0           0           1         1
"sample_1_10056"  "sample_1_10056"  0           0           1         2

showing top 10 rows

Calling .summarize() on a Hail Table gives you an overview of missing values and data types for each of the columns in your Hail Table. This is extremely helpful for confirming the data was loaded and parsed correctly.

pheno_table.summarize()

6.6.2 Diving into individual row fields

For our pheno_table, we can also show a single column by using the . accessor together with .show(). By default, it will show the column along with the row key for the dataset (iid).

pheno_table.case_control_status.show()
iid               case_control_status
str               int32
"sample_1_10"     2
"sample_1_10003"  2
"sample_1_10014"  1
"sample_1_10020"  2
"sample_1_10021"  1
"sample_1_10023"  1
"sample_1_10036"  1
"sample_1_10037"  1
"sample_1_10047"  1
"sample_1_10056"  2

showing top 10 rows

An additional method we can use on numeric columns is .summarize(), which gives a numeric summary of the column. We don’t have a continuous numeric column in this table, so we’ll demonstrate it on the integer-coded sex_code column:

pheno_table.sex_code.summarize()

19884 records.

sex_code (int32):
    Non-missing 19884 (100.00%)
    Missing 0
    Minimum 1
    Maximum 2
    Mean 1.54
    Std Dev 0.50

6.7 Transforming Hail Tables

There are multiple operations we can perform on a Hail Table (we’ll call our Hail Table ht here):

  • ht.filter()
  • ht.annotate()

Let’s investigate these.

6.7.1 Specifying new columns using .annotate()

We can define new columns in our Hail Table using the .annotate() method. This is a lot like the mutate() function in R/Tidyverse. We can compute new values based on other columns in our dataset.

Here we’re subtracting 1 from both the sex_code and case_control_status columns so that we can use them as a phenotype and a covariate in our downstream analysis.

pheno_table2 = pheno_table.annotate(sc=pheno_table.sex_code - 1, ccs=pheno_table.case_control_status - 1)
pheno_table2.show()
2023-04-18 21:01:24 Hail: INFO: Coerced sorted dataset
fid               iid               father_iid  mother_iid  sex_code  case_control_status  sc     ccs
str               str               int32       int32       int32     int32                int32  int32
"sample_1_10"     "sample_1_10"     0           0           1         2                    0      1
"sample_1_10003"  "sample_1_10003"  0           0           2         2                    1      1
"sample_1_10014"  "sample_1_10014"  0           0           2         1                    1      0
"sample_1_10020"  "sample_1_10020"  0           0           1         2                    0      1
"sample_1_10021"  "sample_1_10021"  0           0           2         1                    1      0
"sample_1_10023"  "sample_1_10023"  0           0           1         1                    0      0
"sample_1_10036"  "sample_1_10036"  0           0           2         1                    1      0
"sample_1_10037"  "sample_1_10037"  0           0           1         1                    0      0
"sample_1_10047"  "sample_1_10047"  0           0           1         1                    0      0
"sample_1_10056"  "sample_1_10056"  0           0           1         2                    0      1

showing top 10 rows

6.7.2 Filtering our pheno_table using .filter()

The .filter() method is used a lot in Hail for subsetting tables. We’ll leverage this later to filter our QC tables and then subset our MatrixTable with the filtered table. Here we select the rows of the table that have an sc value of 0, and assign the result to the pheno3 object.

pheno3 = pheno_table2.filter(pheno_table2.sc == 0)
pheno3.show()
2023-04-18 20:01:22 Hail: INFO: Coerced sorted dataset
fid               iid               father_iid  mother_iid  sex_code  case_control_status  sc     ccs
str               str               int32       int32       int32     int32                int32  int32
"sample_1_10"     "sample_1_10"     0           0           1         2                    0      1
"sample_1_10020"  "sample_1_10020"  0           0           1         2                    0      1
"sample_1_10023"  "sample_1_10023"  0           0           1         1                    0      0
"sample_1_10037"  "sample_1_10037"  0           0           1         1                    0      0
"sample_1_10047"  "sample_1_10047"  0           0           1         1                    0      0
"sample_1_10056"  "sample_1_10056"  0           0           1         2                    0      1
"sample_1_10074"  "sample_1_10074"  0           0           1         2                    0      1
"sample_1_10077"  "sample_1_10077"  0           0           1         1                    0      0
"sample_1_10082"  "sample_1_10082"  0           0           1         1                    0      0
"sample_1_10092"  "sample_1_10092"  0           0           1         1                    0      0

showing top 10 rows

6.8 Getting Categorical Breakdowns

We can count categories using pheno3.aggregate() and hl.agg.counter(). Here we’re getting the categorical breakdown of the recoded case/control status variable, pheno3.ccs.

Keep in mind that these operations can be expensive to calculate in terms of CPU time. That doesn’t mean that you shouldn’t do it, but even getting descriptive statistics on a Hail Table can take some time.

In our case, our pheno3 Hail Table is pretty small, so we can go ahead and get information about the categorical breakdown for this variable.

pheno3.aggregate(hl.agg.counter(pheno3.ccs))
frozendict({0: 4637, 1: 4503})
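
If we want several summaries in a single pass over the table, we can combine aggregators with hl.struct(). A minimal sketch using the columns we created above:

# Compute ccs category counts and numeric stats for sc in one pass
result = pheno_table2.aggregate(hl.struct(
    ccs_counts=hl.agg.counter(pheno_table2.ccs),
    sc_stats=hl.agg.stats(pheno_table2.sc),
))
print(result.ccs_counts)
print(result.sc_stats.mean)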

6.9 Plotting the Hail Table

We can plot the Hail Table using hl.plot.histogram() on our columns of interest. There are a number of different kinds of plots in hl.plot.

The first thing we need to do is to load some modules from Bokeh, which will make our graphs more interactive.

from bokeh.io import output_notebook, show
output_notebook()
Loading BokehJS ...

Now we can use hl.plot.histogram() on columns in our Hail Table. We’ll see that the plotting works similarly for MatrixTables.

p = hl.plot.histogram(pheno_table2.ccs, title='Distribution of Cases/Controls', legend='ccs')
show(p)
p = hl.plot.histogram(pheno_table2.sc, range=(0,1), bins=100, title='Distribution of Gender', legend='sc')
show(p)
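
If you want to keep a plot beyond the notebook session, Bokeh can also write it to a standalone HTML file. A minimal sketch (the filename is arbitrary):

from bokeh.io import output_file, save

output_file("ccs_histogram.html")  # hypothetical output filename
save(p)                            # writes an HTML file you can download from the workspace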

6.10 Working with MatrixTables

[Figure: schematic of a Hail MatrixTable with its attached column and row tables]

MatrixTables are an extension of Hail Tables. A MatrixTable works like a 2-D matrix with two tables attached to it:

  1. Column fields (sample-based operations), accessed using .cols()

  2. Row fields (variant-based operations), accessed using .rows()

If you remember this, MatrixTable operations become much easier to understand.

We’ll load in a MatrixTable that I’ve already created. This was created from the pVCF files in the Geno_Data/ directory using the hl.import_vcf() function, and then written to a dnax database using mt.write(). (see supplementary_notebooks/01_import_pVCF.ipynb).

Note the URL has the format dnax://database-id/mt-name, where database-id is the database’s unique identifier. You can look up this ID from the database name in a project, as in the sketch below.
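
A minimal sketch of that lookup using dxpy (db_mt_test2 is the database name used later in this chapter; substitute your own):

import dxpy

# Look up the database object's unique ID from its name
db_uri = dxpy.find_one_data_object(name="db_mt_test2", classname="database")["id"]
mt_url = f"dnax://{db_uri}/geno.mt"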

# read MT
mt_url="dnax://database-GQYV2Bj04bPg9X3KfFyj2jyY/geno.mt"
mt = hl.read_matrix_table(mt_url)
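
As a quick sanity check on what we loaded, we can ask for the MatrixTable’s dimensions. This is our own suggestion, not part of the original notebook; note that count() visits every partition, so it can take a while on large data:

# count() returns a (rows, columns) tuple: (number of variants, number of samples)
n_variants, n_samples = mt.count()
print(n_variants, n_samples)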

We can see from using .describe() that there are both column fields and row fields; there are many more row fields than column fields. When we merge the pheno data into our MatrixTable later, that operation will add more column fields.

# View structure of MT before QC

mt.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        BaseQRankSum: float64, 
        ClippingRankSum: float64, 
        DB: bool, 
        DP: int32, 
        FS: float64, 
        InbreedingCoeff: float64, 
        MQ: float64, 
        MQRankSum: float64, 
        QD: float64, 
        ReadPosRankSum: float64, 
        SOR: float64, 
        VQSLOD: float64, 
        VQSR_culprit: str, 
        VQSR_NEGATIVE_TRAIN_SITE: bool, 
        VQSR_POSITIVE_TRAIN_SITE: bool, 
        GQ_HIST_ALT: array<str>, 
        DP_HIST_ALT: array<str>, 
        AB_HIST_ALT: array<str>, 
        GQ_HIST_ALL: str, 
        DP_HIST_ALL: str, 
        AB_HIST_ALL: str, 
        AC_AFR: array<int32>, 
        AC_AMR: array<int32>, 
        AC_ASJ: array<int32>, 
        AC_EAS: array<int32>, 
        AC_FIN: array<int32>, 
        AC_NFE: array<int32>, 
        AC_OTH: array<int32>, 
        AC_SAS: array<int32>, 
        AC_Male: array<int32>, 
        AC_Female: array<int32>, 
        AN_AFR: int32, 
        AN_AMR: int32, 
        AN_ASJ: int32, 
        AN_EAS: int32, 
        AN_FIN: int32, 
        AN_NFE: int32, 
        AN_OTH: int32, 
        AN_SAS: int32, 
        AN_Male: int32, 
        AN_Female: int32, 
        AF_AFR: array<float64>, 
        AF_AMR: array<float64>, 
        AF_ASJ: array<float64>, 
        AF_EAS: array<float64>, 
        AF_FIN: array<float64>, 
        AF_NFE: array<float64>, 
        AF_OTH: array<float64>, 
        AF_SAS: array<float64>, 
        AF_Male: array<float64>, 
        AF_Female: array<float64>, 
        GC_AFR: array<int32>, 
        GC_AMR: array<int32>, 
        GC_ASJ: array<int32>, 
        GC_EAS: array<int32>, 
        GC_FIN: array<int32>, 
        GC_NFE: array<int32>, 
        GC_OTH: array<int32>, 
        GC_SAS: array<int32>, 
        GC_Male: array<int32>, 
        GC_Female: array<int32>, 
        AC_raw: array<int32>, 
        AN_raw: int32, 
        AF_raw: array<float64>, 
        GC_raw: array<int32>, 
        GC: array<int32>, 
        Hom_AFR: array<int32>, 
        Hom_AMR: array<int32>, 
        Hom_ASJ: array<int32>, 
        Hom_EAS: array<int32>, 
        Hom_FIN: array<int32>, 
        Hom_NFE: array<int32>, 
        Hom_OTH: array<int32>, 
        Hom_SAS: array<int32>, 
        Hom_Male: array<int32>, 
        Hom_Female: array<int32>, 
        Hom_raw: array<int32>, 
        Hom: array<int32>, 
        STAR_AC: int32, 
        STAR_AC_raw: int32, 
        STAR_Hom: int32, 
        POPMAX: array<str>, 
        AC_POPMAX: array<int32>, 
        AN_POPMAX: array<int32>, 
        AF_POPMAX: array<float64>, 
        DP_MEDIAN: array<int32>, 
        DREF_MEDIAN: array<float64>, 
        GQ_MEDIAN: array<int32>, 
        AB_MEDIAN: array<float64>, 
        AS_RF: array<float64>, 
        AS_FilterStatus: array<str>, 
        AS_RF_POSITIVE_TRAIN: array<int32>, 
        AS_RF_NEGATIVE_TRAIN: array<int32>, 
        AC_AFR_Male: array<int32>, 
        AC_AMR_Male: array<int32>, 
        AC_ASJ_Male: array<int32>, 
        AC_EAS_Male: array<int32>, 
        AC_FIN_Male: array<int32>, 
        AC_NFE_Male: array<int32>, 
        AC_OTH_Male: array<int32>, 
        AC_SAS_Male: array<int32>, 
        AC_AFR_Female: array<int32>, 
        AC_AMR_Female: array<int32>, 
        AC_ASJ_Female: array<int32>, 
        AC_EAS_Female: array<int32>, 
        AC_FIN_Female: array<int32>, 
        AC_NFE_Female: array<int32>, 
        AC_OTH_Female: array<int32>, 
        AC_SAS_Female: array<int32>, 
        AN_AFR_Male: int32, 
        AN_AMR_Male: int32, 
        AN_ASJ_Male: int32, 
        AN_EAS_Male: int32, 
        AN_FIN_Male: int32, 
        AN_NFE_Male: int32, 
        AN_OTH_Male: int32, 
        AN_SAS_Male: int32, 
        AN_AFR_Female: int32, 
        AN_AMR_Female: int32, 
        AN_ASJ_Female: int32, 
        AN_EAS_Female: int32, 
        AN_FIN_Female: int32, 
        AN_NFE_Female: int32, 
        AN_OTH_Female: int32, 
        AN_SAS_Female: int32, 
        AF_AFR_Male: array<float64>, 
        AF_AMR_Male: array<float64>, 
        AF_ASJ_Male: array<float64>, 
        AF_EAS_Male: array<float64>, 
        AF_FIN_Male: array<float64>, 
        AF_NFE_Male: array<float64>, 
        AF_OTH_Male: array<float64>, 
        AF_SAS_Male: array<float64>, 
        AF_AFR_Female: array<float64>, 
        AF_AMR_Female: array<float64>, 
        AF_ASJ_Female: array<float64>, 
        AF_EAS_Female: array<float64>, 
        AF_FIN_Female: array<float64>, 
        AF_NFE_Female: array<float64>, 
        AF_OTH_Female: array<float64>, 
        AF_SAS_Female: array<float64>, 
        GC_AFR_Male: array<int32>, 
        GC_AMR_Male: array<int32>, 
        GC_ASJ_Male: array<int32>, 
        GC_EAS_Male: array<int32>, 
        GC_FIN_Male: array<int32>, 
        GC_NFE_Male: array<int32>, 
        GC_OTH_Male: array<int32>, 
        GC_SAS_Male: array<int32>, 
        GC_AFR_Female: array<int32>, 
        GC_AMR_Female: array<int32>, 
        GC_ASJ_Female: array<int32>, 
        GC_EAS_Female: array<int32>, 
        GC_FIN_Female: array<int32>, 
        GC_NFE_Female: array<int32>, 
        GC_OTH_Female: array<int32>, 
        GC_SAS_Female: array<int32>, 
        Hemi_AFR: array<int32>, 
        Hemi_AMR: array<int32>, 
        Hemi_ASJ: array<int32>, 
        Hemi_EAS: array<int32>, 
        Hemi_FIN: array<int32>, 
        Hemi_NFE: array<int32>, 
        Hemi_OTH: array<int32>, 
        Hemi_SAS: array<int32>, 
        Hemi: array<int32>, 
        Hemi_raw: array<int32>, 
        STAR_Hemi: int32
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

We can see the actual genotype calls for each locus and each sample using mt.GT.show():

mt.GT.show()
                           'sample_1_0'  'sample_1_1'  'sample_1_2'  'sample_1_3'
locus          alleles     GT            GT            GT            GT
locus<GRCh38>  array<str>  call          call          call          call
chr1:12198     ["G","C"]   0/1           0/0           1/1           0/0
chr1:12237     ["G","A"]   1/1           0/0           0/1           0/0
chr1:12259     ["G","C"]   1/1           0/0           0/1           0/0
chr1:12266     ["G","A"]   0/1           1/1           0/1           0/0
chr1:12272     ["G","A"]   0/0           0/1           0/0           0/1
chr1:12554     ["A","G"]   0/0           0/0           0/0           0/0
chr1:12559     ["G","A"]   0/0           0/0           0/0           0/0
chr1:12573     ["T","C"]   0/0           0/0           0/0           0/0
chr1:12586     ["C","T"]   0/0           0/0           0/0           0/0
chr1:12596     ["C","A"]   0/0           0/0           0/0           0/0

showing top 10 rows

showing the first 4 of 100000 columns

If we do a .show() on mt.row_key, we retrieve the keys used in row operations. Here each key is a combination of the locus and the alleles at that position.

mt.row_key.show()
locus          alleles
locus<GRCh38>  array<str>
chr1:12198     ["G","C"]
chr1:12237     ["G","A"]
chr1:12259     ["G","C"]
chr1:12266     ["G","A"]
chr1:12272     ["G","A"]
chr1:12554     ["A","G"]
chr1:12559     ["G","A"]
chr1:12573     ["T","C"]
chr1:12586     ["C","T"]
chr1:12596     ["C","A"]

showing top 10 rows

If we do a mt.rows().show(), we can see the row fields themselves. Remember, the keys for this table are locus and alleles.

mt.rows().show()

[Output: the first few rows of mt.rows(), showing the locus, alleles, rsid, qual, filters, and info fields]

We can access a row field (here AF, the allele frequency) by first going into the .info slot and then accessing the AF field within it. Keep this in mind for when you are using .filter_rows(), since this is how you will access the fields.

mt.info.AF.show()
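
Since mt.info.AF is an array (one entry per alternate allele), we index into it when filtering rows. A minimal sketch of a row filter on allele frequency (the 1% threshold is illustrative):

# Keep only variants whose first alternate allele has a frequency above 1%
common_mt = mt.filter_rows(mt.info.AF[0] > 0.01)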

If we use mt.aggregate_entries() and hl.agg.counter() on mt.GT.n_alt_alleles(), we will get the distribution of zygosities for our entire MatrixTable.

mt.aggregate_entries(hl.agg.counter(mt.GT.n_alt_alleles()))

6.11 Run Hail’s Sample QC method on a MatrixTable

If we use hl.sample_qc() on our MatrixTable, it will create a new MatrixTable with the QC information attached. You’ll notice that we get an additional column field called sample_qc. This is a nested column field - it contains additional metrics within it.

Note that we’re not going to create the sample QC table here - we will load it from the DNAX store when we do our GWAS. This is because it takes some time to calculate. For more about sample QC, refer to the OpenBio Sample QC notebook here: https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/sample_qc.ipynb

Additional documentation: https://hail.is/docs/0.2/methods/genetics.html#hail.methods.sample_qc

# DON'T RUN in class
# Run sample-level QC

qc_mt = hl.sample_qc(mt)
# View structure of MT after QC
# Don't Run this cell

qc_mt.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'sample_qc': struct {
        dp_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_hom_ref: int64, 
        n_het: int64, 
        n_hom_var: int64, 
        n_non_ref: int64, 
        n_singleton: int64, 
        n_snp: int64, 
        n_insertion: int64, 
        n_deletion: int64, 
        n_transition: int64, 
        n_transversion: int64, 
        n_star: int64, 
        r_ti_tv: float64, 
        r_het_hom_var: float64, 
        r_insertion_deletion: float64
    }
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        ... (fields identical to the info struct in the pre-QC describe() output above)
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

We can see that a new column field called sample_qc has been added to the MT. Note that the sample_qc metrics have not actually been calculated yet - Hail evaluates lazily, so the metrics are computed when we force evaluation with a .show().

This step took a long time to compute, over 20 minutes on a six-node cluster (mem2_ssd1_v2_x8 instances), so we won’t run it here.

%%timeit
qc_mt.cols().show()
2023-04-11 16:20:15 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.

    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
s  dp_stats.mean  dp_stats.stdev  dp_stats.min  dp_stats.max  gq_stats.mean  gq_stats.stdev  gq_stats.min  gq_stats.max  call_rate  n_called  n_not_called  n_filtered  n_hom_ref  n_het  n_hom_var  n_non_ref  n_singleton  n_snp  n_insertion  n_deletion  n_transition  n_transversion  n_star  r_ti_tv  r_het_hom_var  r_insertion_deletion
str float64 float64 float64 float64 float64 float64 float64 float64 float64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 float64 float64 float64
"sample_1_0" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3369261 34252 0 3355042 9260 4959 14219 5 16921 997 1379 11724 5197 0 2.26e+00 1.87e+00 7.23e-01
"sample_1_1" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3370000 33513 0 3355843 9177 4980 14157 1 16877 949 1425 11640 5237 0 2.22e+00 1.84e+00 6.66e-01
"sample_1_10" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3371138 32375 0 3357104 9121 4913 14034 3 16724 982 1356 11596 5128 0 2.26e+00 1.86e+00 7.24e-01
"sample_1_100" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3369744 33769 0 3355741 9113 4890 14003 2 16635 1004 1368 11567 5068 0 2.28e+00 1.86e+00 7.34e-01
"sample_1_1000" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3369661 33852 0 3355634 9190 4837 14027 3 16604 995 1361 11540 5064 0 2.28e+00 1.90e+00 7.31e-01
"sample_1_10000" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3368966 34547 0 3354842 9237 4887 14124 0 16749 986 1387 11647 5102 0 2.28e+00 1.89e+00 7.11e-01
"sample_1_10001" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3368745 34768 0 3354682 9160 4903 14063 2 16732 1006 1337 11633 5099 0 2.28e+00 1.87e+00 7.52e-01
"sample_1_10002" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3368654 34859 0 3354586 9102 4966 14068 4 16792 973 1389 11634 5158 0 2.26e+00 1.83e+00 7.01e-01
"sample_1_10003" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3371010 32503 0 3356918 9221 4871 14092 3 16674 998 1413 11508 5166 0 2.23e+00 1.89e+00 7.06e-01
"sample_1_10004" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3368964 34549 0 3354902 9205 4857 14062 2 16693 1029 1332 11563 5130 0 2.25e+00 1.90e+00 7.73e-01

showing top 10 rows

We see that there is a field in qc_mt.sample_qc called call_rate. When we want to do something with it, we can access it as qc_mt.sample_qc.call_rate or qc_mt["sample_qc"]["call_rate"] - for example, to filter samples on call rate, as sketched below.
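
A minimal sketch of such a filter (the 0.99 threshold is illustrative, not a recommendation):

# Keep only samples whose call rate exceeds 99%
filtered_mt = qc_mt.filter_cols(qc_mt.sample_qc.call_rate > 0.99)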

6.12 Create Sample QC Table and Save in an Apollo Database

Now that we have calculated our sample QC metrics, we can save them as a separate table. We do this mostly for convenience and speed.

In our case, this has already been done, so we’re just showing the process of storing a Hail Table in DNAX.

Note that we could also store the MatrixTable itself in DNAX with mt.write() (sketched below, using the database ID). Using DNAX is helpful because Hail can read the MatrixTable directly from DNAX.
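
A minimal sketch of that MatrixTable write, assuming db_uri holds the database ID found with dxpy as in the cell further below:

# Hypothetical: persist the full MatrixTable to DNAX, addressed by database ID
mt.write(f"dnax://{db_uri}/geno.mt")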

# Create Hail Table from MT
# note we use .cols() to access the column (sample) fields
qc_tb = qc_mt.cols()
# Define database and table name

# Note: It is recommended to only use lowercase letters for the database name.
# If uppercase lettering is used, the database name will be lowercased when creating the database.
db_name = "db_mt_test2"
tb_name = "sample_qc.ht"
# DON'T RUN in class
# Create database in DNAX

stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
print(stmt)
spark.sql(stmt).show()
CREATE DATABASE IF NOT EXISTS db_mt_test2 LOCATION 'dnax://'

++
||
++
++

# DON'T RUN in class
# Store Table in DNAX
import dxpy

# find database ID of newly created database using a dxpy method
db_uri = dxpy.find_one_data_object(name=f"{db_name}", classname="database")['id']
url = f"dnax://{db_uri}/{tb_name}"

# Before this step, the Hail Table is just an object in memory. To persist it and be able to access 
# it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
# See https://hail.is/docs/0.2/hail.Table.html#hail.Table.write for additional documentation.
qc_tb.write(url) # Note: output should describe size of Table (i.e. number of rows, partitions)
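
Once written, the table can be read back directly from DNAX in a later session (for example, in the GWAS notebook). A minimal sketch:

# Read the persisted sample QC table back from DNAX
qc_tb = hl.read_table(url)
qc_tb.describe()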