6  Hail Skills and Concepts

6.1 Learning Objectives

After reading this chapter, you should be able to:

  • Explain basic Hail terms and concepts, including Hail Tables and Hail MatrixTables
  • Monitor progress using the Spark UI
  • Execute Hail operations on Tables and MatrixTables, including annotating, filtering, and joining
  • Calculate Sample QC metrics and save as a Hail Table in DNAX

This chapter shows how to perform sample QC on a Hail MatrixTable as a pre-GWAS step and save it as a Table to an Apollo database (dnax://) on the DNAnexus platform. See documentation for guidance on launch specs for the JupyterLab with Spark Cluster app for different data sizes: https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data

Note: For population scale data, samples may be referred to as individuals. In this chapter, the word “sample” will be used.

6.2 What is Hail?

Hail is a genomics framework built on top of Spark that enables scalable genomic queries. It uses many of the same concepts as Spark (partitions, executors, query plans), but it is genomics-focused.

For example, a Hail MatrixTable can be QC’ed and then filtered on the resulting QC metrics.

6.2.1 Vocabulary

DNAnexus Specific:

  • DNAX - an S3 bucket for storing Apollo databases and tables - accessed with the dnax:// protocol
  • Database Object - a data object on the platform that represents a Parquet database - it has a unique identifier

Hail and Spark Terminology:

  • Spark Driver - the “boss” of the Spark cluster - hosts the Spark UI
  • Spark Worker - an individual node that houses executors
  • Executor - an individual core/CPU within a worker - executors are assigned tasks, just as in plain Spark
  • Data Partition - the portion of the data accessed by a worker, stored in Parquet columnar format
  • Key - a unique identifier for rows or columns
    • row keys: locus + alleles
    • column keys: sample IDs

In the rest of this chapter, we’ll discuss some of the operations that can be done on Hail Tables and MatrixTables, the two fundamental data structures of Hail.

6.3 Initialize Spark and Hail

Make sure to open the Spark UI at https://job-url:8081/jobs/ so you can monitor the progress of the jobs launched by the cells below.

# Running this cell will output a red-colored message- this is expected.
# The 'Welcome to Hail' message in the output will indicate that Hail is ready to use in the notebook.

from pyspark.sql import SparkSession
import hail as hl

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)
pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 2.4.4
SparkUI available at http://ip-172-31-34-162.ec2.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.78-b17627756568
LOGGING: writing to /opt/notebooks/hail-20230418-1956-0.2.78-b17627756568.log

Now we are ready to connect and do work with Hail.

Note that you should only initialize Hail once per notebook. If you try to initialize it multiple times, you will get an error.

If that happens, restart the notebook kernel (Notebook > Restart Kernel in the menu) and execute the notebook again from the top.
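
If you would rather not restart, one defensive pattern (our suggestion, not part of the original notebook) is to wrap the initialization in a try/except so that re-running the cell is harmless:

# Attempt initialization; if Hail is already running in this kernel,
# keep the existing session instead of failing the cell.
try:
    hl.init(sc=spark.sparkContext)
except Exception as err:
    print(f"Hail appears to be initialized already: {err}")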

6.4 Important Data Structures in Hail: Tables and MatrixTables

Before we get started with doing a GWAS in Hail, let’s talk about the data structures we’ll work with: Hail Tables and Hail MatrixTables.

One of the hardest things to understand about Hail data structures is that they are actually linked data structures. It’s easiest to think of them as data tables with attached metadata.

6.5 Hail Table

The fundamental data structure in Hail is the Hail Table.

I think of a Hail Table as a regular data table with a set of metadata fields attached. These metadata fields are known as global fields. You can think of them as annotations for the entire table, such as the instrument used to generate the data, the software version used to process it, etc.

The tabular data itself is stored in row fields.
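
To make the difference between global fields and row fields concrete, here is a minimal sketch (the table and the annotation values are invented for illustration):

# A tiny 3-row table; range_table gives a single row field, 'idx'
ht = hl.utils.range_table(3)

# Attach table-wide metadata as global fields
ht = ht.annotate_globals(instrument="NovaSeq", pipeline_version="1.0")

ht.describe()           # lists global fields and row fields separately
hl.eval(ht.instrument)  # a global field can be read without scanning rows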

6.6 Load in a Table from Project Storage

The first thing we’ll do in Hail is load a phenotype (pheno) file from project storage.

We’ll use hl.import_table() for this. We have to specify our key for the file. In our table, we have both fid (family ID) and iid (individual ID). We will use iid (individual ID) as our key.

# Import the pheno CSV file as a Hail Table

pheno_table = hl.import_table("file:///mnt/project/data/ukbb_100k_bmi_casecontrol.csv",
                              delimiter=',',
                              impute=True,
                              key='iid') # specify the column that will be the key (values must match what is in the MT 's' column)
2023-04-18 19:57:56 Hail: INFO: Reading table to impute column types
2023-04-18 19:58:01 Hail: INFO: Finished type imputation
  Loading field 'fid' as type str (imputed)
  Loading field 'iid' as type str (imputed)
  Loading field 'father_iid' as type int32 (imputed)
  Loading field 'mother_iid' as type int32 (imputed)
  Loading field 'sex_code' as type int32 (imputed)
  Loading field 'case_control_status' as type int32 (imputed)

6.6.1 Quick Hail Table EDA

We want to confirm that we loaded the data correctly. There are four really helpful methods for this:

  • .describe() - gives the fields and their data types for our Hail Table
  • .count() - counts the number of rows in our Hail Table
  • .show() - shows the first few rows of our dataset
  • .summarize() - gives us information about missing values and data types

Let’s start with .describe() on our loaded Pheno Table. This method will show the different fields associated with our Hail Table. Note that the actual table is referred to as Row fields, and you can see the names of the individual columns in the dataset.

pheno_table.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'fid': str 
    'iid': str 
    'father_iid': int32 
    'mother_iid': int32 
    'sex_code': int32 
    'case_control_status': int32 
----------------------------------------
Key: ['iid']
----------------------------------------

If we do a count, we can see that it counts the number of rows in our table. This can be really helpful in confirming the data was loaded correctly.

Keep in mind that with larger tables, the .count() operation can take some time to execute. That’s because it is a command that needs to be done on all of the partitions of the dataset.

pheno_table.count()
19884

.show() will show us the first few rows of our Hail Table. Similar to Spark, if we pass an integer to .show() (like .show(10)) it will show that number of rows instead.

pheno_table.show()
fid               iid               father_iid  mother_iid  sex_code  case_control_status
str               str               int32       int32       int32     int32
"sample_1_10"     "sample_1_10"     0           0           1         2
"sample_1_10003"  "sample_1_10003"  0           0           2         2
"sample_1_10014"  "sample_1_10014"  0           0           2         1
"sample_1_10020"  "sample_1_10020"  0           0           1         2
"sample_1_10021"  "sample_1_10021"  0           0           2         1
"sample_1_10023"  "sample_1_10023"  0           0           1         1
"sample_1_10036"  "sample_1_10036"  0           0           2         1
"sample_1_10037"  "sample_1_10037"  0           0           1         1
"sample_1_10047"  "sample_1_10047"  0           0           1         1
"sample_1_10056"  "sample_1_10056"  0           0           1         2

showing top 10 rows

Calling .summarize() on a Hail Table gives you an overview of missing values and data types for each of the columns in your Hail Table. This is extremely helpful for confirming the data was loaded and parsed correctly.

pheno_table.summarize()

6.6.2 Diving into individual row fields

For our pheno_table, we can also show a single column by using the . accessor together with .show(). By default, it will show the column along with the row key for the dataset (iid).

pheno_table.case_control_status.show()
iid               case_control_status
str               int32
"sample_1_10"     2
"sample_1_10003"  2
"sample_1_10014"  1
"sample_1_10020"  2
"sample_1_10021"  1
"sample_1_10023"  1
"sample_1_10036"  1
"sample_1_10037"  1
"sample_1_10047"  1
"sample_1_10056"  2

showing top 10 rows

An additional method we can use on numeric columns is .summarize(), which gives a numeric summary of the column. We don’t have a continuous numeric column in this table, so we’ll demonstrate it on the integer-coded sex_code column:

pheno_table.sex_code.summarize()

19884 records.

sex_code (int32):
    Non-missing 19884 (100.00%)
    Missing 0
    Minimum 1
    Maximum 2
    Mean 1.54
    Std Dev 0.50

6.7 Transforming Hail Tables

There are multiple operations we can perform on a Hail Table (we’ll call our Hail Table ht here):

  • ht.filter()
  • ht.annotate()

Let’s investigate these.

6.7.1 Specifying new columns using .annotate()

We can define new columns in our Hail Table using the .annotate() method. This is a lot like the mutate() function in R/Tidyverse. We can compute new values based on other columns in our dataset.

Here we’re subtracting 1 from both the sex_code and case_control_status columns so that we can use them as a phenotype and a covariate in our downstream analysis.

pheno_table2 = pheno_table.annotate(sc=pheno_table.sex_code - 1, ccs=pheno_table.case_control_status - 1)
pheno_table2.show()
2023-04-18 21:01:24 Hail: INFO: Coerced sorted dataset
fid               iid               father_iid  mother_iid  sex_code  case_control_status  sc     ccs
str               str               int32       int32       int32     int32                int32  int32
"sample_1_10"     "sample_1_10"     0           0           1         2                    0      1
"sample_1_10003"  "sample_1_10003"  0           0           2         2                    1      1
"sample_1_10014"  "sample_1_10014"  0           0           2         1                    1      0
"sample_1_10020"  "sample_1_10020"  0           0           1         2                    0      1
"sample_1_10021"  "sample_1_10021"  0           0           2         1                    1      0
"sample_1_10023"  "sample_1_10023"  0           0           1         1                    0      0
"sample_1_10036"  "sample_1_10036"  0           0           2         1                    1      0
"sample_1_10037"  "sample_1_10037"  0           0           1         1                    0      0
"sample_1_10047"  "sample_1_10047"  0           0           1         1                    0      0
"sample_1_10056"  "sample_1_10056"  0           0           1         2                    0      1

showing top 10 rows

6.7.2 Filtering our pheno_table using .filter()

The .filter() method is used a lot in Hail for subsetting tables. We’ll leverage this later to filter our QC tables and then subset our MatrixTable with the filtered table. Here we select the rows of the table that have an sc value of 0, and assign the result to the pheno3 object.

pheno3 = pheno_table2.filter(pheno_table2.sc == 0)
pheno3.show()
2023-04-18 20:01:22 Hail: INFO: Coerced sorted dataset
fid               iid               father_iid  mother_iid  sex_code  case_control_status  sc     ccs
str               str               int32       int32       int32     int32                int32  int32
"sample_1_10"     "sample_1_10"     0           0           1         2                    0      1
"sample_1_10020"  "sample_1_10020"  0           0           1         2                    0      1
"sample_1_10023"  "sample_1_10023"  0           0           1         1                    0      0
"sample_1_10037"  "sample_1_10037"  0           0           1         1                    0      0
"sample_1_10047"  "sample_1_10047"  0           0           1         1                    0      0
"sample_1_10056"  "sample_1_10056"  0           0           1         2                    0      1
"sample_1_10074"  "sample_1_10074"  0           0           1         2                    0      1
"sample_1_10077"  "sample_1_10077"  0           0           1         1                    0      0
"sample_1_10082"  "sample_1_10082"  0           0           1         1                    0      0
"sample_1_10092"  "sample_1_10092"  0           0           1         1                    0      0

showing top 10 rows

6.8 Getting Categorical Breakdowns

We can count categories using pheno3.aggregate() and hl.agg.counter(). Here we’re getting the categorical breakdown of the recoded case/control status variable, pheno3.ccs.

Keep in mind that these operations can be expensive to calculate in terms of CPU time. That doesn’t mean that you shouldn’t do it, but even getting descriptive statistics on a Hail Table can take some time.

In our case, our pheno3 Hail Table is pretty small, so we can go ahead and get information about the categorical breakdown for this variable.

pheno3.aggregate(hl.agg.counter(pheno3.ccs))
frozendict({0: 4637, 1: 4503})
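
If we want several summaries in a single pass over the table, we can combine aggregators with hl.struct(). A minimal sketch using the columns we created above:

# Compute ccs category counts and numeric stats for sc in one pass
result = pheno_table2.aggregate(hl.struct(
    ccs_counts=hl.agg.counter(pheno_table2.ccs),
    sc_stats=hl.agg.stats(pheno_table2.sc),
))
print(result.ccs_counts)
print(result.sc_stats.mean)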

6.9 Plotting the Hail Table

We can plot the Hail Table using hl.plot.histogram() on our columns of interest. There are a number of different kinds of plots in hl.plot.

The first thing we need to do is to load some modules from Bokeh, which will make our graphs more interactive.

from bokeh.io import output_notebook, show
output_notebook()
Loading BokehJS ...

Now we can use hl.plot.histogram() on columns in our Hail Table. We’ll see that the plotting works similarly for MatrixTables.

p = hl.plot.histogram(pheno_table2.ccs, title='Distribution of Cases/Controls', legend='ccs')
show(p)
p = hl.plot.histogram(pheno_table2.sc, range=(0,1), bins=100, title='Distribution of Gender', legend='sc')
show(p)
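
If you want to keep a plot beyond the notebook session, Bokeh can also write it to a standalone HTML file. A minimal sketch (the filename is arbitrary):

from bokeh.io import output_file, save

output_file("ccs_histogram.html")  # hypothetical output filename
save(p)                            # writes an HTML file you can download from the workspace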

6.10 Working with MatrixTables

[Figure: schematic of a Hail MatrixTable with its attached column and row tables]

MatrixTables are an extension of Hail Tables. A MatrixTable works like a 2-D matrix with two tables attached to it:

  1. Column fields (sample-based operations), accessed using .cols()

  2. Row fields (variant-based operations), accessed using .rows()

If you remember this, MatrixTable operations become much easier to understand.

We’ll load in a MatrixTable that I’ve already created. This was created from the pVCF files in the Geno_Data/ directory using the hl.import_vcf() function, and then written to a dnax database using mt.write(). (see supplementary_notebooks/01_import_pVCF.ipynb).

Note the URL has the format dnax://database-id/mt-name, where database-id is the database’s unique identifier. You can look up this ID from the database name in a project, as in the sketch below.
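
A minimal sketch of that lookup using dxpy (db_mt_test2 is the database name used later in this chapter; substitute your own):

import dxpy

# Look up the database object's unique ID from its name
db_uri = dxpy.find_one_data_object(name="db_mt_test2", classname="database")["id"]
mt_url = f"dnax://{db_uri}/geno.mt"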

# read MT
mt_url="dnax://database-GQYV2Bj04bPg9X3KfFyj2jyY/geno.mt"
mt = hl.read_matrix_table(mt_url)
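
As a quick sanity check on what we loaded, we can ask for the MatrixTable’s dimensions. This is our own suggestion, not part of the original notebook; note that count() visits every partition, so it can take a while on large data:

# count() returns a (rows, columns) tuple: (number of variants, number of samples)
n_variants, n_samples = mt.count()
print(n_variants, n_samples)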

We can see from using .describe() that there are both column fields and row fields; there are many more row fields than column fields. When we merge the pheno data into our MatrixTable later, that operation will add more column fields.

# View structure of MT before QC

mt.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        BaseQRankSum: float64, 
        ClippingRankSum: float64, 
        DB: bool, 
        DP: int32, 
        FS: float64, 
        InbreedingCoeff: float64, 
        MQ: float64, 
        MQRankSum: float64, 
        QD: float64, 
        ReadPosRankSum: float64, 
        SOR: float64, 
        VQSLOD: float64, 
        VQSR_culprit: str, 
        VQSR_NEGATIVE_TRAIN_SITE: bool, 
        VQSR_POSITIVE_TRAIN_SITE: bool, 
        GQ_HIST_ALT: array<str>, 
        DP_HIST_ALT: array<str>, 
        AB_HIST_ALT: array<str>, 
        GQ_HIST_ALL: str, 
        DP_HIST_ALL: str, 
        AB_HIST_ALL: str, 
        AC_AFR: array<int32>, 
        AC_AMR: array<int32>, 
        AC_ASJ: array<int32>, 
        AC_EAS: array<int32>, 
        AC_FIN: array<int32>, 
        AC_NFE: array<int32>, 
        AC_OTH: array<int32>, 
        AC_SAS: array<int32>, 
        AC_Male: array<int32>, 
        AC_Female: array<int32>, 
        AN_AFR: int32, 
        AN_AMR: int32, 
        AN_ASJ: int32, 
        AN_EAS: int32, 
        AN_FIN: int32, 
        AN_NFE: int32, 
        AN_OTH: int32, 
        AN_SAS: int32, 
        AN_Male: int32, 
        AN_Female: int32, 
        AF_AFR: array<float64>, 
        AF_AMR: array<float64>, 
        AF_ASJ: array<float64>, 
        AF_EAS: array<float64>, 
        AF_FIN: array<float64>, 
        AF_NFE: array<float64>, 
        AF_OTH: array<float64>, 
        AF_SAS: array<float64>, 
        AF_Male: array<float64>, 
        AF_Female: array<float64>, 
        GC_AFR: array<int32>, 
        GC_AMR: array<int32>, 
        GC_ASJ: array<int32>, 
        GC_EAS: array<int32>, 
        GC_FIN: array<int32>, 
        GC_NFE: array<int32>, 
        GC_OTH: array<int32>, 
        GC_SAS: array<int32>, 
        GC_Male: array<int32>, 
        GC_Female: array<int32>, 
        AC_raw: array<int32>, 
        AN_raw: int32, 
        AF_raw: array<float64>, 
        GC_raw: array<int32>, 
        GC: array<int32>, 
        Hom_AFR: array<int32>, 
        Hom_AMR: array<int32>, 
        Hom_ASJ: array<int32>, 
        Hom_EAS: array<int32>, 
        Hom_FIN: array<int32>, 
        Hom_NFE: array<int32>, 
        Hom_OTH: array<int32>, 
        Hom_SAS: array<int32>, 
        Hom_Male: array<int32>, 
        Hom_Female: array<int32>, 
        Hom_raw: array<int32>, 
        Hom: array<int32>, 
        STAR_AC: int32, 
        STAR_AC_raw: int32, 
        STAR_Hom: int32, 
        POPMAX: array<str>, 
        AC_POPMAX: array<int32>, 
        AN_POPMAX: array<int32>, 
        AF_POPMAX: array<float64>, 
        DP_MEDIAN: array<int32>, 
        DREF_MEDIAN: array<float64>, 
        GQ_MEDIAN: array<int32>, 
        AB_MEDIAN: array<float64>, 
        AS_RF: array<float64>, 
        AS_FilterStatus: array<str>, 
        AS_RF_POSITIVE_TRAIN: array<int32>, 
        AS_RF_NEGATIVE_TRAIN: array<int32>, 
        AC_AFR_Male: array<int32>, 
        AC_AMR_Male: array<int32>, 
        AC_ASJ_Male: array<int32>, 
        AC_EAS_Male: array<int32>, 
        AC_FIN_Male: array<int32>, 
        AC_NFE_Male: array<int32>, 
        AC_OTH_Male: array<int32>, 
        AC_SAS_Male: array<int32>, 
        AC_AFR_Female: array<int32>, 
        AC_AMR_Female: array<int32>, 
        AC_ASJ_Female: array<int32>, 
        AC_EAS_Female: array<int32>, 
        AC_FIN_Female: array<int32>, 
        AC_NFE_Female: array<int32>, 
        AC_OTH_Female: array<int32>, 
        AC_SAS_Female: array<int32>, 
        AN_AFR_Male: int32, 
        AN_AMR_Male: int32, 
        AN_ASJ_Male: int32, 
        AN_EAS_Male: int32, 
        AN_FIN_Male: int32, 
        AN_NFE_Male: int32, 
        AN_OTH_Male: int32, 
        AN_SAS_Male: int32, 
        AN_AFR_Female: int32, 
        AN_AMR_Female: int32, 
        AN_ASJ_Female: int32, 
        AN_EAS_Female: int32, 
        AN_FIN_Female: int32, 
        AN_NFE_Female: int32, 
        AN_OTH_Female: int32, 
        AN_SAS_Female: int32, 
        AF_AFR_Male: array<float64>, 
        AF_AMR_Male: array<float64>, 
        AF_ASJ_Male: array<float64>, 
        AF_EAS_Male: array<float64>, 
        AF_FIN_Male: array<float64>, 
        AF_NFE_Male: array<float64>, 
        AF_OTH_Male: array<float64>, 
        AF_SAS_Male: array<float64>, 
        AF_AFR_Female: array<float64>, 
        AF_AMR_Female: array<float64>, 
        AF_ASJ_Female: array<float64>, 
        AF_EAS_Female: array<float64>, 
        AF_FIN_Female: array<float64>, 
        AF_NFE_Female: array<float64>, 
        AF_OTH_Female: array<float64>, 
        AF_SAS_Female: array<float64>, 
        GC_AFR_Male: array<int32>, 
        GC_AMR_Male: array<int32>, 
        GC_ASJ_Male: array<int32>, 
        GC_EAS_Male: array<int32>, 
        GC_FIN_Male: array<int32>, 
        GC_NFE_Male: array<int32>, 
        GC_OTH_Male: array<int32>, 
        GC_SAS_Male: array<int32>, 
        GC_AFR_Female: array<int32>, 
        GC_AMR_Female: array<int32>, 
        GC_ASJ_Female: array<int32>, 
        GC_EAS_Female: array<int32>, 
        GC_FIN_Female: array<int32>, 
        GC_NFE_Female: array<int32>, 
        GC_OTH_Female: array<int32>, 
        GC_SAS_Female: array<int32>, 
        Hemi_AFR: array<int32>, 
        Hemi_AMR: array<int32>, 
        Hemi_ASJ: array<int32>, 
        Hemi_EAS: array<int32>, 
        Hemi_FIN: array<int32>, 
        Hemi_NFE: array<int32>, 
        Hemi_OTH: array<int32>, 
        Hemi_SAS: array<int32>, 
        Hemi: array<int32>, 
        Hemi_raw: array<int32>, 
        STAR_Hemi: int32
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

We can see the actual genotype calls for each locus and each sample using mt.GT.show():

mt.GT.show()
                           'sample_1_0'  'sample_1_1'  'sample_1_2'  'sample_1_3'
locus          alleles     GT            GT            GT            GT
locus<GRCh38>  array<str>  call          call          call          call
chr1:12198     ["G","C"]   0/1           0/0           1/1           0/0
chr1:12237     ["G","A"]   1/1           0/0           0/1           0/0
chr1:12259     ["G","C"]   1/1           0/0           0/1           0/0
chr1:12266     ["G","A"]   0/1           1/1           0/1           0/0
chr1:12272     ["G","A"]   0/0           0/1           0/0           0/1
chr1:12554     ["A","G"]   0/0           0/0           0/0           0/0
chr1:12559     ["G","A"]   0/0           0/0           0/0           0/0
chr1:12573     ["T","C"]   0/0           0/0           0/0           0/0
chr1:12586     ["C","T"]   0/0           0/0           0/0           0/0
chr1:12596     ["C","A"]   0/0           0/0           0/0           0/0

showing top 10 rows

showing the first 4 of 100000 columns

If we do a .show() on mt.row_key, we retrieve the keys used in row operations. Here each key is a combination of the locus and the alleles at that position.

mt.row_key.show()
locus          alleles
locus<GRCh38>  array<str>
chr1:12198     ["G","C"]
chr1:12237     ["G","A"]
chr1:12259     ["G","C"]
chr1:12266     ["G","A"]
chr1:12272     ["G","A"]
chr1:12554     ["A","G"]
chr1:12559     ["G","A"]
chr1:12573     ["T","C"]
chr1:12586     ["C","T"]
chr1:12596     ["C","A"]

showing top 10 rows

If we do a mt.rows().show(), we can see the row fields themselves. Remember, the keys for this table are locus and alleles.

mt.rows().show()

[Output: the first few rows of mt.rows(), showing the locus, alleles, rsid, qual, filters, and info fields]

We can access a row field (here AF, the allele frequency) by first going into the .info slot and then accessing the AF field within it. Keep this in mind for when you are using .filter_rows(), since this is how you will access the fields.

mt.info.AF.show()
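
Since mt.info.AF is an array (one entry per alternate allele), we index into it when filtering rows. A minimal sketch of a row filter on allele frequency (the 1% threshold is illustrative):

# Keep only variants whose first alternate allele has a frequency above 1%
common_mt = mt.filter_rows(mt.info.AF[0] > 0.01)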

If we use mt.aggregate_entries() and hl.agg.counter() on mt.GT.n_alt_alleles(), we will get the distribution of zygosities for our entire MatrixTable.

mt.aggregate_entries(hl.agg.counter(mt.GT.n_alt_alleles()))

6.11 Run Hail’s Sample QC method on a MatrixTable

If we use hl.sample_qc() on our MatrixTable, it will create a new MatrixTable with the QC information attached. You’ll notice that we get an additional column field called sample_qc. This is a nested column field - it contains additional metrics within it.

Note that we’re not going to create the sample QC table here - we will load it from the DNAX store when we do our GWAS. This is because it takes some time to calculate. For more about sample QC, refer to the OpenBio Sample QC notebook here: https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/sample_qc.ipynb

Additional documentation: https://hail.is/docs/0.2/methods/genetics.html#hail.methods.sample_qc

# DON'T RUN in class
# Run sample-level QC

qc_mt = hl.sample_qc(mt)
# View structure of MT after QC
# Don't Run this cell

qc_mt.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'sample_qc': struct {
        dp_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_hom_ref: int64, 
        n_het: int64, 
        n_hom_var: int64, 
        n_non_ref: int64, 
        n_singleton: int64, 
        n_snp: int64, 
        n_insertion: int64, 
        n_deletion: int64, 
        n_transition: int64, 
        n_transversion: int64, 
        n_star: int64, 
        r_ti_tv: float64, 
        r_het_hom_var: float64, 
        r_insertion_deletion: float64
    }
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        ... (fields identical to the info struct in the pre-QC describe() output above)
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

We can see that a new column field called sample_qc has been added to the MT. Note that the sample_qc metrics have not actually been calculated yet - Hail evaluates lazily, so the metrics are computed when we force evaluation with a .show().

This step took a long time to compute, over 20 minutes on a six-node cluster (mem2_ssd1_v2_x8 instances), so we won’t run it here.

%%timeit
qc_mt.cols().show()
2023-04-11 16:20:15 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.

    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
s  dp_stats.mean  dp_stats.stdev  dp_stats.min  dp_stats.max  gq_stats.mean  gq_stats.stdev  gq_stats.min  gq_stats.max  call_rate  n_called  n_not_called  n_filtered  n_hom_ref  n_het  n_hom_var  n_non_ref  n_singleton  n_snp  n_insertion  n_deletion  n_transition  n_transversion  n_star  r_ti_tv  r_het_hom_var  r_insertion_deletion
str float64 float64 float64 float64 float64 float64 float64 float64 float64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 int64 float64 float64 float64
"sample_1_0" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3369261 34252 0 3355042 9260 4959 14219 5 16921 997 1379 11724 5197 0 2.26e+00 1.87e+00 7.23e-01
"sample_1_1" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3370000 33513 0 3355843 9177 4980 14157 1 16877 949 1425 11640 5237 0 2.22e+00 1.84e+00 6.66e-01
"sample_1_10" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3371138 32375 0 3357104 9121 4913 14034 3 16724 982 1356 11596 5128 0 2.26e+00 1.86e+00 7.24e-01
"sample_1_100" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3369744 33769 0 3355741 9113 4890 14003 2 16635 1004 1368 11567 5068 0 2.28e+00 1.86e+00 7.34e-01
"sample_1_1000" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3369661 33852 0 3355634 9190 4837 14027 3 16604 995 1361 11540 5064 0 2.28e+00 1.90e+00 7.31e-01
"sample_1_10000" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3368966 34547 0 3354842 9237 4887 14124 0 16749 986 1387 11647 5102 0 2.28e+00 1.89e+00 7.11e-01
"sample_1_10001" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3368745 34768 0 3354682 9160 4903 14063 2 16732 1006 1337 11633 5099 0 2.28e+00 1.87e+00 7.52e-01
"sample_1_10002" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3368654 34859 0 3354586 9102 4966 14068 4 16792 973 1389 11634 5158 0 2.26e+00 1.83e+00 7.01e-01
"sample_1_10003" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3371010 32503 0 3356918 9221 4871 14092 3 16674 998 1413 11508 5166 0 2.23e+00 1.89e+00 7.06e-01
"sample_1_10004" NaN NaN NA NA NaN NaN NA NA 9.90e-01 3368964 34549 0 3354902 9205 4857 14062 2 16693 1029 1332 11563 5130 0 2.25e+00 1.90e+00 7.73e-01

showing top 10 rows

We see that there is a field in qc_mt.sample_qc called call_rate. When we want to do something with it, we can access it as qc_mt.sample_qc.call_rate or qc_mt["sample_qc"]["call_rate"] - for example, to filter samples on call rate, as sketched below.
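
A minimal sketch of such a filter (the 0.99 threshold is illustrative, not a recommendation):

# Keep only samples whose call rate exceeds 99%
filtered_mt = qc_mt.filter_cols(qc_mt.sample_qc.call_rate > 0.99)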

6.12 Create Sample QC Table and Save in an Apollo Database

Now that we have calculated our sample QC metrics, we can save them as a separate table. We do this mostly for convenience and speed.

In our case, this has already been done, so we’re just showing the process of storing a Hail Table in DNAX.

Note that we could also store the MatrixTable itself in DNAX with mt.write() (sketched below, using the database ID). Using DNAX is helpful because Hail can read the MatrixTable directly from DNAX.
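
A minimal sketch of that MatrixTable write, assuming db_uri holds the database ID found with dxpy as in the cell further below:

# Hypothetical: persist the full MatrixTable to DNAX, addressed by database ID
mt.write(f"dnax://{db_uri}/geno.mt")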

# Create Hail Table from MT
# note we use .cols() to access the column (sample) fields
qc_tb = qc_mt.cols()
# Define database and table name

# Note: It is recommended to only use lowercase letters for the database name.
# If uppercase lettering is used, the database name will be lowercased when creating the database.
db_name = "db_mt_test2"
tb_name = "sample_qc.ht"
# DON'T RUN in class
# Create database in DNAX

stmt = f"CREATE DATABASE IF NOT EXISTS {db_name} LOCATION 'dnax://'"
print(stmt)
spark.sql(stmt).show()
CREATE DATABASE IF NOT EXISTS db_mt_test2 LOCATION 'dnax://'

++
||
++
++

# DON'T RUN in class
# Store Table in DNAX
import dxpy

# find database ID of newly created database using a dxpy method
db_uri = dxpy.find_one_data_object(name=f"{db_name}", classname="database")['id']
url = f"dnax://{db_uri}/{tb_name}"

# Before this step, the Hail Table is just an object in memory. To persist it and be able to access 
# it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
# See https://hail.is/docs/0.2/hail.Table.html#hail.Table.write for additional documentation.
qc_tb.write(url) # Note: output should describe size of Table (i.e. number of rows, partitions)
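
Once written, the table can be read back directly from DNAX in a later session (for example, in the GWAS notebook). A minimal sketch:

# Read the persisted sample QC table back from DNAX
qc_tb = hl.read_table(url)
qc_tb.describe()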