7 GWAS using Hail

7.1 Learning Objectives of this Session

Import data into a Hail MatrixTable from pVCF and BGEN formats
Merge phenotypic data into the MatrixTable
QC and Filter Variant data for use in a GWAS using Hail
QC and Filter Sample data for use in GWAS with Hail
Run GWAS on a Hail MatrixTable
Annotate Hail MatrixTables with internal DNAnexus resources
Export results from Hail to CSV format

Suggested Configuration: mem2_ssd1_v1_x8 with at least 6 nodes (use Hail/VEP feature set)

This notebook shows how to perform a GWAS for 1 case–control trait using Firth’s logistic regression with Hail and save the results as a Hail Table to an Apollo database (dnax://) on the DNAnexus platform. See documentation for guidance on launch specs for the JupyterLab with Spark Cluster app for different data sizes: https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data

Note: For population scale data, samples may be referred to as individuals. In this notebook, the word “sample” will be used.

Pre-conditions for running this notebook successfully:

There is an existing Hail MatrixTable in DNAX (see pVCF Import Notebook and BGEN Import Notebook)
There is a Sample QC Hail Table in DNAX (see https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/sample_qc.ipynb)
There is a Variant QC Hail Table in DNAX (see https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/locus_qc.ipynb)
There is phenotypic data for the samples

7.2 Basic GWAS process

This table shows the basic GWAS process using Hail. We will cover each of these in the sections below.

Step	Description	Code Example
0.	Initiate Spark and Hail	See Notebook
1.	Load pVCF/BGEN data, save as MatrixTable (MT) Section 7.4	`mt = hl.import_vcf(path)` or `mt = hl.import_bgen(path)`
2.	Load Pheno file, merge with MatrixTable Section 7.5	`phenogeno_mt = mt.annotate_cols(**pheno_table[mt.s])`
3.	Build Sample QC table from MT, use to filter MT Section 7.6	`sample_qc_tb = hl.sample_qc(mt);` `qc_mt = phenogeno_mt.semi_join_rows(locus_qc_tb)`
4.	Build Locus QC table from MT, use to filter MT Section 7.7	`locus_qc_tb = hl.locus_qc(mt);` `qc_mt = qc_mt.semi_join_cols(sample_qc_tb)`
5.	Run GWAS Section 7.8	`gwas_tb = hl.logistic_regression.rows()`
6.	Visualize Results Section 7.9	`manhattan_plot = hl.plot.manhattan(gwas_tb.p_value)`; `show(manhattan_plot)`
7.	Annotate Results with VEP or annotation db Section 7.10	`ann_gwas_tb = db.annotate_rows_db(gwas_tb, 'gencode'`
8.	Save results to CSV, export chromosomes as BGEN file	See Notebook

7.3 Initialize Spark and Hail

Once you run the code below, make sure to open up another tab at https://JOB-URL:8081/jobs to look at the Spark UI. It is extremely helpful in understanding what’s going on behind the scenes.

# Running this cell will output a red-colored message- this is expected.
# The 'Welcome to Hail' message in the output will indicate that Hail is ready to use in the notebook.

from pyspark.sql import SparkSession
import hail as hl

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

db_name = "db_mt_test2"

pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jarRunning on Apache Spark version 2.4.4
SparkUI available at http://ip-172-31-34-200.ec2.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.78-b17627756568
LOGGING: writing to /opt/notebooks/hail-20230419-2230-0.2.78-b17627756568.log

7.4 Read Genotype MatrixTable

We’ll read in the MatrixTable we loaded in from the pVCF files.

See (supplementary_notebooks/01_pVCF_import.ipynb for more info)

This was the code used to load the VCF files:

path = "file:///mnt/project/Geno_Data/*.vcf.gz"

mt = hl.import_vcf(path, force_bgz=True,
                   reference_genome="GRCh38",
                   array_elements_required=False)

# read MT
mt_url="dnax://database-GQYV2Bj04bPg9X3KfFyj2jyY/geno.mt"
mt = hl.read_matrix_table(mt_url)

We can look at the row fields (Variant metadata) using .rows().show():

mt.rows().show()


					info
locus	alleles	rsid	qual	filters	AC	AF	AN	BaseQRankSum	ClippingRankSum	DB	DP	FS	InbreedingCoeff	MQ	MQRankSum	QD	ReadPosRankSum	SOR	VQSLOD	VQSR_culprit	VQSR_NEGATIVE_TRAIN_SITE	VQSR_POSITIVE_TRAIN_SITE	GQ_HIST_ALT	DP_HIST_ALT	AB_HIST_ALT	GQ_HIST_ALL	DP_HIST_ALL	AB_HIST_ALL	AC_AFR	AC_AMR	AC_ASJ	AC_EAS	AC_FIN	AC_NFE	AC_OTH	AC_SAS	AC_Male	AC_Female	AN_AFR	AN_AMR	AN_ASJ	AN_EAS	AN_FIN	AN_NFE	AN_OTH	AN_SAS	AN_Male	AN_Female	AF_AFR	AF_AMR	AF_ASJ	AF_EAS	AF_FIN	AF_NFE	AF_OTH	AF_SAS	AF_Male	AF_Female	GC_AFR	GC_AMR	GC_ASJ	GC_EAS	GC_FIN	GC_NFE	GC_OTH	GC_SAS	GC_Male	GC_Female	AC_raw	AN_raw	AF_raw	GC_raw	GC	Hom_AFR	Hom_AMR	Hom_ASJ	Hom_EAS	Hom_FIN	Hom_NFE	Hom_OTH	Hom_SAS	Hom_Male	Hom_Female	Hom_raw	Hom	STAR_AC	STAR_AC_raw	STAR_Hom	POPMAX	AC_POPMAX	AN_POPMAX	AF_POPMAX	DP_MEDIAN	DREF_MEDIAN	GQ_MEDIAN	AB_MEDIAN	AS_RF	AS_FilterStatus	AS_RF_POSITIVE_TRAIN	AS_RF_NEGATIVE_TRAIN	AC_AFR_Male	AC_AMR_Male	AC_ASJ_Male	AC_EAS_Male	AC_FIN_Male	AC_NFE_Male	AC_OTH_Male	AC_SAS_Male	AC_AFR_Female	AC_AMR_Female	AC_ASJ_Female	AC_EAS_Female	AC_FIN_Female	AC_NFE_Female	AC_OTH_Female	AC_SAS_Female	AN_AFR_Male	AN_AMR_Male	AN_ASJ_Male	AN_EAS_Male	AN_FIN_Male	AN_NFE_Male	AN_OTH_Male	AN_SAS_Male	AN_AFR_Female	AN_AMR_Female	AN_ASJ_Female	AN_EAS_Female	AN_FIN_Female	AN_NFE_Female	AN_OTH_Female	AN_SAS_Female	AF_AFR_Male	AF_AMR_Male	AF_ASJ_Male	AF_EAS_Male	AF_FIN_Male	AF_NFE_Male	AF_OTH_Male	AF_SAS_Male	AF_AFR_Female	AF_AMR_Female	AF_ASJ_Female	AF_EAS_Female	AF_FIN_Female	AF_NFE_Female	AF_OTH_Female	AF_SAS_Female	GC_AFR_Male	GC_AMR_Male	GC_ASJ_Male	GC_EAS_Male	GC_FIN_Male	GC_NFE_Male	GC_OTH_Male	GC_SAS_Male	GC_AFR_Female	GC_AMR_Female	GC_ASJ_Female	GC_EAS_Female	GC_FIN_Female	GC_NFE_Female	GC_OTH_Female	GC_SAS_Female	Hemi_AFR	Hemi_AMR	Hemi_ASJ	Hemi_EAS	Hemi_FIN	Hemi_NFE	Hemi_OTH	Hemi_SAS	Hemi	Hemi_raw	STAR_Hemi
locus<GRCh38>	array<str>	str	float64	set<str>	array<int32>	array<float64>	int32	float64	float64	bool	int32	float64	float64	float64	float64	float64	float64	float64	float64	str	bool	bool	array<str>	array<str>	array<str>	str	str	str	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	int32	int32	int32	int32	int32	int32	int32	int32	int32	int32	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	int32	array<float64>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	int32	int32	int32	array<str>	array<int32>	array<int32>	array<float64>	array<int32>	array<float64>	array<int32>	array<float64>	array<float64>	array<str>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	int32	int32	int32	int32	int32	int32	int32	int32	int32	int32	int32	int32	int32	int32	int32	int32	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<float64>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	array<int32>	int32
chr1:12198	["G","C"]	"chr1_12198_G_C"	9.88e+03	{"AC0"}	[0]	NA	0	0.00e+00	3.58e-01	False	9204	0.00e+00	9.80e-03	2.30e+01	7.36e-01	1.40e+01	7.36e-01	3.02e-01	1.01e+00	"MQ"	False	False	["14\|79\|1\|25\|7\|4\|0\|5\|2\|0\|0\|0\|1\|0\|0\|0\|0\|0\|0\|0"]	["131\|7\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	["0\|0\|0\|0\|1\|0\|2\|0\|2\|0\|10\|0\|1\|28\|0\|3\|0\|0\|0\|0"]	"1859\|505\|25\|28\|8\|4\|0\|5\|2\|0\|0\|0\|1\|0\|0\|0\|0\|0\|0\|0"	"2413\|24\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"	"0\|0\|0\|0\|1\|0\|2\|0\|2\|0\|10\|0\|1\|28\|0\|3\|0\|0\|0\|0"	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	0	0	0	0	0	0	0	0	0	0	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[229]	4874	[4.70e-02]	[2299,47,91]	[0,0,0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[91]	[0]	NA	NA	NA	NA	NA	NA	NA	[2]	[4.01e-06]	[6]	[6.67e-01]	[1.27e-01]	["AC0"]	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
chr1:12237	["G","A"]	"chr1_12237_G_A"	8.20e+01	{"RF","AC0"}	[0]	NA	0	7.36e-01	-3.58e-01	False	16096	0.00e+00	-1.44e-01	2.20e+01	-3.58e-01	4.31e+00	3.58e-01	1.29e+00	1.57e+00	"QD"	False	False	["0\|1\|0\|1\|1\|0\|0\|0\|2\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	["3\|2\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	["0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|1\|0\|2\|1\|0\|0\|0\|0\|0\|0"]	"3537\|1722\|177\|100\|15\|1\|0\|0\|2\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"	"5428\|126\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"	"0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|1\|0\|2\|1\|0\|0\|0\|0\|0\|0"	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	0	0	0	0	0	0	0	0	0	0	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[6]	11108	[5.40e-04]	[5549,4,1]	[0,0,0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[1]	[0]	NA	NA	NA	NA	NA	NA	NA	[4]	[4.01e-06]	[22]	[6.00e-01]	[8.80e-02]	["RF\|AC0"]	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
chr1:12259	["G","C"]	"chr1_12259_G_C"	3.74e+01	{"RF","AC0"}	[0]	[0.00e+00]	2	NA	NA	False	18814	0.00e+00	-1.42e-01	2.00e+01	NA	9.36e+00	NA	6.93e-01	2.24e+00	"MQ"	False	False	["0\|1\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	["1\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	["0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	"3885\|1993\|253\|143\|22\|3\|1\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"	"6117\|182\|1\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"	"0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	0	0	0	0	0	0	0	2	2	0	NA	NA	NA	NA	NA	NA	NA	[0.00e+00]	[0.00e+00]	NA	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[1,0,0]	[1,0,0]	[0,0,0]	[2]	12600	[1.59e-04]	[6299,0,1]	[1,0,0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[1]	[0]	NA	NA	NA	NA	NA	NA	NA	[2]	[1.01e-05]	[6]	[4.62e-01]	[4.93e-02]	["RF\|AC0"]	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
chr1:12266	["G","A"]	"chr1_12266_G_A"	2.72e+03	{"AC0"}	[0]	NA	0	8.04e-01	0.00e+00	False	18372	0.00e+00	-1.38e-01	2.23e+01	-7.27e-01	1.61e+01	4.06e-01	2.03e-01	1.27e+00	"MQ"	False	False	["0\|15\|0\|0\|0\|2\|0\|2\|5\|0\|1\|1\|1\|3\|0\|6\|0\|0\|0\|0"]	["33\|3\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	["0\|0\|0\|0\|0\|2\|0\|0\|2\|0\|9\|0\|0\|8\|0\|0\|0\|0\|0\|0"]	"3755\|1684\|188\|101\|11\|4\|0\|2\|5\|0\|1\|1\|1\|3\|0\|6\|0\|0\|0\|0"	"5598\|163\|1\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"	"0\|0\|0\|0\|0\|2\|0\|0\|2\|0\|9\|0\|0\|8\|0\|0\|0\|0\|0\|0"	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	0	0	0	0	0	0	0	0	0	0	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[51]	11524	[4.43e-03]	[5726,21,15]	[0,0,0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[15]	[0]	NA	NA	NA	NA	NA	NA	NA	[3]	[1.19e-08]	[39]	[5.00e-01]	[2.37e-01]	["AC0"]	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
chr1:12272	["G","A"]	"chr1_12272_G_A"	2.71e+03	{"AC0"}	[0]	[0.00e+00]	2	8.04e-01	3.58e-01	False	17685	0.00e+00	-1.34e-01	2.23e+01	0.00e+00	1.69e+01	0.00e+00	1.96e-01	3.60e-02	"MQ"	False	False	["0\|14\|0\|0\|0\|2\|0\|2\|5\|0\|1\|1\|1\|3\|0\|6\|0\|0\|0\|0"]	["31\|4\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	["0\|0\|0\|0\|0\|0\|1\|0\|3\|0\|10\|0\|0\|7\|0\|0\|0\|0\|0\|0"]	"3715\|1603\|160\|87\|9\|2\|1\|2\|5\|0\|1\|1\|1\|3\|0\|6\|0\|0\|0\|0"	"5448\|147\|1\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"	"0\|0\|0\|0\|0\|0\|1\|0\|3\|0\|10\|0\|0\|7\|0\|0\|0\|0\|0\|0"	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	0	2	0	0	0	0	0	0	0	2	NA	[0.00e+00]	NA	NA	NA	NA	NA	NA	NA	[0.00e+00]	[0,0,0]	[1,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[0,0,0]	[1,0,0]	[49]	11192	[4.38e-03]	[5561,21,14]	[1,0,0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[14]	[0]	NA	NA	NA	NA	NA	NA	NA	[3]	[7.94e-09]	[39]	[5.00e-01]	[2.32e-01]	["AC0"]	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
chr1:12554	["A","G"]	"chr1_12554_A_G"	6.81e+01	{"RF","AC0"}	[0]	[0.00e+00]	3022	1.06e+00	0.00e+00	False	138436	0.00e+00	-8.07e-02	2.25e+01	3.22e-01	6.20e-01	-6.70e-02	1.88e-01	1.94e+00	"QD"	False	False	["0\|2\|0\|1\|1\|1\|3\|1\|2\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	["3\|6\|2\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	["0\|0\|0\|2\|2\|2\|0\|0\|2\|0\|1\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	"4802\|6590\|2323\|3118\|1771\|582\|666\|341\|111\|143\|70\|21\|22\|8\|10\|4\|4\|0\|0\|0"	"13350\|5670\|1233\|271\|48\|14\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"	"0\|0\|0\|2\|2\|2\|0\|0\|2\|0\|1\|0\|0\|0\|0\|0\|0\|0\|0\|0"	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	374	564	22	378	16	542	84	1042	1786	1236	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[187,0,0]	[282,0,0]	[11,0,0]	[189,0,0]	[8,0,0]	[271,0,0]	[42,0,0]	[521,0,0]	[893,0,0]	[618,0,0]	[13]	41172	[3.16e-04]	[20575,9,2]	[1511,0,0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[2]	[0]	NA	NA	NA	NA	NA	NA	NA	[7]	[3.98e-04]	[32]	[2.86e-01]	[8.10e-03]	["RF\|AC0"]	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
chr1:12559	["G","A"]	"chr1_12559_G_A"	1.67e+03	{"RF"}	[15]	[5.08e-03]	2952	1.50e+00	-1.18e-01	False	139640	0.00e+00	-8.47e-02	2.24e+01	-1.98e-01	1.65e+00	0.00e+00	1.74e-01	2.13e+00	"QD"	False	False	["3\|8\|7\|16\|9\|13\|13\|23\|10\|3\|1\|4\|2\|0\|1\|0\|0\|0\|0\|2"]	["30\|56\|28\|1\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	["0\|0\|5\|10\|17\|24\|12\|4\|12\|0\|11\|0\|0\|9\|0\|2\|0\|0\|0\|0"]	"5100\|6458\|2253\|3094\|1723\|523\|669\|372\|104\|139\|59\|21\|24\|5\|9\|1\|3\|0\|0\|2"	"13224\|5716\|1280\|271\|52\|16\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"	"0\|0\|5\|10\|17\|24\|12\|4\|12\|0\|11\|0\|0\|9\|0\|2\|0\|0\|0\|0"	[0]	[3]	[0]	[1]	[0]	[1]	[1]	[9]	[9]	[6]	426	552	22	376	18	564	82	912	1742	1210	[0.00e+00]	[5.43e-03]	[0.00e+00]	[2.66e-03]	[0.00e+00]	[1.77e-03]	[1.22e-02]	[9.87e-03]	[5.17e-03]	[4.96e-03]	[213,0,0]	[273,3,0]	[11,0,0]	[187,1,0]	[9,0,0]	[281,1,0]	[40,1,0]	[447,9,0]	[862,9,0]	[599,6,0]	[124]	41118	[3.02e-03]	[20444,106,9]	[1461,15,0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[9]	[0]	NA	NA	NA	["SAS"]	[9]	[912]	[9.87e-03]	[7]	[3.16e-04]	[30]	[2.86e-01]	[7.73e-03]	["RF"]	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
chr1:12573	["T","C"]	"chr1_12573_T_C"	3.67e+02	{}	[2]	[4.80e-04]	4168	1.56e-01	-3.87e-01	False	159660	0.00e+00	-7.56e-02	2.20e+01	-7.20e-01	4.03e+00	4.06e-01	5.91e-01	3.25e+00	"QD"	False	False	["0\|0\|0\|2\|0\|0\|1\|1\|0\|0\|0\|1\|0\|0\|0\|0\|0\|0\|0\|2"]	["2\|3\|1\|1\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	["0\|0\|0\|0\|0\|1\|1\|0\|0\|1\|1\|0\|1\|1\|0\|1\|0\|0\|0\|0"]	"4681\|6449\|2465\|3412\|2043\|675\|838\|436\|128\|204\|107\|32\|44\|1\|14\|0\|6\|1\|1\|2"	"12932\|6401\|1677\|420\|84\|22\|3\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"	"0\|0\|0\|0\|0\|1\|1\|0\|0\|1\|1\|0\|1\|1\|0\|1\|0\|0\|0\|0"	[0]	[2]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[2]	622	716	44	528	36	850	124	1248	2348	1820	[0.00e+00]	[2.79e-03]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[1.10e-03]	[311,0,0]	[356,2,0]	[22,0,0]	[264,0,0]	[18,0,0]	[425,0,0]	[62,0,0]	[624,0,0]	[1174,0,0]	[908,2,0]	[7]	43078	[1.62e-04]	[21532,7,0]	[2082,2,0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	NA	NA	NA	["AMR"]	[2]	[716]	[2.79e-03]	[8]	[1.54e-07]	[36]	[5.33e-01]	[2.70e-01]	["PASS"]	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
chr1:12586	["C","T"]	"chr1_12586_C_T"	2.24e+02	{}	[2]	[5.63e-04]	3550	2.93e+00	-1.24e-01	False	149739	0.00e+00	-7.86e-02	2.20e+01	2.89e-01	5.74e+00	2.89e-01	5.57e-01	2.25e+00	"MQ"	False	False	["0\|0\|0\|0\|0\|0\|1\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|2"]	["0\|1\|1\|1\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	["0\|0\|0\|0\|0\|0\|0\|1\|1\|0\|0\|0\|0\|0\|1\|0\|0\|0\|0\|0"]	"4597\|6767\|2443\|3466\|1994\|601\|758\|390\|126\|150\|87\|21\|20\|5\|8\|4\|3\|0\|0\|2"	"13424\|6201\|1423\|326\|52\|15\|1\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"	"0\|0\|0\|0\|0\|0\|0\|1\|1\|0\|0\|0\|0\|0\|1\|0\|0\|0\|0\|0"	[0]	[0]	[0]	[1]	[0]	[0]	[0]	[1]	[0]	[2]	530	606	34	434	34	692	106	1114	2050	1500	[0.00e+00]	[0.00e+00]	[0.00e+00]	[2.30e-03]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[8.98e-04]	[0.00e+00]	[1.33e-03]	[265,0,0]	[303,0,0]	[17,0,0]	[216,1,0]	[17,0,0]	[346,0,0]	[53,0,0]	[556,1,0]	[1025,0,0]	[748,2,0]	[3]	42884	[7.00e-05]	[21439,3,0]	[1773,2,0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	NA	NA	NA	["EAS"]	[1]	[434]	[2.30e-03]	[13]	[5.01e-12]	[98]	[4.21e-01]	[3.83e-01]	["PASS"]	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
chr1:12596	["C","A"]	"chr1_12596_C_A"	4.48e+01	{"AC0"}	[0]	[0.00e+00]	2932	2.22e+00	3.61e-01	False	140242	0.00e+00	-8.29e-02	2.20e+01	0.00e+00	4.97e+00	3.61e-01	8.92e-01	2.41e+00	"MQ"	False	False	["0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|1\|0\|0\|0"]	["0\|1\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"]	["0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|1\|0\|0\|0\|0\|0\|0\|0\|0"]	"4727\|7110\|2504\|3416\|1817\|550\|609\|333\|79\|110\|46\|6\|19\|4\|0\|3\|1\|0\|0\|0"	"13901\|5942\|1199\|251\|36\|5\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0"	"0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|0\|1\|0\|0\|0\|0\|0\|0\|0\|0"	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	490	436	22	360	22	602	78	922	1670	1262	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[0.00e+00]	[245,0,0]	[218,0,0]	[11,0,0]	[180,0,0]	[11,0,0]	[301,0,0]	[39,0,0]	[461,0,0]	[835,0,0]	[631,0,0]	[1]	42668	[2.34e-05]	[21333,1,0]	[1466,0,0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	[0]	NA	NA	NA	NA	NA	NA	NA	[9]	[2.00e-11]	[80]	[5.56e-01]	[4.03e-01]	["AC0"]	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

showing top 10 rows

If we do a .describe(), we will see more info about the row fields.

There is only one column field, s, which is the key that maps to the Sample ID. We will add more column fields when we merge in the Pheno Data in the next section.

# View structure of MT before adding pheno data

mt.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        BaseQRankSum: float64, 
        ClippingRankSum: float64, 
        DB: bool, 
        DP: int32, 
        FS: float64, 
        InbreedingCoeff: float64, 
        MQ: float64, 
        MQRankSum: float64, 
        QD: float64, 
        ReadPosRankSum: float64, 
        SOR: float64, 
        VQSLOD: float64, 
        VQSR_culprit: str, 
        VQSR_NEGATIVE_TRAIN_SITE: bool, 
        VQSR_POSITIVE_TRAIN_SITE: bool, 
        GQ_HIST_ALT: array<str>, 
        DP_HIST_ALT: array<str>, 
        AB_HIST_ALT: array<str>, 
        GQ_HIST_ALL: str, 
        DP_HIST_ALL: str, 
        AB_HIST_ALL: str, 
        AC_AFR: array<int32>, 
        AC_AMR: array<int32>, 
        AC_ASJ: array<int32>, 
        AC_EAS: array<int32>, 
        AC_FIN: array<int32>, 
        AC_NFE: array<int32>, 
        AC_OTH: array<int32>, 
        AC_SAS: array<int32>, 
        AC_Male: array<int32>, 
        AC_Female: array<int32>, 
        AN_AFR: int32, 
        AN_AMR: int32, 
        AN_ASJ: int32, 
        AN_EAS: int32, 
        AN_FIN: int32, 
        AN_NFE: int32, 
        AN_OTH: int32, 
        AN_SAS: int32, 
        AN_Male: int32, 
        AN_Female: int32, 
        AF_AFR: array<float64>, 
        AF_AMR: array<float64>, 
        AF_ASJ: array<float64>, 
        AF_EAS: array<float64>, 
        AF_FIN: array<float64>, 
        AF_NFE: array<float64>, 
        AF_OTH: array<float64>, 
        AF_SAS: array<float64>, 
        AF_Male: array<float64>, 
        AF_Female: array<float64>, 
        GC_AFR: array<int32>, 
        GC_AMR: array<int32>, 
        GC_ASJ: array<int32>, 
        GC_EAS: array<int32>, 
        GC_FIN: array<int32>, 
        GC_NFE: array<int32>, 
        GC_OTH: array<int32>, 
        GC_SAS: array<int32>, 
        GC_Male: array<int32>, 
        GC_Female: array<int32>, 
        AC_raw: array<int32>, 
        AN_raw: int32, 
        AF_raw: array<float64>, 
        GC_raw: array<int32>, 
        GC: array<int32>, 
        Hom_AFR: array<int32>, 
        Hom_AMR: array<int32>, 
        Hom_ASJ: array<int32>, 
        Hom_EAS: array<int32>, 
        Hom_FIN: array<int32>, 
        Hom_NFE: array<int32>, 
        Hom_OTH: array<int32>, 
        Hom_SAS: array<int32>, 
        Hom_Male: array<int32>, 
        Hom_Female: array<int32>, 
        Hom_raw: array<int32>, 
        Hom: array<int32>, 
        STAR_AC: int32, 
        STAR_AC_raw: int32, 
        STAR_Hom: int32, 
        POPMAX: array<str>, 
        AC_POPMAX: array<int32>, 
        AN_POPMAX: array<int32>, 
        AF_POPMAX: array<float64>, 
        DP_MEDIAN: array<int32>, 
        DREF_MEDIAN: array<float64>, 
        GQ_MEDIAN: array<int32>, 
        AB_MEDIAN: array<float64>, 
        AS_RF: array<float64>, 
        AS_FilterStatus: array<str>, 
        AS_RF_POSITIVE_TRAIN: array<int32>, 
        AS_RF_NEGATIVE_TRAIN: array<int32>, 
        AC_AFR_Male: array<int32>, 
        AC_AMR_Male: array<int32>, 
        AC_ASJ_Male: array<int32>, 
        AC_EAS_Male: array<int32>, 
        AC_FIN_Male: array<int32>, 
        AC_NFE_Male: array<int32>, 
        AC_OTH_Male: array<int32>, 
        AC_SAS_Male: array<int32>, 
        AC_AFR_Female: array<int32>, 
        AC_AMR_Female: array<int32>, 
        AC_ASJ_Female: array<int32>, 
        AC_EAS_Female: array<int32>, 
        AC_FIN_Female: array<int32>, 
        AC_NFE_Female: array<int32>, 
        AC_OTH_Female: array<int32>, 
        AC_SAS_Female: array<int32>, 
        AN_AFR_Male: int32, 
        AN_AMR_Male: int32, 
        AN_ASJ_Male: int32, 
        AN_EAS_Male: int32, 
        AN_FIN_Male: int32, 
        AN_NFE_Male: int32, 
        AN_OTH_Male: int32, 
        AN_SAS_Male: int32, 
        AN_AFR_Female: int32, 
        AN_AMR_Female: int32, 
        AN_ASJ_Female: int32, 
        AN_EAS_Female: int32, 
        AN_FIN_Female: int32, 
        AN_NFE_Female: int32, 
        AN_OTH_Female: int32, 
        AN_SAS_Female: int32, 
        AF_AFR_Male: array<float64>, 
        AF_AMR_Male: array<float64>, 
        AF_ASJ_Male: array<float64>, 
        AF_EAS_Male: array<float64>, 
        AF_FIN_Male: array<float64>, 
        AF_NFE_Male: array<float64>, 
        AF_OTH_Male: array<float64>, 
        AF_SAS_Male: array<float64>, 
        AF_AFR_Female: array<float64>, 
        AF_AMR_Female: array<float64>, 
        AF_ASJ_Female: array<float64>, 
        AF_EAS_Female: array<float64>, 
        AF_FIN_Female: array<float64>, 
        AF_NFE_Female: array<float64>, 
        AF_OTH_Female: array<float64>, 
        AF_SAS_Female: array<float64>, 
        GC_AFR_Male: array<int32>, 
        GC_AMR_Male: array<int32>, 
        GC_ASJ_Male: array<int32>, 
        GC_EAS_Male: array<int32>, 
        GC_FIN_Male: array<int32>, 
        GC_NFE_Male: array<int32>, 
        GC_OTH_Male: array<int32>, 
        GC_SAS_Male: array<int32>, 
        GC_AFR_Female: array<int32>, 
        GC_AMR_Female: array<int32>, 
        GC_ASJ_Female: array<int32>, 
        GC_EAS_Female: array<int32>, 
        GC_FIN_Female: array<int32>, 
        GC_NFE_Female: array<int32>, 
        GC_OTH_Female: array<int32>, 
        GC_SAS_Female: array<int32>, 
        Hemi_AFR: array<int32>, 
        Hemi_AMR: array<int32>, 
        Hemi_ASJ: array<int32>, 
        Hemi_EAS: array<int32>, 
        Hemi_FIN: array<int32>, 
        Hemi_NFE: array<int32>, 
        Hemi_OTH: array<int32>, 
        Hemi_SAS: array<int32>, 
        Hemi: array<int32>, 
        Hemi_raw: array<int32>, 
        STAR_Hemi: int32
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

We can look at the first few rows and columns of the genotyping table with mt.GT.show(). Note that it includes the locus/alleles column. This is because the row key corresponds to a combination of both locus and alleles.

mt.GT.show()


		'sample_1_0'	'sample_1_1'	'sample_1_2'	'sample_1_3'
locus	alleles	GT	GT	GT	GT
locus<GRCh38>	array<str>	call	call	call	call
chr1:12198	["G","C"]	0/1	0/0	1/1	0/0
chr1:12237	["G","A"]	1/1	0/0	0/1	0/0
chr1:12259	["G","C"]	1/1	0/0	0/1	0/0
chr1:12266	["G","A"]	0/1	1/1	0/1	0/0
chr1:12272	["G","A"]	0/0	0/1	0/0	0/1
chr1:12554	["A","G"]	0/0	0/0	0/0	0/0
chr1:12559	["G","A"]	0/0	0/0	0/0	0/0
chr1:12573	["T","C"]	0/0	0/0	0/0	0/0
chr1:12586	["C","T"]	0/0	0/0	0/0	0/0
chr1:12596	["C","A"]	0/0	0/0	0/0	0/0

showing top 10 rows

showing the first 4 of 100000 columns

7.5 Load and Merge Pheno Table

Phenotypic data may come from an array of sources, such as a cohort from the Cohort Browser or a separate, stand-alone text file. In this notebook, we use phenotypic data from a CSV file, which was previously uploaded to a project. In this (very basic) example we use the phenotypic trait, is_case, for each sample. The values indicate if the sample is a case, is_case=true, or a control, is_case=false

Note: Although this notebook provides an example using a stand-alone CSV file, another source of phenotypic data could be derived from an Apollo Dataset. Please refer to Table Exporter documentation (https://documentation.dnanexus.com/developer/apps/developing-spark-apps/table-exporter-application) or OpenBio notebooks (https://github.com/dnanexus/OpenBio/tree/master/dx-toolkit) with extract_dataset in the title for guidance on how to extract data from an Apollo Dataset.

All data uploaded to the project before running the JupyterLab app is mounted (https://documentation.dnanexus.com/user/jupyter-notebooks?#accessing-data) and can be accessed in /mnt/project/<path_to_data>. The file URL follows the format: file:///mnt/project/<path_to_data>

# Import the pheno CSV file as a Hail Table

pheno_table = hl.import_table("file:///mnt/project/data/ukbb_100k_bmi_casecontrol.csv",
                              delimiter=',',
                              impute=True,
                              key='iid') # specify the column that will be the key (values must match what is in the MT 's' column)

pheno_table.show()

2023-04-19 22:30:24 Hail: INFO: Reading table to impute column types
2023-04-19 22:30:29 Hail: INFO: Finished type imputation
  Loading field 'fid' as type str (imputed)
  Loading field 'iid' as type str (imputed)
  Loading field 'father_iid' as type int32 (imputed)
  Loading field 'mother_iid' as type int32 (imputed)
  Loading field 'sex_code' as type int32 (imputed)
  Loading field 'case_control_status' as type int32 (imputed)


fid	iid	father_iid	mother_iid	sex_code	case_control_status
str	str	int32	int32	int32	int32
"sample_1_10"	"sample_1_10"	0	0	1	2
"sample_1_10003"	"sample_1_10003"	0	0	2	2
"sample_1_10014"	"sample_1_10014"	0	0	2	1
"sample_1_10020"	"sample_1_10020"	0	0	1	2
"sample_1_10021"	"sample_1_10021"	0	0	2	1
"sample_1_10023"	"sample_1_10023"	0	0	1	1
"sample_1_10036"	"sample_1_10036"	0	0	2	1
"sample_1_10037"	"sample_1_10037"	0	0	1	1
"sample_1_10047"	"sample_1_10047"	0	0	1	1
"sample_1_10056"	"sample_1_10056"	0	0	1	2

showing top 10 rows

One thing I always do when I’m loading in a Hail Table is use .summarize() on the entire Table. This gives me a high level overview of the Table.

pheno_table.summarize()

2023-04-19 18:58:14 Hail: INFO: Coerced sorted dataset

19884 records.

fid (str):

We need to modify case_control_status in our table, because it needs to be 0/1 instead of 1/2 to work with Hail’s .logistic_regression() method.

We can do this with .annotate(), which will add a column to our table. We will subtract 1 from both case_control_status and sex_code.

1pheno_fixed = pheno_table.annotate(ccs = pheno_table.case_control_status - 1,
2                                  sc = pheno_table.sex_code - 1)
pheno_fixed.show()

1: Subtract 1 from the Case Control Status
2: Subtract 2 from the Sex Code

2023-04-19 22:30:36 Hail: INFO: Coerced sorted dataset


fid	iid	father_iid	mother_iid	sex_code	case_control_status	ccs	sc
str	str	int32	int32	int32	int32	int32	int32
"sample_1_10"	"sample_1_10"	0	0	1	2	1	0
"sample_1_10003"	"sample_1_10003"	0	0	2	2	1	1
"sample_1_10014"	"sample_1_10014"	0	0	2	1	0	1
"sample_1_10020"	"sample_1_10020"	0	0	1	2	1	0
"sample_1_10021"	"sample_1_10021"	0	0	2	1	0	1
"sample_1_10023"	"sample_1_10023"	0	0	1	1	0	0
"sample_1_10036"	"sample_1_10036"	0	0	2	1	0	1
"sample_1_10037"	"sample_1_10037"	0	0	1	1	0	0
"sample_1_10047"	"sample_1_10047"	0	0	1	1	0	0
"sample_1_10056"	"sample_1_10056"	0	0	1	2	1	0

showing top 10 rows

If we .describe(), we’ll see that we added two row fields: ccs and sc.

# View structure of pheno Table

pheno_fixed.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'fid': str 
    'iid': str 
    'father_iid': int32 
    'mother_iid': int32 
    'sex_code': int32 
    'case_control_status': int32 
    'ccs': int32 
    'sc': int32 
----------------------------------------
Key: ['iid']
----------------------------------------

pheno_fixed.show(10)

2023-04-11 18:45:20 Hail: INFO: Coerced sorted dataset


fid	iid	father_iid	mother_iid	sex_code	case_control_status	ccs	sc
str	str	int32	int32	int32	int32	int32	int32
"sample_1_10"	"sample_1_10"	0	0	1	2	1	0
"sample_1_10003"	"sample_1_10003"	0	0	2	2	1	1
"sample_1_10014"	"sample_1_10014"	0	0	2	1	0	1
"sample_1_10020"	"sample_1_10020"	0	0	1	2	1	0
"sample_1_10021"	"sample_1_10021"	0	0	2	1	0	1
"sample_1_10023"	"sample_1_10023"	0	0	1	1	0	0
"sample_1_10036"	"sample_1_10036"	0	0	2	1	0	1
"sample_1_10037"	"sample_1_10037"	0	0	1	1	0	0
"sample_1_10047"	"sample_1_10047"	0	0	1	1	0	0
"sample_1_10056"	"sample_1_10056"	0	0	1	2	1	0

showing top 10 rows

7.5.1 Annotate MT with pheno Table

# Annotate the MT with pheno Table by matching the MT's column key ('s') with the pheno Table's key ('sample_id')

phenogeno_mt = mt.annotate_cols(**pheno_fixed[mt.s])

If we look at what happened in the merge, we will see that there are missing values in the pheno data. That’s because the pheno file only had 20000 rows. This means we’ll drop a large amount of the population when we actually run the GWAS, because they have no case/control status nor do they have covariates.

phenogeno_mt.col.show(20)


s	fid	father_iid	mother_iid	sex_code	case_control_status	ccs	sc
str	str	int32	int32	int32	int32	int32	int32
"sample_1_0"	NA	NA	NA	NA	NA	NA	NA
"sample_1_1"	NA	NA	NA	NA	NA	NA	NA
"sample_1_2"	NA	NA	NA	NA	NA	NA	NA
"sample_1_3"	"sample_1_3"	0	0	2	1	0	1
"sample_1_4"	NA	NA	NA	NA	NA	NA	NA
"sample_1_5"	NA	NA	NA	NA	NA	NA	NA
"sample_1_6"	NA	NA	NA	NA	NA	NA	NA
"sample_1_7"	NA	NA	NA	NA	NA	NA	NA
"sample_1_8"	NA	NA	NA	NA	NA	NA	NA
"sample_1_9"	NA	NA	NA	NA	NA	NA	NA
"sample_1_10"	"sample_1_10"	0	0	1	2	1	0
"sample_1_11"	NA	NA	NA	NA	NA	NA	NA
"sample_1_12"	NA	NA	NA	NA	NA	NA	NA
"sample_1_13"	NA	NA	NA	NA	NA	NA	NA
"sample_1_14"	NA	NA	NA	NA	NA	NA	NA
"sample_1_15"	NA	NA	NA	NA	NA	NA	NA
"sample_1_16"	NA	NA	NA	NA	NA	NA	NA
"sample_1_17"	"sample_1_17"	0	0	2	1	0	1
"sample_1_18"	NA	NA	NA	NA	NA	NA	NA
"sample_1_19"	NA	NA	NA	NA	NA	NA	NA

showing top 20 rows

After merging in the pheno table, we see that we’ve added a number of fields to Column Fields. This includes fid, sex_code, case_control_status and the columns we derived from them.

# View structure of MT after annotating with pheno Table

phenogeno_mt.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'fid': str
    'father_iid': int32
    'mother_iid': int32
    'sex_code': int32
    'case_control_status': int32
    'ccs': int32
    'sc': int32
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        BaseQRankSum: float64, 
        ClippingRankSum: float64, 
        DB: bool, 
        DP: int32, 
        FS: float64, 
        InbreedingCoeff: float64, 
        MQ: float64, 
        MQRankSum: float64, 
        QD: float64, 
        ReadPosRankSum: float64, 
        SOR: float64, 
        VQSLOD: float64, 
        VQSR_culprit: str, 
        VQSR_NEGATIVE_TRAIN_SITE: bool, 
        VQSR_POSITIVE_TRAIN_SITE: bool, 
        GQ_HIST_ALT: array<str>, 
        DP_HIST_ALT: array<str>, 
        AB_HIST_ALT: array<str>, 
        GQ_HIST_ALL: str, 
        DP_HIST_ALL: str, 
        AB_HIST_ALL: str, 
        AC_AFR: array<int32>, 
        AC_AMR: array<int32>, 
        AC_ASJ: array<int32>, 
        AC_EAS: array<int32>, 
        AC_FIN: array<int32>, 
        AC_NFE: array<int32>, 
        AC_OTH: array<int32>, 
        AC_SAS: array<int32>, 
        AC_Male: array<int32>, 
        AC_Female: array<int32>, 
        AN_AFR: int32, 
        AN_AMR: int32, 
        AN_ASJ: int32, 
        AN_EAS: int32, 
        AN_FIN: int32, 
        AN_NFE: int32, 
        AN_OTH: int32, 
        AN_SAS: int32, 
        AN_Male: int32, 
        AN_Female: int32, 
        AF_AFR: array<float64>, 
        AF_AMR: array<float64>, 
        AF_ASJ: array<float64>, 
        AF_EAS: array<float64>, 
        AF_FIN: array<float64>, 
        AF_NFE: array<float64>, 
        AF_OTH: array<float64>, 
        AF_SAS: array<float64>, 
        AF_Male: array<float64>, 
        AF_Female: array<float64>, 
        GC_AFR: array<int32>, 
        GC_AMR: array<int32>, 
        GC_ASJ: array<int32>, 
        GC_EAS: array<int32>, 
        GC_FIN: array<int32>, 
        GC_NFE: array<int32>, 
        GC_OTH: array<int32>, 
        GC_SAS: array<int32>, 
        GC_Male: array<int32>, 
        GC_Female: array<int32>, 
        AC_raw: array<int32>, 
        AN_raw: int32, 
        AF_raw: array<float64>, 
        GC_raw: array<int32>, 
        GC: array<int32>, 
        Hom_AFR: array<int32>, 
        Hom_AMR: array<int32>, 
        Hom_ASJ: array<int32>, 
        Hom_EAS: array<int32>, 
        Hom_FIN: array<int32>, 
        Hom_NFE: array<int32>, 
        Hom_OTH: array<int32>, 
        Hom_SAS: array<int32>, 
        Hom_Male: array<int32>, 
        Hom_Female: array<int32>, 
        Hom_raw: array<int32>, 
        Hom: array<int32>, 
        STAR_AC: int32, 
        STAR_AC_raw: int32, 
        STAR_Hom: int32, 
        POPMAX: array<str>, 
        AC_POPMAX: array<int32>, 
        AN_POPMAX: array<int32>, 
        AF_POPMAX: array<float64>, 
        DP_MEDIAN: array<int32>, 
        DREF_MEDIAN: array<float64>, 
        GQ_MEDIAN: array<int32>, 
        AB_MEDIAN: array<float64>, 
        AS_RF: array<float64>, 
        AS_FilterStatus: array<str>, 
        AS_RF_POSITIVE_TRAIN: array<int32>, 
        AS_RF_NEGATIVE_TRAIN: array<int32>, 
        AC_AFR_Male: array<int32>, 
        AC_AMR_Male: array<int32>, 
        AC_ASJ_Male: array<int32>, 
        AC_EAS_Male: array<int32>, 
        AC_FIN_Male: array<int32>, 
        AC_NFE_Male: array<int32>, 
        AC_OTH_Male: array<int32>, 
        AC_SAS_Male: array<int32>, 
        AC_AFR_Female: array<int32>, 
        AC_AMR_Female: array<int32>, 
        AC_ASJ_Female: array<int32>, 
        AC_EAS_Female: array<int32>, 
        AC_FIN_Female: array<int32>, 
        AC_NFE_Female: array<int32>, 
        AC_OTH_Female: array<int32>, 
        AC_SAS_Female: array<int32>, 
        AN_AFR_Male: int32, 
        AN_AMR_Male: int32, 
        AN_ASJ_Male: int32, 
        AN_EAS_Male: int32, 
        AN_FIN_Male: int32, 
        AN_NFE_Male: int32, 
        AN_OTH_Male: int32, 
        AN_SAS_Male: int32, 
        AN_AFR_Female: int32, 
        AN_AMR_Female: int32, 
        AN_ASJ_Female: int32, 
        AN_EAS_Female: int32, 
        AN_FIN_Female: int32, 
        AN_NFE_Female: int32, 
        AN_OTH_Female: int32, 
        AN_SAS_Female: int32, 
        AF_AFR_Male: array<float64>, 
        AF_AMR_Male: array<float64>, 
        AF_ASJ_Male: array<float64>, 
        AF_EAS_Male: array<float64>, 
        AF_FIN_Male: array<float64>, 
        AF_NFE_Male: array<float64>, 
        AF_OTH_Male: array<float64>, 
        AF_SAS_Male: array<float64>, 
        AF_AFR_Female: array<float64>, 
        AF_AMR_Female: array<float64>, 
        AF_ASJ_Female: array<float64>, 
        AF_EAS_Female: array<float64>, 
        AF_FIN_Female: array<float64>, 
        AF_NFE_Female: array<float64>, 
        AF_OTH_Female: array<float64>, 
        AF_SAS_Female: array<float64>, 
        GC_AFR_Male: array<int32>, 
        GC_AMR_Male: array<int32>, 
        GC_ASJ_Male: array<int32>, 
        GC_EAS_Male: array<int32>, 
        GC_FIN_Male: array<int32>, 
        GC_NFE_Male: array<int32>, 
        GC_OTH_Male: array<int32>, 
        GC_SAS_Male: array<int32>, 
        GC_AFR_Female: array<int32>, 
        GC_AMR_Female: array<int32>, 
        GC_ASJ_Female: array<int32>, 
        GC_EAS_Female: array<int32>, 
        GC_FIN_Female: array<int32>, 
        GC_NFE_Female: array<int32>, 
        GC_OTH_Female: array<int32>, 
        GC_SAS_Female: array<int32>, 
        Hemi_AFR: array<int32>, 
        Hemi_AMR: array<int32>, 
        Hemi_ASJ: array<int32>, 
        Hemi_EAS: array<int32>, 
        Hemi_FIN: array<int32>, 
        Hemi_NFE: array<int32>, 
        Hemi_OTH: array<int32>, 
        Hemi_SAS: array<int32>, 
        Hemi: array<int32>, 
        Hemi_raw: array<int32>, 
        STAR_Hemi: int32
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

We see that the pheno traits have been added in the column fields of the MT

7.6 Filter MT using Sample QC Tables

7.6.1 Filter sample QC Table

The above figure illustrates the basic process for filtering on Sample QC metrics. We first need to build the sample_qc table called pre_sample_qc_tb. Then we will extract the row fields using the .row() accessor and write it to DNAX using pre_sample_qc_table.write(). (see https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/sample_qc.ipynb for more info)

This table was built using:

pre_sample_qc_tb = hl.sample_qc(mt).cols()

# Define sample QC Table url

sample_qc_url = "dnax://database-GQYV2Bj04bPg9X3KfFyj2jyY/sample_qc.ht"

# Read sample QC Table

pre_sample_qc_tb = hl.read_table(sample_qc_url)

# View structure of sample QC Table

pre_sample_qc_tb.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    's': str 
    'sample_qc': struct {
        dp_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_hom_ref: int64, 
        n_het: int64, 
        n_hom_var: int64, 
        n_non_ref: int64, 
        n_singleton: int64, 
        n_snp: int64, 
        n_insertion: int64, 
        n_deletion: int64, 
        n_transition: int64, 
        n_transversion: int64, 
        n_star: int64, 
        r_ti_tv: float64, 
        r_het_hom_var: float64, 
        r_insertion_deletion: float64
    } 
----------------------------------------
Key: ['s']
----------------------------------------

Let’s plot the call rate across the samples. This will help us decide what our cutoff should be.

from bokeh.io import output_notebook, show
output_notebook()

p = hl.plot.histogram(pre_sample_qc_tb["sample_qc"]["call_rate"])
show(p)

Loading BokehJS ...

Let’s filter for samples that have a call rate greater or equal to 0.99

# Filter sample QC Table using expressions
# Note: Viewing the structure of the sample QC table in from the cell above 
# shows us that the "call_rate" field is within the "sample_qc" struct field

sample_qc_tb = pre_sample_qc_tb.filter(
    pre_sample_qc_tb["sample_qc"]["call_rate"] >= 0.99)

If we do a count before and after filtering, we’ll see we dropped roughly 50000 participants using this filtering step.

# View number of samples in QC Table before and after filtering
#
# Note: running this cell can be computationally expensive and take
# longer for bigger datasets (this cell can be commented out).

print(f"Num samples before filtering: {pre_sample_qc_tb.count()}")
print(f"Num samples after filtering: {sample_qc_tb.count()}")

Num samples before filtering: 100000
Num samples after filtering: 50062

7.6.2 Filter MT with Sample QC Table

Now we can use .semi_join_cols() to filter our phenogeno_mt with sample_qc_tb.

# Filter the MT using the sample QC Table

qc_mt = phenogeno_mt.semi_join_cols(sample_qc_tb)

7.7 Build Variant QC table from MT, use to filter MT

Similar to the Sample QC, we can calculate a variant QC table for our MatrixTable, and extract it as a separate table and store it in DNAX. Then we can filter on this table, and use .semi_join_rows() to filter our MatrixTable.

7.7.1 4a) Extract Locus QC table

This table was built using

pre_locus_qc_tb = hl.variant_qc(mt).rows()

(see https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/locus_qc.ipynb for more info)

# Define locus QC Table url

locus_qc_url = "dnax://database-GQYV2Bj04bPg9X3KfFyj2jyY/variant_qc.ht"

# Read locus QC Table

pre_locus_qc_tb = hl.read_table(locus_qc_url)

# View structure of locus QC Table

pre_locus_qc_tb.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'locus': locus<GRCh38> 
    'alleles': array<str> 
    'rsid': str 
    'qual': float64 
    'filters': set<str> 
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        BaseQRankSum: float64, 
        ClippingRankSum: float64, 
        DB: bool, 
        DP: int32, 
        FS: float64, 
        InbreedingCoeff: float64, 
        MQ: float64, 
        MQRankSum: float64, 
        QD: float64, 
        ReadPosRankSum: float64, 
        SOR: float64, 
        VQSLOD: float64, 
        VQSR_culprit: str, 
        VQSR_NEGATIVE_TRAIN_SITE: bool, 
        VQSR_POSITIVE_TRAIN_SITE: bool, 
        GQ_HIST_ALT: array<str>, 
        DP_HIST_ALT: array<str>, 
        AB_HIST_ALT: array<str>, 
        GQ_HIST_ALL: str, 
        DP_HIST_ALL: str, 
        AB_HIST_ALL: str, 
        AC_AFR: array<int32>, 
        AC_AMR: array<int32>, 
        AC_ASJ: array<int32>, 
        AC_EAS: array<int32>, 
        AC_FIN: array<int32>, 
        AC_NFE: array<int32>, 
        AC_OTH: array<int32>, 
        AC_SAS: array<int32>, 
        AC_Male: array<int32>, 
        AC_Female: array<int32>, 
        AN_AFR: int32, 
        AN_AMR: int32, 
        AN_ASJ: int32, 
        AN_EAS: int32, 
        AN_FIN: int32, 
        AN_NFE: int32, 
        AN_OTH: int32, 
        AN_SAS: int32, 
        AN_Male: int32, 
        AN_Female: int32, 
        AF_AFR: array<float64>, 
        AF_AMR: array<float64>, 
        AF_ASJ: array<float64>, 
        AF_EAS: array<float64>, 
        AF_FIN: array<float64>, 
        AF_NFE: array<float64>, 
        AF_OTH: array<float64>, 
        AF_SAS: array<float64>, 
        AF_Male: array<float64>, 
        AF_Female: array<float64>, 
        GC_AFR: array<int32>, 
        GC_AMR: array<int32>, 
        GC_ASJ: array<int32>, 
        GC_EAS: array<int32>, 
        GC_FIN: array<int32>, 
        GC_NFE: array<int32>, 
        GC_OTH: array<int32>, 
        GC_SAS: array<int32>, 
        GC_Male: array<int32>, 
        GC_Female: array<int32>, 
        AC_raw: array<int32>, 
        AN_raw: int32, 
        AF_raw: array<float64>, 
        GC_raw: array<int32>, 
        GC: array<int32>, 
        Hom_AFR: array<int32>, 
        Hom_AMR: array<int32>, 
        Hom_ASJ: array<int32>, 
        Hom_EAS: array<int32>, 
        Hom_FIN: array<int32>, 
        Hom_NFE: array<int32>, 
        Hom_OTH: array<int32>, 
        Hom_SAS: array<int32>, 
        Hom_Male: array<int32>, 
        Hom_Female: array<int32>, 
        Hom_raw: array<int32>, 
        Hom: array<int32>, 
        STAR_AC: int32, 
        STAR_AC_raw: int32, 
        STAR_Hom: int32, 
        POPMAX: array<str>, 
        AC_POPMAX: array<int32>, 
        AN_POPMAX: array<int32>, 
        AF_POPMAX: array<float64>, 
        DP_MEDIAN: array<int32>, 
        DREF_MEDIAN: array<float64>, 
        GQ_MEDIAN: array<int32>, 
        AB_MEDIAN: array<float64>, 
        AS_RF: array<float64>, 
        AS_FilterStatus: array<str>, 
        AS_RF_POSITIVE_TRAIN: array<int32>, 
        AS_RF_NEGATIVE_TRAIN: array<int32>, 
        AC_AFR_Male: array<int32>, 
        AC_AMR_Male: array<int32>, 
        AC_ASJ_Male: array<int32>, 
        AC_EAS_Male: array<int32>, 
        AC_FIN_Male: array<int32>, 
        AC_NFE_Male: array<int32>, 
        AC_OTH_Male: array<int32>, 
        AC_SAS_Male: array<int32>, 
        AC_AFR_Female: array<int32>, 
        AC_AMR_Female: array<int32>, 
        AC_ASJ_Female: array<int32>, 
        AC_EAS_Female: array<int32>, 
        AC_FIN_Female: array<int32>, 
        AC_NFE_Female: array<int32>, 
        AC_OTH_Female: array<int32>, 
        AC_SAS_Female: array<int32>, 
        AN_AFR_Male: int32, 
        AN_AMR_Male: int32, 
        AN_ASJ_Male: int32, 
        AN_EAS_Male: int32, 
        AN_FIN_Male: int32, 
        AN_NFE_Male: int32, 
        AN_OTH_Male: int32, 
        AN_SAS_Male: int32, 
        AN_AFR_Female: int32, 
        AN_AMR_Female: int32, 
        AN_ASJ_Female: int32, 
        AN_EAS_Female: int32, 
        AN_FIN_Female: int32, 
        AN_NFE_Female: int32, 
        AN_OTH_Female: int32, 
        AN_SAS_Female: int32, 
        AF_AFR_Male: array<float64>, 
        AF_AMR_Male: array<float64>, 
        AF_ASJ_Male: array<float64>, 
        AF_EAS_Male: array<float64>, 
        AF_FIN_Male: array<float64>, 
        AF_NFE_Male: array<float64>, 
        AF_OTH_Male: array<float64>, 
        AF_SAS_Male: array<float64>, 
        AF_AFR_Female: array<float64>, 
        AF_AMR_Female: array<float64>, 
        AF_ASJ_Female: array<float64>, 
        AF_EAS_Female: array<float64>, 
        AF_FIN_Female: array<float64>, 
        AF_NFE_Female: array<float64>, 
        AF_OTH_Female: array<float64>, 
        AF_SAS_Female: array<float64>, 
        GC_AFR_Male: array<int32>, 
        GC_AMR_Male: array<int32>, 
        GC_ASJ_Male: array<int32>, 
        GC_EAS_Male: array<int32>, 
        GC_FIN_Male: array<int32>, 
        GC_NFE_Male: array<int32>, 
        GC_OTH_Male: array<int32>, 
        GC_SAS_Male: array<int32>, 
        GC_AFR_Female: array<int32>, 
        GC_AMR_Female: array<int32>, 
        GC_ASJ_Female: array<int32>, 
        GC_EAS_Female: array<int32>, 
        GC_FIN_Female: array<int32>, 
        GC_NFE_Female: array<int32>, 
        GC_OTH_Female: array<int32>, 
        GC_SAS_Female: array<int32>, 
        Hemi_AFR: array<int32>, 
        Hemi_AMR: array<int32>, 
        Hemi_ASJ: array<int32>, 
        Hemi_EAS: array<int32>, 
        Hemi_FIN: array<int32>, 
        Hemi_NFE: array<int32>, 
        Hemi_OTH: array<int32>, 
        Hemi_SAS: array<int32>, 
        Hemi: array<int32>, 
        Hemi_raw: array<int32>, 
        STAR_Hemi: int32
    } 
    'variant_qc': struct {
        dp_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        homozygote_count: array<int32>, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_het: int64, 
        n_non_ref: int64, 
        het_freq_hwe: float64, 
        p_value_hwe: float64
    } 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------

Let’s filter for loci that have: - an allele frequency (AF) value between 0.001-0.999, - a Hardy-Weinberg Equilibrium p-value (p_value_hwe) greater or equal to 1e-10, - a call rate (call_rate) greater or equal to 0.9

# Filter QC Table using expressions
# Note: Viewing the structure of the locus QC table in from the cell above 
# shows us that the "AF", "p_value_hwe", and "call_rate" fields are within
# the "variant_qc" struct field.

locus_qc_tb = pre_locus_qc_tb.filter(
    (pre_locus_qc_tb["variant_qc"]["AF"][0] >= 0.001) & 
    (pre_locus_qc_tb["variant_qc"]["AF"][0] <= 0.999) & 
    (pre_locus_qc_tb["variant_qc"]["p_value_hwe"] >= 1e-10) & 
    (pre_locus_qc_tb["variant_qc"]["call_rate"] >= 0.9)
)

# DON'T RUN in class
# View number of loci in QC Table before and after filtering
#
# Note: running this cell can be computationally expensive and take
# longer for bigger datasets (this cell can be commented out).

print(f"Num loci before filtering: {pre_locus_qc_tb.count()}")
print(f"Num loci after filtering: {locus_qc_tb.count()}")

Num loci before filtering: 3403513

Num loci after filtering: 39952

7.7.1.1 4b) Filter MT with variant QC Tables

Now we can use .semi_join_rows() to filter our phenogeno_mt with locus_qc_tb.

# Filter the MT using the locus QC Table

qc_mt = qc_mt.semi_join_rows(locus_qc_tb)

After filtereing, we have reduced the number of loci in our MatrixTable to be 39952 rows.

# DON'T RUN in class
# View MT after QC filters
# 
# Note: running 'mt.rows().count()' or 'mt.cols().count()' can be computationally 
# expensive and take longer for bigger datasets (these lines can be commented out).

print(f"Num partitions: {qc_mt.n_partitions()}")
print(f"Num loci: {qc_mt.rows().count()}")
print(f"Num samples: {qc_mt.cols().count()}")
qc_mt.describe()

Num partitions: 228
Num loci: 39952

2023-04-19 22:33:57 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'

Num samples: 50062
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'fid': str
    'father_iid': int32
    'mother_iid': int32
    'sex_code': int32
    'case_control_status': int32
    'ccs': int32
    'sc': int32
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        BaseQRankSum: float64, 
        ClippingRankSum: float64, 
        DB: bool, 
        DP: int32, 
        FS: float64, 
        InbreedingCoeff: float64, 
        MQ: float64, 
        MQRankSum: float64, 
        QD: float64, 
        ReadPosRankSum: float64, 
        SOR: float64, 
        VQSLOD: float64, 
        VQSR_culprit: str, 
        VQSR_NEGATIVE_TRAIN_SITE: bool, 
        VQSR_POSITIVE_TRAIN_SITE: bool, 
        GQ_HIST_ALT: array<str>, 
        DP_HIST_ALT: array<str>, 
        AB_HIST_ALT: array<str>, 
        GQ_HIST_ALL: str, 
        DP_HIST_ALL: str, 
        AB_HIST_ALL: str, 
        AC_AFR: array<int32>, 
        AC_AMR: array<int32>, 
        AC_ASJ: array<int32>, 
        AC_EAS: array<int32>, 
        AC_FIN: array<int32>, 
        AC_NFE: array<int32>, 
        AC_OTH: array<int32>, 
        AC_SAS: array<int32>, 
        AC_Male: array<int32>, 
        AC_Female: array<int32>, 
        AN_AFR: int32, 
        AN_AMR: int32, 
        AN_ASJ: int32, 
        AN_EAS: int32, 
        AN_FIN: int32, 
        AN_NFE: int32, 
        AN_OTH: int32, 
        AN_SAS: int32, 
        AN_Male: int32, 
        AN_Female: int32, 
        AF_AFR: array<float64>, 
        AF_AMR: array<float64>, 
        AF_ASJ: array<float64>, 
        AF_EAS: array<float64>, 
        AF_FIN: array<float64>, 
        AF_NFE: array<float64>, 
        AF_OTH: array<float64>, 
        AF_SAS: array<float64>, 
        AF_Male: array<float64>, 
        AF_Female: array<float64>, 
        GC_AFR: array<int32>, 
        GC_AMR: array<int32>, 
        GC_ASJ: array<int32>, 
        GC_EAS: array<int32>, 
        GC_FIN: array<int32>, 
        GC_NFE: array<int32>, 
        GC_OTH: array<int32>, 
        GC_SAS: array<int32>, 
        GC_Male: array<int32>, 
        GC_Female: array<int32>, 
        AC_raw: array<int32>, 
        AN_raw: int32, 
        AF_raw: array<float64>, 
        GC_raw: array<int32>, 
        GC: array<int32>, 
        Hom_AFR: array<int32>, 
        Hom_AMR: array<int32>, 
        Hom_ASJ: array<int32>, 
        Hom_EAS: array<int32>, 
        Hom_FIN: array<int32>, 
        Hom_NFE: array<int32>, 
        Hom_OTH: array<int32>, 
        Hom_SAS: array<int32>, 
        Hom_Male: array<int32>, 
        Hom_Female: array<int32>, 
        Hom_raw: array<int32>, 
        Hom: array<int32>, 
        STAR_AC: int32, 
        STAR_AC_raw: int32, 
        STAR_Hom: int32, 
        POPMAX: array<str>, 
        AC_POPMAX: array<int32>, 
        AN_POPMAX: array<int32>, 
        AF_POPMAX: array<float64>, 
        DP_MEDIAN: array<int32>, 
        DREF_MEDIAN: array<float64>, 
        GQ_MEDIAN: array<int32>, 
        AB_MEDIAN: array<float64>, 
        AS_RF: array<float64>, 
        AS_FilterStatus: array<str>, 
        AS_RF_POSITIVE_TRAIN: array<int32>, 
        AS_RF_NEGATIVE_TRAIN: array<int32>, 
        AC_AFR_Male: array<int32>, 
        AC_AMR_Male: array<int32>, 
        AC_ASJ_Male: array<int32>, 
        AC_EAS_Male: array<int32>, 
        AC_FIN_Male: array<int32>, 
        AC_NFE_Male: array<int32>, 
        AC_OTH_Male: array<int32>, 
        AC_SAS_Male: array<int32>, 
        AC_AFR_Female: array<int32>, 
        AC_AMR_Female: array<int32>, 
        AC_ASJ_Female: array<int32>, 
        AC_EAS_Female: array<int32>, 
        AC_FIN_Female: array<int32>, 
        AC_NFE_Female: array<int32>, 
        AC_OTH_Female: array<int32>, 
        AC_SAS_Female: array<int32>, 
        AN_AFR_Male: int32, 
        AN_AMR_Male: int32, 
        AN_ASJ_Male: int32, 
        AN_EAS_Male: int32, 
        AN_FIN_Male: int32, 
        AN_NFE_Male: int32, 
        AN_OTH_Male: int32, 
        AN_SAS_Male: int32, 
        AN_AFR_Female: int32, 
        AN_AMR_Female: int32, 
        AN_ASJ_Female: int32, 
        AN_EAS_Female: int32, 
        AN_FIN_Female: int32, 
        AN_NFE_Female: int32, 
        AN_OTH_Female: int32, 
        AN_SAS_Female: int32, 
        AF_AFR_Male: array<float64>, 
        AF_AMR_Male: array<float64>, 
        AF_ASJ_Male: array<float64>, 
        AF_EAS_Male: array<float64>, 
        AF_FIN_Male: array<float64>, 
        AF_NFE_Male: array<float64>, 
        AF_OTH_Male: array<float64>, 
        AF_SAS_Male: array<float64>, 
        AF_AFR_Female: array<float64>, 
        AF_AMR_Female: array<float64>, 
        AF_ASJ_Female: array<float64>, 
        AF_EAS_Female: array<float64>, 
        AF_FIN_Female: array<float64>, 
        AF_NFE_Female: array<float64>, 
        AF_OTH_Female: array<float64>, 
        AF_SAS_Female: array<float64>, 
        GC_AFR_Male: array<int32>, 
        GC_AMR_Male: array<int32>, 
        GC_ASJ_Male: array<int32>, 
        GC_EAS_Male: array<int32>, 
        GC_FIN_Male: array<int32>, 
        GC_NFE_Male: array<int32>, 
        GC_OTH_Male: array<int32>, 
        GC_SAS_Male: array<int32>, 
        GC_AFR_Female: array<int32>, 
        GC_AMR_Female: array<int32>, 
        GC_ASJ_Female: array<int32>, 
        GC_EAS_Female: array<int32>, 
        GC_FIN_Female: array<int32>, 
        GC_NFE_Female: array<int32>, 
        GC_OTH_Female: array<int32>, 
        GC_SAS_Female: array<int32>, 
        Hemi_AFR: array<int32>, 
        Hemi_AMR: array<int32>, 
        Hemi_ASJ: array<int32>, 
        Hemi_EAS: array<int32>, 
        Hemi_FIN: array<int32>, 
        Hemi_NFE: array<int32>, 
        Hemi_OTH: array<int32>, 
        Hemi_SAS: array<int32>, 
        Hemi: array<int32>, 
        Hemi_raw: array<int32>, 
        STAR_Hemi: int32
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

7.8 Run GWAS

Now that we’ve done our filtering, we can now run our GWAS on our filtered MatrixTable.

Quick reminder that our pheno file has a lot of missing values, so we’re going to be dropping a bunch of samples when we do our analysis.

qc_mt.ccs.show()


s	ccs
str	int32
"sample_1_1"	NA
"sample_1_3"	0
"sample_1_6"	NA
"sample_1_9"	NA
"sample_1_10"	1
"sample_1_11"	NA
"sample_1_12"	NA
"sample_1_13"	NA
"sample_1_14"	NA
"sample_1_15"	NA

showing top 10 rows

Additional documentation: https://hail.is/docs/0.2/methods/stats.html#hail.methods.logistic_regression_rows

# Run Hail's logistic regression method

gwas = hl.logistic_regression_rows(test="firth",
                                   y=qc_mt.col.ccs, # the column field of the pheno trait we are looking at ('ccs')
                                   x=qc_mt.GT.n_alt_alleles(), # n_alt_alleles() returns the count of non-reference alleles
                                   covariates=[1, qc_mt.col.sc])

2023-04-19 22:34:55 Hail: WARN: 40104 of 50062 samples have a missing phenotype or covariate.
2023-04-19 22:34:55 Hail: INFO: logistic_regression_rows: running firth on 9958 samples for response variable y,
    with input variable x, and 2 additional covariates...

# View structure of GWAS results Table

gwas.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'locus': locus<GRCh38> 
    'alleles': array<str> 
    'beta': float64 
    'chi_sq_stat': float64 
    'p_value': float64 
    'fit': struct {
        n_iterations: int32, 
        converged: bool, 
        exploded: bool
    } 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------

gwas.show()


					fit
locus	alleles	beta	chi_sq_stat	p_value	n_iterations	converged	exploded
locus<GRCh38>	array<str>	float64	float64	float64	int32	bool	bool
chr1:12719	["G","C"]	1.96e-02	1.35e-02	9.08e-01	3	True	False
chr1:13380	["C","G"]	-3.00e-01	9.08e-01	3.41e-01	5	True	False
chr1:13494	["A","G"]	-3.33e-01	6.83e-01	4.09e-01	5	True	False
chr1:13504	["G","A"]	1.50e-01	2.84e-01	5.94e-01	5	True	False
chr1:17358	["ACTT","A"]	4.46e-01	1.30e+00	2.54e-01	5	True	False
chr1:17373	["A","G"]	-3.88e-02	4.94e-02	8.24e-01	4	True	False
chr1:17379	["G","A"]	4.65e-01	2.92e+00	8.72e-02	4	True	False
chr1:17403	["A","G"]	1.51e-01	1.41e-01	7.07e-01	5	True	False
chr1:17408	["C","G"]	-8.03e-02	1.80e-01	6.71e-01	4	True	False
chr1:69735	["A","G"]	7.29e-02	3.18e-02	8.59e-01	5	True	False

showing top 10 rows


pandas_gwas = gwas.to_pandas()
pandas_gwas.to_csv("gwas_results.csv")

7.9 Visualize GWAS results

Bokeh is a Python library that is included in this JupyterLab environment- which makes it easy for us to import.

We’ll need the output_notebook and show modules to make our plots.

from bokeh.io import output_notebook, show
output_notebook()

Loading BokehJS ...

qq_plot = hl.plot.qq(gwas.p_value)
show(qq_plot)

2023-04-19 22:12:04 Hail: INFO: Ordering unsorted dataset with network shuffle

We can generate a Manhattan Plot on our p-values using hl.plot.manhattan().

manhattan_plot = hl.plot.manhattan(gwas_tb.p_value)
show(manhattan_plot)

7.10 Annotate GWAS results

Now that we have our GWAS results, we’ll annotate our file using Hail’s experimental DB, which is hosted on Open Data on AWS.

Note that this resource is not available in the RAP region. Here we’re going to add GENCODE annotations to our matrix. There are a lot of other annotations available in the DB as well.

There is the query builder if you have access to the experimental annotations. This will help you write your db.annotate_rows_db() statement.

https://hail.is/docs/0.2/annotation_database_ui.html#database-query

db = hl.experimental.DB(region='us', cloud='aws')
ann_gwas_tb = db.annotate_rows_db(gwas, 'gencode')
ann_gwas_tb.show()


					fit
locus	alleles	beta	chi_sq_stat	p_value	n_iterations	converged	exploded	gencode
locus<GRCh38>	array<str>	float64	float64	float64	int32	bool	bool	array<struct{ID: str, gene_id: str, gene_name: str, gene_type: str, level: str, type: str, gene_score: str, gene_strand: str, gene_phase: str}>
chr1:12719	["G","C"]	1.96e-02	1.35e-02	9.08e-01	3	True	False	[("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")]
chr1:13380	["C","G"]	-3.00e-01	9.08e-01	3.41e-01	5	True	False	[("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")]
chr1:13494	["A","G"]	-3.33e-01	6.83e-01	4.09e-01	5	True	False	[("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")]
chr1:13504	["G","A"]	1.50e-01	2.84e-01	5.94e-01	5	True	False	[("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")]
chr1:17358	["ACTT","A"]	4.46e-01	1.30e+00	2.54e-01	5	True	False	[("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]
chr1:17373	["A","G"]	-3.88e-02	4.94e-02	8.24e-01	4	True	False	[("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]
chr1:17379	["G","A"]	4.65e-01	2.92e+00	8.72e-02	4	True	False	[("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]
chr1:17403	["A","G"]	1.51e-01	1.41e-01	7.07e-01	5	True	False	[("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]
chr1:17408	["C","G"]	-8.03e-02	1.80e-01	6.71e-01	4	True	False	[("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]
chr1:69735	["A","G"]	7.29e-02	3.18e-02	8.59e-01	5	True	False	[("ENSG00000186092.6","ENSG00000186092.6","OR4F5","protein_coding","2","gene",".","+",".")]

showing top 10 rows

On RAP, we do not have access to this resource, as it isn’t available in the UKB RAP region, so we must use the Hail/VEP configuration (note we started with the Hail/VEP features).

hl.vep() requires a config file in JSON format. I have used the one available here: https://documentation.dnanexus.com/user/jupyter-notebooks/dxjupyterlab-spark-cluster#using-vep-with-hail

ann_gwas_tb2 = hl.vep(ann_gwas_tb, "file:///mnt/project/notebooks/config.json")
ann_gwas_tb2.show()


					fit				vep																	vep_proc_id
locus	alleles	beta	chi_sq_stat	p_value	n_iterations	converged	exploded	gencode	assembly_name	allele_string	ancestral	colocated_variants	context	end	id	input	intergenic_consequences	most_severe_consequence	motif_feature_consequences	regulatory_feature_consequences	seq_region_name	start	strand	transcript_consequences	variant_class	part_idx	block_idx
locus<GRCh38>	array<str>	float64	float64	float64	int32	bool	bool	array<struct{ID: str, gene_id: str, gene_name: str, gene_type: str, level: str, type: str, gene_score: str, gene_strand: str, gene_phase: str}>	str	str	str	array<struct{aa_allele: str, aa_maf: float64, afr_allele: str, afr_maf: float64, allele_string: str, amr_allele: str, amr_maf: float64, clin_sig: array<str>, end: int32, eas_allele: str, eas_maf: float64, ea_allele: str, ea_maf: float64, eur_allele: str, eur_maf: float64, exac_adj_allele: str, exac_adj_maf: float64, exac_allele: str, exac_afr_allele: str, exac_afr_maf: float64, exac_amr_allele: str, exac_amr_maf: float64, exac_eas_allele: str, exac_eas_maf: float64, exac_fin_allele: str, exac_fin_maf: float64, exac_maf: float64, exac_nfe_allele: str, exac_nfe_maf: float64, exac_oth_allele: str, exac_oth_maf: float64, exac_sas_allele: str, exac_sas_maf: float64, id: str, minor_allele: str, minor_allele_freq: float64, phenotype_or_disease: int32, pubmed: array<int32>, sas_allele: str, sas_maf: float64, somatic: int32, start: int32, strand: int32}>	str	int32	str	str	array<struct{allele_num: int32, consequence_terms: array<str>, impact: str, minimised: int32, variant_allele: str}>	str	array<struct{allele_num: int32, consequence_terms: array<str>, high_inf_pos: str, impact: str, minimised: int32, motif_feature_id: str, motif_name: str, motif_pos: int32, motif_score_change: float64, strand: int32, variant_allele: str}>	array<struct{allele_num: int32, biotype: str, consequence_terms: array<str>, impact: str, minimised: int32, regulatory_feature_id: str, variant_allele: str}>	str	int32	int32	array<struct{allele_num: int32, amino_acids: str, appris: str, biotype: str, canonical: int32, ccds: str, cdna_start: int32, cdna_end: int32, cds_end: int32, cds_start: int32, codons: str, consequence_terms: array<str>, distance: int32, domains: array<struct{db: str, name: str}>, exon: str, gene_id: str, gene_pheno: int32, gene_symbol: str, gene_symbol_source: str, hgnc_id: str, hgvsc: str, hgvsp: str, hgvs_offset: int32, impact: str, intron: str, lof: str, lof_flags: str, lof_filter: str, lof_info: str, minimised: int32, polyphen_prediction: str, polyphen_score: float64, protein_end: int32, protein_start: int32, protein_id: str, sift_prediction: str, sift_score: float64, strand: int32, swissprot: str, transcript_id: str, trembl: str, tsl: int32, uniparc: str, variant_allele: str}>	str	int32	int32
chr1:12719	["G","C"]	1.96e-02	1.35e-02	9.08e-01	3	True	False	[("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")]	"GRCh38"	"G/C"	NA	[(NA,NA,NA,NA,"G/C",NA,NA,NA,12719,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs1410641955",NA,NA,NA,NA,NA,NA,NA,12719,1)]	NA	12719	"."	"chr1 12719 . G C . . GT"	NA	"splice_region_variant"	NA	NA	"chr1"	12719	1	[(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["intron_variant","non_coding_transcript_variant"],NA,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000450305.2:n.182+22G>C",NA,NA,"MODIFIER","3/5",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"C"),(1,NA,NA,"processed_transcript",1,NA,466,466,NA,NA,NA,["splice_region_variant","non_coding_transcript_exon_variant"],NA,NA,"2/3","ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000456328.2:n.466G>C",NA,NA,"LOW",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"C"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],1685,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"C"),(1,NA,NA,"miRNA",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],4650,NA,NA,"ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"C")]	"SNV"	0	0
chr1:13380	["C","G"]	-3.00e-01	9.08e-01	3.41e-01	5	True	False	[("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")]	"GRCh38"	"C/G"	NA	[(NA,NA,NA,NA,"C/G",NA,NA,NA,13380,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs571093408","G",8.20e-03,NA,NA,NA,NA,NA,13380,1)]	NA	13380	"."	"chr1 13380 . C G . . GT"	NA	"splice_region_variant"	[(1,["TF_binding_site_variant"],"N","MODIFIER",NA,"ENSM00080464946","ENSPFM0257",19,-2.10e-02,-1,"G")]	[(1,"promoter_flanking_region",["regulatory_region_variant"],"MODIFIER",NA,"ENSR00001164745","G")]	"chr1"	13380	1	[(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["splice_region_variant","intron_variant","non_coding_transcript_variant"],NA,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000450305.2:n.414+6C>G",NA,NA,"LOW","5/5",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"G"),(1,NA,NA,"processed_transcript",1,NA,628,628,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"3/3","ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000456328.2:n.628C>G",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"G"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],1024,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"G"),(1,NA,NA,"miRNA",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3989,NA,NA,"ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"G")]	"SNV"	0	0
chr1:13494	["A","G"]	-3.33e-01	6.83e-01	4.09e-01	5	True	False	[("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")]	"GRCh38"	"A/G"	NA	[(NA,NA,NA,NA,"A/G",NA,NA,NA,13494,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs574697788","G",1.40e-03,NA,NA,NA,NA,NA,13494,1)]	NA	13494	"."	"chr1 13494 . A G . . GT"	NA	"non_coding_transcript_exon_variant"	NA	[(1,"CTCF_binding_site",["regulatory_region_variant"],"MODIFIER",NA,"ENSR00000344265","G"),(1,"promoter_flanking_region",["regulatory_region_variant"],"MODIFIER",NA,"ENSR00001164745","G")]	"chr1"	13494	1	[(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,456,456,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"6/6","ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000450305.2:n.456A>G",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"G"),(1,NA,NA,"processed_transcript",1,NA,742,742,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"3/3","ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000456328.2:n.742A>G",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"G"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],910,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"G"),(1,NA,NA,"miRNA",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3875,NA,NA,"ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"G")]	"SNV"	0	0
chr1:13504	["G","A"]	1.50e-01	2.84e-01	5.94e-01	5	True	False	[("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")]	"GRCh38"	"G/A"	NA	[(NA,NA,NA,NA,"G/A",NA,NA,NA,13504,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs199896944",NA,NA,NA,NA,NA,NA,NA,13504,1)]	NA	13504	"."	"chr1 13504 . G A . . GT"	NA	"non_coding_transcript_exon_variant"	NA	[(1,"CTCF_binding_site",["regulatory_region_variant"],"MODIFIER",NA,"ENSR00000344265","A"),(1,"promoter_flanking_region",["regulatory_region_variant"],"MODIFIER",NA,"ENSR00001164745","A")]	"chr1"	13504	1	[(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,466,466,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"6/6","ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000450305.2:n.466G>A",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"A"),(1,NA,NA,"processed_transcript",1,NA,752,752,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"3/3","ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000456328.2:n.752G>A",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"A"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],900,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"A"),(1,NA,NA,"miRNA",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3865,NA,NA,"ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"A")]	"SNV"	0	0
chr1:17358	["ACTT","A"]	4.46e-01	1.30e+00	2.54e-01	5	True	False	[("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]	"GRCh38"	"CTT/-"	NA	[(NA,NA,NA,NA,"CTTCTTCT/CTTCT",NA,NA,NA,17366,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs749387668",NA,NA,NA,NA,NA,NA,NA,17359,1)]	NA	17361	"."	"chr1 17358 . ACTT A . . GT"	NA	"non_coding_transcript_exon_variant"	NA	NA	"chr1"	17359	1	[(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3689,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"-"),(1,NA,NA,"processed_transcript",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],2950,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"-"),(1,NA,NA,"unprocessed_pseudogene",1,NA,582,584,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"6/11","ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034","ENST00000488147.1:n.582_584del",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"-"),(1,NA,NA,"miRNA",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],8,NA,NA,"ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"-")]	"deletion"	0	0
chr1:17373	["A","G"]	-3.88e-02	4.94e-02	8.24e-01	4	True	False	[("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]	"GRCh38"	"A/G"	NA	[(NA,NA,NA,NA,"A/G",NA,NA,NA,17373,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs750111615",NA,NA,NA,NA,NA,NA,NA,17373,1)]	NA	17373	"."	"chr1 17373 . A G . . GT"	NA	"splice_region_variant"	NA	NA	"chr1"	17373	1	[(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3703,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"G"),(1,NA,NA,"processed_transcript",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],2964,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"G"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["splice_region_variant","intron_variant","non_coding_transcript_variant"],NA,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034","ENST00000488147.1:n.575-5T>C",NA,NA,"LOW","5/10",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"G"),(1,NA,NA,"miRNA",1,NA,64,64,NA,NA,NA,["mature_miRNA_variant"],NA,NA,"1/1","ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039","ENST00000619216.1:n.64T>C",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"G")]	"SNV"	0	0
chr1:17379	["G","A"]	4.65e-01	2.92e+00	8.72e-02	4	True	False	[("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]	"GRCh38"	"G/A"	NA	[(NA,NA,NA,NA,"G/A",NA,NA,NA,17379,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs754322362",NA,NA,NA,NA,NA,NA,NA,17379,1)]	NA	17379	"."	"chr1 17379 . G A . . GT"	NA	"mature_miRNA_variant"	NA	NA	"chr1"	17379	1	[(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3709,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"A"),(1,NA,NA,"processed_transcript",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],2970,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"A"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["intron_variant","non_coding_transcript_variant"],NA,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034","ENST00000488147.1:n.575-11C>T",NA,NA,"MODIFIER","5/10",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"A"),(1,NA,NA,"miRNA",1,NA,58,58,NA,NA,NA,["mature_miRNA_variant"],NA,NA,"1/1","ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039","ENST00000619216.1:n.58C>T",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"A")]	"SNV"	0	0
chr1:17403	["A","G"]	1.51e-01	1.41e-01	7.07e-01	5	True	False	[("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]	"GRCh38"	"A/G"	NA	[(NA,NA,NA,NA,"A/G",NA,NA,NA,17403,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs111588939",NA,NA,NA,NA,NA,NA,NA,17403,1)]	NA	17403	"."	"chr1 17403 . A G . . GT"	NA	"non_coding_transcript_exon_variant"	NA	NA	"chr1"	17403	1	[(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3733,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"G"),(1,NA,NA,"processed_transcript",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],2994,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"G"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["intron_variant","non_coding_transcript_variant"],NA,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034","ENST00000488147.1:n.575-35T>C",NA,NA,"MODIFIER","5/10",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"G"),(1,NA,NA,"miRNA",1,NA,34,34,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"1/1","ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039","ENST00000619216.1:n.34T>C",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"G")]	"SNV"	0	0
chr1:17408	["C","G"]	-8.03e-02	1.80e-01	6.71e-01	4	True	False	[("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]	"GRCh38"	"C/G"	NA	[(NA,NA,NA,NA,"C/G",NA,NA,NA,17408,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs747093451",NA,NA,NA,NA,NA,NA,NA,17408,1)]	NA	17408	"."	"chr1 17408 . C G . . GT"	NA	"non_coding_transcript_exon_variant"	NA	NA	"chr1"	17408	1	[(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3738,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"G"),(1,NA,NA,"processed_transcript",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],2999,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"G"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["intron_variant","non_coding_transcript_variant"],NA,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034","ENST00000488147.1:n.575-40G>C",NA,NA,"MODIFIER","5/10",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"G"),(1,NA,NA,"miRNA",1,NA,29,29,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"1/1","ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039","ENST00000619216.1:n.29G>C",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"G")]	"SNV"	0	0
chr1:69735	["A","G"]	7.29e-02	3.18e-02	8.59e-01	5	True	False	[("ENSG00000186092.6","ENSG00000186092.6","OR4F5","protein_coding","2","gene",".","+",".")]	"GRCh38"	"A/G"	NA	[(NA,NA,NA,NA,"A/G",NA,NA,NA,69735,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs1245478362",NA,NA,NA,NA,NA,NA,NA,69735,1)]	NA	69735	"."	"chr1 69735 . A G . . GT"	NA	"synonymous_variant"	NA	NA	"chr1"	69735	1	[(1,"L","P2","protein_coding",NA,"CCDS30547.1",681,681,645,645,"ctA/ctG",["synonymous_variant"],NA,[("Gene3D","1"),("Pfam","PF13853"),("PROSITE_profiles","PS50262"),("PANTHER","PTHR26451"),("PANTHER","PTHR26451"),("Superfamily","SSF81321"),("Transmembrane_helices","TMhelix"),("CDD","cd15226")],"1/1","ENSG00000186092",NA,"OR4F5","HGNC","HGNC:14825","ENST00000335137.4:c.645A>G","ENSP00000334393.3:p.Leu215=",NA,"LOW",NA,NA,NA,NA,NA,NA,NA,NA,215,215,"ENSP00000334393",NA,NA,1,NA,"ENST00000335137",NA,NA,NA,"G"),(1,"L","A2","protein_coding",1,NA,768,768,708,708,"ctA/ctG",["synonymous_variant"],NA,[("Gene3D","1"),("PROSITE_profiles","PS50262"),("Transmembrane_helices","TMhelix"),("Superfamily","SSF81321"),("PANTHER","PTHR26451"),("PANTHER","PTHR26451"),("Pfam","PF13853"),("CDD","cd15226")],"3/3","ENSG00000186092",NA,"OR4F5","HGNC","HGNC:14825","ENST00000641515.2:c.708A>G","ENSP00000493376.2:p.Leu236=",NA,"LOW",NA,NA,NA,NA,NA,NA,NA,NA,236,236,"ENSP00000493376",NA,NA,1,NA,"ENST00000641515",NA,NA,NA,"G")]	"SNV"	0	0

showing top 10 rows

7.11 Export GWAS Table, Export MT to BGEN

We can convert our Hail GWAS table and bring it into memory using the .to_pandas() method. Then we can write it to HDFS using .to_csv().

# Convert to Pandas DataFrame
ann_gwas_pd = ann_gwas_tb.to_pandas()
# Save as CSV file to Hadoop File System
ann_gwas_pd.to_csv("gwas-results.csv")

Our gwas-results.csv file is now in the Hadoop File System. Then we can run hdfs get to pull it onto the disk of the Driver node. Then we can use dx upload to get it into project storage.

%%bash

#Fetch results file from Hadoop File System and save to Driver Node Storage
hdfs dfs -get gwas-results.csv
#Upload resutls file to Project Storage
dx upload gwas-results.csv --destination "/users/tladeras/"

If we want to output BGEN files, there is no direct way for the notebook to write data into the project, so we will first write into HDFS (see https://documentation.dnanexus.com/developer/apps/developing-spark-apps#spark-cluster-management-software).

After writing out the BGEN files to HDFS, we can then move the data to the project in the next step as above.

Note that we can also do this in a single step by using dxFUSE in -limitedWrite mode.

Additional documentation: https://hail.is/docs/0.2/methods/impex.html#hail.methods.export_bgen

# Create a set of unique chromosomes found in MT

chr_set = mt.aggregate_rows(hl.agg.collect_as_set(mt.locus.contig))

# Filter MT to a single chromosome and write out as BGEN file to HDFS as a single file for each chromosome in the MT
for chrom in chr_set:
    filtered_mt = hl.filter_intervals(mt, [hl.parse_locus_interval(chrom, reference_genome="GRCh38"),])
    hl.export_bgen(filtered_mt, f"{chrom}")

%%bash
# Copy BGEN files from HDFS to the JupyterLab execution environment file system

hdfs dfs -get ./*.bgen .

%%bash
# Copy SAMPLE files from HDFS to the JupyterLab execution environment file system

hdfs dfs -get ./*.sample .

%%bash
# Upload BGEN and SAMPLE files to project

dx upload *.bgen
dx upload *.sample

Finally, we can save our filtered MatrixTable to dnax so we can use it later.

# Store Table in DNAXc

import dxpy
db_name = "db_mt_test2"
tb_name = "filtered.mt"

# find database ID of newly created database using a dxpy methodc
db_uri = dxpy.find_one_data_object(name=f"{db_name}", classname="database")['id']
url = f"dnax://{db_uri}/{tb_name}"

# Before this step, the Hail Table is just an object in memory. To persist it and be able to access 
# it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
# See https://hail.is/docs/0.2/hail.Table.html#hail.Table.write for additional documentation.
qc_tb.write(url) # Note: output should describe size of Table (i.e. number of rows, partitions)

Non-missing	19884 (100.00%)
Missing	0
Min Size	10
Max Size	14
Mean Size	13.89
Sample Values	['sample_1_10', 'sample_1_10003', 'sample_1_10014', 'sample_1_10020', 'sample_1_10021']

Non-missing	19884 (100.00%)
Missing	0
Minimum	1
Maximum	2
Mean	1.50
Std Dev	0.50