7  GWAS using Hail

7.1 Learning Objectives of this Session

  • Import data into a Hail MatrixTable from pVCF and BGEN formats
  • Merge phenotypic data into the MatrixTable
  • QC and Filter Variant data for use in a GWAS using Hail
  • QC and Filter Sample data for use in GWAS with Hail
  • Run GWAS on a Hail MatrixTable
  • Annotate Hail MatrixTables with internal DNAnexus resources
  • Export results from Hail to CSV format

Suggested Configuration: mem2_ssd1_v1_x8 with at least 6 nodes (use Hail/VEP feature set)

This notebook shows how to perform a GWAS for 1 case–control trait using Firth’s logistic regression with Hail and save the results as a Hail Table to an Apollo database (dnax://) on the DNAnexus platform. See documentation for guidance on launch specs for the JupyterLab with Spark Cluster app for different data sizes: https://documentation.dnanexus.com/science/using-hail-to-analyze-genomic-data

Note: For population scale data, samples may be referred to as individuals. In this notebook, the word “sample” will be used.

Pre-conditions for running this notebook successfully:

  • There is an existing Hail MatrixTable in DNAX (see pVCF Import Notebook and BGEN Import Notebook)
  • There is a Sample QC Hail Table in DNAX (see https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/sample_qc.ipynb)
  • There is a Variant QC Hail Table in DNAX (see https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/locus_qc.ipynb)
  • There is phenotypic data for the samples

7.2 Basic GWAS process

This table shows the basic GWAS process using Hail. We will cover each of these in the sections below.

Step Description Code Example
0. Initiate Spark and Hail See Notebook
1. Load pVCF/BGEN data, save as MatrixTable (MT) Section 7.4 mt = hl.import_vcf(path) or mt = hl.import_bgen(path)
2. Load Pheno file, merge with MatrixTable Section 7.5 phenogeno_mt = mt.annotate_cols(**pheno_table[mt.s])
3. Build Sample QC table from MT, use to filter MT Section 7.6 sample_qc_tb = hl.sample_qc(mt); qc_mt = phenogeno_mt.semi_join_rows(locus_qc_tb)
4. Build Locus QC table from MT, use to filter MT Section 7.7 locus_qc_tb = hl.locus_qc(mt); qc_mt = qc_mt.semi_join_cols(sample_qc_tb)
5. Run GWAS Section 7.8 gwas_tb = hl.logistic_regression.rows()
6. Visualize Results Section 7.9 manhattan_plot = hl.plot.manhattan(gwas_tb.p_value); show(manhattan_plot)
7. Annotate Results with VEP or annotation db Section 7.10 ann_gwas_tb = db.annotate_rows_db(gwas_tb, 'gencode'
8. Save results to CSV, export chromosomes as BGEN file See Notebook

7.3 Initialize Spark and Hail

Once you run the code below, make sure to open up another tab at https://JOB-URL:8081/jobs to look at the Spark UI. It is extremely helpful in understanding what’s going on behind the scenes.

# Running this cell will output a red-colored message- this is expected.
# The 'Welcome to Hail' message in the output will indicate that Hail is ready to use in the notebook.

from pyspark.sql import SparkSession
import hail as hl

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

db_name = "db_mt_test2"
pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jarRunning on Apache Spark version 2.4.4
SparkUI available at http://ip-172-31-34-200.ec2.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.78-b17627756568
LOGGING: writing to /opt/notebooks/hail-20230419-2230-0.2.78-b17627756568.log

7.4 Read Genotype MatrixTable

We’ll read in the MatrixTable we loaded in from the pVCF files.

See (supplementary_notebooks/01_pVCF_import.ipynb for more info)

This was the code used to load the VCF files:

path = "file:///mnt/project/Geno_Data/*.vcf.gz"

mt = hl.import_vcf(path, force_bgz=True,
                   reference_genome="GRCh38",
                   array_elements_required=False)
# read MT
mt_url="dnax://database-GQYV2Bj04bPg9X3KfFyj2jyY/geno.mt"
mt = hl.read_matrix_table(mt_url)

We can look at the row fields (Variant metadata) using .rows().show():

mt.rows().show()
info
locus
alleles
rsid
qual
filters
AC
AF
AN
BaseQRankSum
ClippingRankSum
DB
DP
FS
InbreedingCoeff
MQ
MQRankSum
QD
ReadPosRankSum
SOR
VQSLOD
VQSR_culprit
VQSR_NEGATIVE_TRAIN_SITE
VQSR_POSITIVE_TRAIN_SITE
GQ_HIST_ALT
DP_HIST_ALT
AB_HIST_ALT
GQ_HIST_ALL
DP_HIST_ALL
AB_HIST_ALL
AC_AFR
AC_AMR
AC_ASJ
AC_EAS
AC_FIN
AC_NFE
AC_OTH
AC_SAS
AC_Male
AC_Female
AN_AFR
AN_AMR
AN_ASJ
AN_EAS
AN_FIN
AN_NFE
AN_OTH
AN_SAS
AN_Male
AN_Female
AF_AFR
AF_AMR
AF_ASJ
AF_EAS
AF_FIN
AF_NFE
AF_OTH
AF_SAS
AF_Male
AF_Female
GC_AFR
GC_AMR
GC_ASJ
GC_EAS
GC_FIN
GC_NFE
GC_OTH
GC_SAS
GC_Male
GC_Female
AC_raw
AN_raw
AF_raw
GC_raw
GC
Hom_AFR
Hom_AMR
Hom_ASJ
Hom_EAS
Hom_FIN
Hom_NFE
Hom_OTH
Hom_SAS
Hom_Male
Hom_Female
Hom_raw
Hom
STAR_AC
STAR_AC_raw
STAR_Hom
POPMAX
AC_POPMAX
AN_POPMAX
AF_POPMAX
DP_MEDIAN
DREF_MEDIAN
GQ_MEDIAN
AB_MEDIAN
AS_RF
AS_FilterStatus
AS_RF_POSITIVE_TRAIN
AS_RF_NEGATIVE_TRAIN
AC_AFR_Male
AC_AMR_Male
AC_ASJ_Male
AC_EAS_Male
AC_FIN_Male
AC_NFE_Male
AC_OTH_Male
AC_SAS_Male
AC_AFR_Female
AC_AMR_Female
AC_ASJ_Female
AC_EAS_Female
AC_FIN_Female
AC_NFE_Female
AC_OTH_Female
AC_SAS_Female
AN_AFR_Male
AN_AMR_Male
AN_ASJ_Male
AN_EAS_Male
AN_FIN_Male
AN_NFE_Male
AN_OTH_Male
AN_SAS_Male
AN_AFR_Female
AN_AMR_Female
AN_ASJ_Female
AN_EAS_Female
AN_FIN_Female
AN_NFE_Female
AN_OTH_Female
AN_SAS_Female
AF_AFR_Male
AF_AMR_Male
AF_ASJ_Male
AF_EAS_Male
AF_FIN_Male
AF_NFE_Male
AF_OTH_Male
AF_SAS_Male
AF_AFR_Female
AF_AMR_Female
AF_ASJ_Female
AF_EAS_Female
AF_FIN_Female
AF_NFE_Female
AF_OTH_Female
AF_SAS_Female
GC_AFR_Male
GC_AMR_Male
GC_ASJ_Male
GC_EAS_Male
GC_FIN_Male
GC_NFE_Male
GC_OTH_Male
GC_SAS_Male
GC_AFR_Female
GC_AMR_Female
GC_ASJ_Female
GC_EAS_Female
GC_FIN_Female
GC_NFE_Female
GC_OTH_Female
GC_SAS_Female
Hemi_AFR
Hemi_AMR
Hemi_ASJ
Hemi_EAS
Hemi_FIN
Hemi_NFE
Hemi_OTH
Hemi_SAS
Hemi
Hemi_raw
STAR_Hemi
locus<GRCh38> array<str> str float64 set<str> array<int32> array<float64> int32 float64 float64 bool int32 float64 float64 float64 float64 float64 float64 float64 float64 str bool bool array<str> array<str> array<str> str str str array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> int32 int32 int32 int32 int32 int32 int32 int32 int32 int32 array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> int32 array<float64> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> int32 int32 int32 array<str> array<int32> array<int32> array<float64> array<int32> array<float64> array<int32> array<float64> array<float64> array<str> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> int32 int32 int32 int32 int32 int32 int32 int32 int32 int32 int32 int32 int32 int32 int32 int32 array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<float64> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> array<int32> int32
chr1:12198 ["G","C"] "chr1_12198_G_C" 9.88e+03 {"AC0"} [0] NA 0 0.00e+00 3.58e-01 False 9204 0.00e+00 9.80e-03 2.30e+01 7.36e-01 1.40e+01 7.36e-01 3.02e-01 1.01e+00 "MQ" False False ["14|79|1|25|7|4|0|5|2|0|0|0|1|0|0|0|0|0|0|0"] ["131|7|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0"] ["0|0|0|0|1|0|2|0|2|0|10|0|1|28|0|3|0|0|0|0"] "1859|505|25|28|8|4|0|5|2|0|0|0|1|0|0|0|0|0|0|0" "2413|24|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0" "0|0|0|0|1|0|2|0|2|0|10|0|1|28|0|3|0|0|0|0" [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] 0 0 0 0 0 0 0 0 0 0 NA NA NA NA NA NA NA NA NA NA [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [229] 4874 [4.70e-02] [2299,47,91] [0,0,0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [91] [0] NA NA NA NA NA NA NA [2] [4.01e-06] [6] [6.67e-01] [1.27e-01] ["AC0"] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
chr1:12237 ["G","A"] "chr1_12237_G_A" 8.20e+01 {"RF","AC0"} [0] NA 0 7.36e-01 -3.58e-01 False 16096 0.00e+00 -1.44e-01 2.20e+01 -3.58e-01 4.31e+00 3.58e-01 1.29e+00 1.57e+00 "QD" False False ["0|1|0|1|1|0|0|0|2|0|0|0|0|0|0|0|0|0|0|0"] ["3|2|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0"] ["0|0|0|0|0|0|0|0|0|0|1|0|2|1|0|0|0|0|0|0"] "3537|1722|177|100|15|1|0|0|2|0|0|0|0|0|0|0|0|0|0|0" "5428|126|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0" "0|0|0|0|0|0|0|0|0|0|1|0|2|1|0|0|0|0|0|0" [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] 0 0 0 0 0 0 0 0 0 0 NA NA NA NA NA NA NA NA NA NA [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [6] 11108 [5.40e-04] [5549,4,1] [0,0,0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [1] [0] NA NA NA NA NA NA NA [4] [4.01e-06] [22] [6.00e-01] [8.80e-02] ["RF|AC0"] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
chr1:12259 ["G","C"] "chr1_12259_G_C" 3.74e+01 {"RF","AC0"} [0] [0.00e+00] 2 NA NA False 18814 0.00e+00 -1.42e-01 2.00e+01 NA 9.36e+00 NA 6.93e-01 2.24e+00 "MQ" False False ["0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0"] ["1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0"] ["0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0"] "3885|1993|253|143|22|3|1|0|0|0|0|0|0|0|0|0|0|0|0|0" "6117|182|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0" "0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0" [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] 0 0 0 0 0 0 0 2 2 0 NA NA NA NA NA NA NA [0.00e+00] [0.00e+00] NA [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [1,0,0] [1,0,0] [0,0,0] [2] 12600 [1.59e-04] [6299,0,1] [1,0,0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [1] [0] NA NA NA NA NA NA NA [2] [1.01e-05] [6] [4.62e-01] [4.93e-02] ["RF|AC0"] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
chr1:12266 ["G","A"] "chr1_12266_G_A" 2.72e+03 {"AC0"} [0] NA 0 8.04e-01 0.00e+00 False 18372 0.00e+00 -1.38e-01 2.23e+01 -7.27e-01 1.61e+01 4.06e-01 2.03e-01 1.27e+00 "MQ" False False ["0|15|0|0|0|2|0|2|5|0|1|1|1|3|0|6|0|0|0|0"] ["33|3|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0"] ["0|0|0|0|0|2|0|0|2|0|9|0|0|8|0|0|0|0|0|0"] "3755|1684|188|101|11|4|0|2|5|0|1|1|1|3|0|6|0|0|0|0" "5598|163|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0" "0|0|0|0|0|2|0|0|2|0|9|0|0|8|0|0|0|0|0|0" [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] 0 0 0 0 0 0 0 0 0 0 NA NA NA NA NA NA NA NA NA NA [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [51] 11524 [4.43e-03] [5726,21,15] [0,0,0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [15] [0] NA NA NA NA NA NA NA [3] [1.19e-08] [39] [5.00e-01] [2.37e-01] ["AC0"] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
chr1:12272 ["G","A"] "chr1_12272_G_A" 2.71e+03 {"AC0"} [0] [0.00e+00] 2 8.04e-01 3.58e-01 False 17685 0.00e+00 -1.34e-01 2.23e+01 0.00e+00 1.69e+01 0.00e+00 1.96e-01 3.60e-02 "MQ" False False ["0|14|0|0|0|2|0|2|5|0|1|1|1|3|0|6|0|0|0|0"] ["31|4|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0"] ["0|0|0|0|0|0|1|0|3|0|10|0|0|7|0|0|0|0|0|0"] "3715|1603|160|87|9|2|1|2|5|0|1|1|1|3|0|6|0|0|0|0" "5448|147|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0" "0|0|0|0|0|0|1|0|3|0|10|0|0|7|0|0|0|0|0|0" [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] 0 2 0 0 0 0 0 0 0 2 NA [0.00e+00] NA NA NA NA NA NA NA [0.00e+00] [0,0,0] [1,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [0,0,0] [1,0,0] [49] 11192 [4.38e-03] [5561,21,14] [1,0,0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [14] [0] NA NA NA NA NA NA NA [3] [7.94e-09] [39] [5.00e-01] [2.32e-01] ["AC0"] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
chr1:12554 ["A","G"] "chr1_12554_A_G" 6.81e+01 {"RF","AC0"} [0] [0.00e+00] 3022 1.06e+00 0.00e+00 False 138436 0.00e+00 -8.07e-02 2.25e+01 3.22e-01 6.20e-01 -6.70e-02 1.88e-01 1.94e+00 "QD" False False ["0|2|0|1|1|1|3|1|2|0|0|0|0|0|0|0|0|0|0|0"] ["3|6|2|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0"] ["0|0|0|2|2|2|0|0|2|0|1|0|0|0|0|0|0|0|0|0"] "4802|6590|2323|3118|1771|582|666|341|111|143|70|21|22|8|10|4|4|0|0|0" "13350|5670|1233|271|48|14|0|0|0|0|0|0|0|0|0|0|0|0|0|0" "0|0|0|2|2|2|0|0|2|0|1|0|0|0|0|0|0|0|0|0" [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] 374 564 22 378 16 542 84 1042 1786 1236 [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [187,0,0] [282,0,0] [11,0,0] [189,0,0] [8,0,0] [271,0,0] [42,0,0] [521,0,0] [893,0,0] [618,0,0] [13] 41172 [3.16e-04] [20575,9,2] [1511,0,0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [2] [0] NA NA NA NA NA NA NA [7] [3.98e-04] [32] [2.86e-01] [8.10e-03] ["RF|AC0"] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
chr1:12559 ["G","A"] "chr1_12559_G_A" 1.67e+03 {"RF"} [15] [5.08e-03] 2952 1.50e+00 -1.18e-01 False 139640 0.00e+00 -8.47e-02 2.24e+01 -1.98e-01 1.65e+00 0.00e+00 1.74e-01 2.13e+00 "QD" False False ["3|8|7|16|9|13|13|23|10|3|1|4|2|0|1|0|0|0|0|2"] ["30|56|28|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0"] ["0|0|5|10|17|24|12|4|12|0|11|0|0|9|0|2|0|0|0|0"] "5100|6458|2253|3094|1723|523|669|372|104|139|59|21|24|5|9|1|3|0|0|2" "13224|5716|1280|271|52|16|0|0|0|0|0|0|0|0|0|0|0|0|0|0" "0|0|5|10|17|24|12|4|12|0|11|0|0|9|0|2|0|0|0|0" [0] [3] [0] [1] [0] [1] [1] [9] [9] [6] 426 552 22 376 18 564 82 912 1742 1210 [0.00e+00] [5.43e-03] [0.00e+00] [2.66e-03] [0.00e+00] [1.77e-03] [1.22e-02] [9.87e-03] [5.17e-03] [4.96e-03] [213,0,0] [273,3,0] [11,0,0] [187,1,0] [9,0,0] [281,1,0] [40,1,0] [447,9,0] [862,9,0] [599,6,0] [124] 41118 [3.02e-03] [20444,106,9] [1461,15,0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [9] [0] NA NA NA ["SAS"] [9] [912] [9.87e-03] [7] [3.16e-04] [30] [2.86e-01] [7.73e-03] ["RF"] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
chr1:12573 ["T","C"] "chr1_12573_T_C" 3.67e+02 {} [2] [4.80e-04] 4168 1.56e-01 -3.87e-01 False 159660 0.00e+00 -7.56e-02 2.20e+01 -7.20e-01 4.03e+00 4.06e-01 5.91e-01 3.25e+00 "QD" False False ["0|0|0|2|0|0|1|1|0|0|0|1|0|0|0|0|0|0|0|2"] ["2|3|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0"] ["0|0|0|0|0|1|1|0|0|1|1|0|1|1|0|1|0|0|0|0"] "4681|6449|2465|3412|2043|675|838|436|128|204|107|32|44|1|14|0|6|1|1|2" "12932|6401|1677|420|84|22|3|0|0|0|0|0|0|0|0|0|0|0|0|0" "0|0|0|0|0|1|1|0|0|1|1|0|1|1|0|1|0|0|0|0" [0] [2] [0] [0] [0] [0] [0] [0] [0] [2] 622 716 44 528 36 850 124 1248 2348 1820 [0.00e+00] [2.79e-03] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [1.10e-03] [311,0,0] [356,2,0] [22,0,0] [264,0,0] [18,0,0] [425,0,0] [62,0,0] [624,0,0] [1174,0,0] [908,2,0] [7] 43078 [1.62e-04] [21532,7,0] [2082,2,0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] NA NA NA ["AMR"] [2] [716] [2.79e-03] [8] [1.54e-07] [36] [5.33e-01] [2.70e-01] ["PASS"] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
chr1:12586 ["C","T"] "chr1_12586_C_T" 2.24e+02 {} [2] [5.63e-04] 3550 2.93e+00 -1.24e-01 False 149739 0.00e+00 -7.86e-02 2.20e+01 2.89e-01 5.74e+00 2.89e-01 5.57e-01 2.25e+00 "MQ" False False ["0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|2"] ["0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0"] ["0|0|0|0|0|0|0|1|1|0|0|0|0|0|1|0|0|0|0|0"] "4597|6767|2443|3466|1994|601|758|390|126|150|87|21|20|5|8|4|3|0|0|2" "13424|6201|1423|326|52|15|1|0|0|0|0|0|0|0|0|0|0|0|0|0" "0|0|0|0|0|0|0|1|1|0|0|0|0|0|1|0|0|0|0|0" [0] [0] [0] [1] [0] [0] [0] [1] [0] [2] 530 606 34 434 34 692 106 1114 2050 1500 [0.00e+00] [0.00e+00] [0.00e+00] [2.30e-03] [0.00e+00] [0.00e+00] [0.00e+00] [8.98e-04] [0.00e+00] [1.33e-03] [265,0,0] [303,0,0] [17,0,0] [216,1,0] [17,0,0] [346,0,0] [53,0,0] [556,1,0] [1025,0,0] [748,2,0] [3] 42884 [7.00e-05] [21439,3,0] [1773,2,0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] NA NA NA ["EAS"] [1] [434] [2.30e-03] [13] [5.01e-12] [98] [4.21e-01] [3.83e-01] ["PASS"] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
chr1:12596 ["C","A"] "chr1_12596_C_A" 4.48e+01 {"AC0"} [0] [0.00e+00] 2932 2.22e+00 3.61e-01 False 140242 0.00e+00 -8.29e-02 2.20e+01 0.00e+00 4.97e+00 3.61e-01 8.92e-01 2.41e+00 "MQ" False False ["0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|0"] ["0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0"] ["0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0"] "4727|7110|2504|3416|1817|550|609|333|79|110|46|6|19|4|0|3|1|0|0|0" "13901|5942|1199|251|36|5|0|0|0|0|0|0|0|0|0|0|0|0|0|0" "0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0" [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] 490 436 22 360 22 602 78 922 1670 1262 [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [0.00e+00] [245,0,0] [218,0,0] [11,0,0] [180,0,0] [11,0,0] [301,0,0] [39,0,0] [461,0,0] [835,0,0] [631,0,0] [1] 42668 [2.34e-05] [21333,1,0] [1466,0,0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] [0] NA NA NA NA NA NA NA [9] [2.00e-11] [80] [5.56e-01] [4.03e-01] ["AC0"] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

showing top 10 rows

If we do a .describe(), we will see more info about the row fields.

There is only one column field, s, which is the key that maps to the Sample ID. We will add more column fields when we merge in the Pheno Data in the next section.

# View structure of MT before adding pheno data

mt.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        BaseQRankSum: float64, 
        ClippingRankSum: float64, 
        DB: bool, 
        DP: int32, 
        FS: float64, 
        InbreedingCoeff: float64, 
        MQ: float64, 
        MQRankSum: float64, 
        QD: float64, 
        ReadPosRankSum: float64, 
        SOR: float64, 
        VQSLOD: float64, 
        VQSR_culprit: str, 
        VQSR_NEGATIVE_TRAIN_SITE: bool, 
        VQSR_POSITIVE_TRAIN_SITE: bool, 
        GQ_HIST_ALT: array<str>, 
        DP_HIST_ALT: array<str>, 
        AB_HIST_ALT: array<str>, 
        GQ_HIST_ALL: str, 
        DP_HIST_ALL: str, 
        AB_HIST_ALL: str, 
        AC_AFR: array<int32>, 
        AC_AMR: array<int32>, 
        AC_ASJ: array<int32>, 
        AC_EAS: array<int32>, 
        AC_FIN: array<int32>, 
        AC_NFE: array<int32>, 
        AC_OTH: array<int32>, 
        AC_SAS: array<int32>, 
        AC_Male: array<int32>, 
        AC_Female: array<int32>, 
        AN_AFR: int32, 
        AN_AMR: int32, 
        AN_ASJ: int32, 
        AN_EAS: int32, 
        AN_FIN: int32, 
        AN_NFE: int32, 
        AN_OTH: int32, 
        AN_SAS: int32, 
        AN_Male: int32, 
        AN_Female: int32, 
        AF_AFR: array<float64>, 
        AF_AMR: array<float64>, 
        AF_ASJ: array<float64>, 
        AF_EAS: array<float64>, 
        AF_FIN: array<float64>, 
        AF_NFE: array<float64>, 
        AF_OTH: array<float64>, 
        AF_SAS: array<float64>, 
        AF_Male: array<float64>, 
        AF_Female: array<float64>, 
        GC_AFR: array<int32>, 
        GC_AMR: array<int32>, 
        GC_ASJ: array<int32>, 
        GC_EAS: array<int32>, 
        GC_FIN: array<int32>, 
        GC_NFE: array<int32>, 
        GC_OTH: array<int32>, 
        GC_SAS: array<int32>, 
        GC_Male: array<int32>, 
        GC_Female: array<int32>, 
        AC_raw: array<int32>, 
        AN_raw: int32, 
        AF_raw: array<float64>, 
        GC_raw: array<int32>, 
        GC: array<int32>, 
        Hom_AFR: array<int32>, 
        Hom_AMR: array<int32>, 
        Hom_ASJ: array<int32>, 
        Hom_EAS: array<int32>, 
        Hom_FIN: array<int32>, 
        Hom_NFE: array<int32>, 
        Hom_OTH: array<int32>, 
        Hom_SAS: array<int32>, 
        Hom_Male: array<int32>, 
        Hom_Female: array<int32>, 
        Hom_raw: array<int32>, 
        Hom: array<int32>, 
        STAR_AC: int32, 
        STAR_AC_raw: int32, 
        STAR_Hom: int32, 
        POPMAX: array<str>, 
        AC_POPMAX: array<int32>, 
        AN_POPMAX: array<int32>, 
        AF_POPMAX: array<float64>, 
        DP_MEDIAN: array<int32>, 
        DREF_MEDIAN: array<float64>, 
        GQ_MEDIAN: array<int32>, 
        AB_MEDIAN: array<float64>, 
        AS_RF: array<float64>, 
        AS_FilterStatus: array<str>, 
        AS_RF_POSITIVE_TRAIN: array<int32>, 
        AS_RF_NEGATIVE_TRAIN: array<int32>, 
        AC_AFR_Male: array<int32>, 
        AC_AMR_Male: array<int32>, 
        AC_ASJ_Male: array<int32>, 
        AC_EAS_Male: array<int32>, 
        AC_FIN_Male: array<int32>, 
        AC_NFE_Male: array<int32>, 
        AC_OTH_Male: array<int32>, 
        AC_SAS_Male: array<int32>, 
        AC_AFR_Female: array<int32>, 
        AC_AMR_Female: array<int32>, 
        AC_ASJ_Female: array<int32>, 
        AC_EAS_Female: array<int32>, 
        AC_FIN_Female: array<int32>, 
        AC_NFE_Female: array<int32>, 
        AC_OTH_Female: array<int32>, 
        AC_SAS_Female: array<int32>, 
        AN_AFR_Male: int32, 
        AN_AMR_Male: int32, 
        AN_ASJ_Male: int32, 
        AN_EAS_Male: int32, 
        AN_FIN_Male: int32, 
        AN_NFE_Male: int32, 
        AN_OTH_Male: int32, 
        AN_SAS_Male: int32, 
        AN_AFR_Female: int32, 
        AN_AMR_Female: int32, 
        AN_ASJ_Female: int32, 
        AN_EAS_Female: int32, 
        AN_FIN_Female: int32, 
        AN_NFE_Female: int32, 
        AN_OTH_Female: int32, 
        AN_SAS_Female: int32, 
        AF_AFR_Male: array<float64>, 
        AF_AMR_Male: array<float64>, 
        AF_ASJ_Male: array<float64>, 
        AF_EAS_Male: array<float64>, 
        AF_FIN_Male: array<float64>, 
        AF_NFE_Male: array<float64>, 
        AF_OTH_Male: array<float64>, 
        AF_SAS_Male: array<float64>, 
        AF_AFR_Female: array<float64>, 
        AF_AMR_Female: array<float64>, 
        AF_ASJ_Female: array<float64>, 
        AF_EAS_Female: array<float64>, 
        AF_FIN_Female: array<float64>, 
        AF_NFE_Female: array<float64>, 
        AF_OTH_Female: array<float64>, 
        AF_SAS_Female: array<float64>, 
        GC_AFR_Male: array<int32>, 
        GC_AMR_Male: array<int32>, 
        GC_ASJ_Male: array<int32>, 
        GC_EAS_Male: array<int32>, 
        GC_FIN_Male: array<int32>, 
        GC_NFE_Male: array<int32>, 
        GC_OTH_Male: array<int32>, 
        GC_SAS_Male: array<int32>, 
        GC_AFR_Female: array<int32>, 
        GC_AMR_Female: array<int32>, 
        GC_ASJ_Female: array<int32>, 
        GC_EAS_Female: array<int32>, 
        GC_FIN_Female: array<int32>, 
        GC_NFE_Female: array<int32>, 
        GC_OTH_Female: array<int32>, 
        GC_SAS_Female: array<int32>, 
        Hemi_AFR: array<int32>, 
        Hemi_AMR: array<int32>, 
        Hemi_ASJ: array<int32>, 
        Hemi_EAS: array<int32>, 
        Hemi_FIN: array<int32>, 
        Hemi_NFE: array<int32>, 
        Hemi_OTH: array<int32>, 
        Hemi_SAS: array<int32>, 
        Hemi: array<int32>, 
        Hemi_raw: array<int32>, 
        STAR_Hemi: int32
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

We can look at the first few rows and columns of the genotyping table with mt.GT.show(). Note that it includes the locus/alleles column. This is because the row key corresponds to a combination of both locus and alleles.

mt.GT.show()
'sample_1_0'
'sample_1_1'
'sample_1_2'
'sample_1_3'
locus
alleles
GT
GT
GT
GT
locus<GRCh38> array<str> call call call call
chr1:12198 ["G","C"] 0/1 0/0 1/1 0/0
chr1:12237 ["G","A"] 1/1 0/0 0/1 0/0
chr1:12259 ["G","C"] 1/1 0/0 0/1 0/0
chr1:12266 ["G","A"] 0/1 1/1 0/1 0/0
chr1:12272 ["G","A"] 0/0 0/1 0/0 0/1
chr1:12554 ["A","G"] 0/0 0/0 0/0 0/0
chr1:12559 ["G","A"] 0/0 0/0 0/0 0/0
chr1:12573 ["T","C"] 0/0 0/0 0/0 0/0
chr1:12586 ["C","T"] 0/0 0/0 0/0 0/0
chr1:12596 ["C","A"] 0/0 0/0 0/0 0/0

showing top 10 rows

showing the first 4 of 100000 columns

7.5 Load and Merge Pheno Table

Phenotypic data may come from an array of sources, such as a cohort from the Cohort Browser or a separate, stand-alone text file. In this notebook, we use phenotypic data from a CSV file, which was previously uploaded to a project. In this (very basic) example we use the phenotypic trait, is_case, for each sample. The values indicate if the sample is a case, is_case=true, or a control, is_case=false

Note: Although this notebook provides an example using a stand-alone CSV file, another source of phenotypic data could be derived from an Apollo Dataset. Please refer to Table Exporter documentation (https://documentation.dnanexus.com/developer/apps/developing-spark-apps/table-exporter-application) or OpenBio notebooks (https://github.com/dnanexus/OpenBio/tree/master/dx-toolkit) with extract_dataset in the title for guidance on how to extract data from an Apollo Dataset.

All data uploaded to the project before running the JupyterLab app is mounted (https://documentation.dnanexus.com/user/jupyter-notebooks?#accessing-data) and can be accessed in /mnt/project/<path_to_data>. The file URL follows the format: file:///mnt/project/<path_to_data>

# Import the pheno CSV file as a Hail Table

pheno_table = hl.import_table("file:///mnt/project/data/ukbb_100k_bmi_casecontrol.csv",
                              delimiter=',',
                              impute=True,
                              key='iid') # specify the column that will be the key (values must match what is in the MT 's' column)

pheno_table.show()
2023-04-19 22:30:24 Hail: INFO: Reading table to impute column types
2023-04-19 22:30:29 Hail: INFO: Finished type imputation
  Loading field 'fid' as type str (imputed)
  Loading field 'iid' as type str (imputed)
  Loading field 'father_iid' as type int32 (imputed)
  Loading field 'mother_iid' as type int32 (imputed)
  Loading field 'sex_code' as type int32 (imputed)
  Loading field 'case_control_status' as type int32 (imputed)
fid
iid
father_iid
mother_iid
sex_code
case_control_status
str str int32 int32 int32 int32
"sample_1_10" "sample_1_10" 0 0 1 2
"sample_1_10003" "sample_1_10003" 0 0 2 2
"sample_1_10014" "sample_1_10014" 0 0 2 1
"sample_1_10020" "sample_1_10020" 0 0 1 2
"sample_1_10021" "sample_1_10021" 0 0 2 1
"sample_1_10023" "sample_1_10023" 0 0 1 1
"sample_1_10036" "sample_1_10036" 0 0 2 1
"sample_1_10037" "sample_1_10037" 0 0 1 1
"sample_1_10047" "sample_1_10047" 0 0 1 1
"sample_1_10056" "sample_1_10056" 0 0 1 2

showing top 10 rows

One thing I always do when I’m loading in a Hail Table is use .summarize() on the entire Table. This gives me a high level overview of the Table.

pheno_table.summarize()
2023-04-19 18:58:14 Hail: INFO: Coerced sorted dataset

19884 records.

  • fid (str):
      Non-missing 19884 (100.00%)
      Missing 0
      Min Size 10
      Max Size 14
      Mean Size 13.89
      Sample Values ['sample_1_10', 'sample_1_10003', 'sample_1_10014', 'sample_1_10020', 'sample_1_10021']

      iid (str):

      father_iid (int32):

      mother_iid (int32):

      sex_code (int32):

      case_control_status (int32):

      Non-missing 19884 (100.00%)
      Missing 0
      Minimum 1
      Maximum 2
      Mean 1.50
      Std Dev 0.50

We need to modify case_control_status in our table, because it needs to be 0/1 instead of 1/2 to work with Hail’s .logistic_regression() method.

We can do this with .annotate(), which will add a column to our table. We will subtract 1 from both case_control_status and sex_code.

1pheno_fixed = pheno_table.annotate(ccs = pheno_table.case_control_status - 1,
2                                  sc = pheno_table.sex_code - 1)
pheno_fixed.show()
1
Subtract 1 from the Case Control Status
2
Subtract 2 from the Sex Code
2023-04-19 22:30:36 Hail: INFO: Coerced sorted dataset
fid
iid
father_iid
mother_iid
sex_code
case_control_status
ccs
sc
str str int32 int32 int32 int32 int32 int32
"sample_1_10" "sample_1_10" 0 0 1 2 1 0
"sample_1_10003" "sample_1_10003" 0 0 2 2 1 1
"sample_1_10014" "sample_1_10014" 0 0 2 1 0 1
"sample_1_10020" "sample_1_10020" 0 0 1 2 1 0
"sample_1_10021" "sample_1_10021" 0 0 2 1 0 1
"sample_1_10023" "sample_1_10023" 0 0 1 1 0 0
"sample_1_10036" "sample_1_10036" 0 0 2 1 0 1
"sample_1_10037" "sample_1_10037" 0 0 1 1 0 0
"sample_1_10047" "sample_1_10047" 0 0 1 1 0 0
"sample_1_10056" "sample_1_10056" 0 0 1 2 1 0

showing top 10 rows

If we .describe(), we’ll see that we added two row fields: ccs and sc.

# View structure of pheno Table

pheno_fixed.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'fid': str 
    'iid': str 
    'father_iid': int32 
    'mother_iid': int32 
    'sex_code': int32 
    'case_control_status': int32 
    'ccs': int32 
    'sc': int32 
----------------------------------------
Key: ['iid']
----------------------------------------
pheno_fixed.show(10)
2023-04-11 18:45:20 Hail: INFO: Coerced sorted dataset
fid
iid
father_iid
mother_iid
sex_code
case_control_status
ccs
sc
str str int32 int32 int32 int32 int32 int32
"sample_1_10" "sample_1_10" 0 0 1 2 1 0
"sample_1_10003" "sample_1_10003" 0 0 2 2 1 1
"sample_1_10014" "sample_1_10014" 0 0 2 1 0 1
"sample_1_10020" "sample_1_10020" 0 0 1 2 1 0
"sample_1_10021" "sample_1_10021" 0 0 2 1 0 1
"sample_1_10023" "sample_1_10023" 0 0 1 1 0 0
"sample_1_10036" "sample_1_10036" 0 0 2 1 0 1
"sample_1_10037" "sample_1_10037" 0 0 1 1 0 0
"sample_1_10047" "sample_1_10047" 0 0 1 1 0 0
"sample_1_10056" "sample_1_10056" 0 0 1 2 1 0

showing top 10 rows

7.5.1 Annotate MT with pheno Table

# Annotate the MT with pheno Table by matching the MT's column key ('s') with the pheno Table's key ('sample_id')

phenogeno_mt = mt.annotate_cols(**pheno_fixed[mt.s])

If we look at what happened in the merge, we will see that there are missing values in the pheno data. That’s because the pheno file only had 20000 rows. This means we’ll drop a large amount of the population when we actually run the GWAS, because they have no case/control status nor do they have covariates.

phenogeno_mt.col.show(20)
s
fid
father_iid
mother_iid
sex_code
case_control_status
ccs
sc
str str int32 int32 int32 int32 int32 int32
"sample_1_0" NA NA NA NA NA NA NA
"sample_1_1" NA NA NA NA NA NA NA
"sample_1_2" NA NA NA NA NA NA NA
"sample_1_3" "sample_1_3" 0 0 2 1 0 1
"sample_1_4" NA NA NA NA NA NA NA
"sample_1_5" NA NA NA NA NA NA NA
"sample_1_6" NA NA NA NA NA NA NA
"sample_1_7" NA NA NA NA NA NA NA
"sample_1_8" NA NA NA NA NA NA NA
"sample_1_9" NA NA NA NA NA NA NA
"sample_1_10" "sample_1_10" 0 0 1 2 1 0
"sample_1_11" NA NA NA NA NA NA NA
"sample_1_12" NA NA NA NA NA NA NA
"sample_1_13" NA NA NA NA NA NA NA
"sample_1_14" NA NA NA NA NA NA NA
"sample_1_15" NA NA NA NA NA NA NA
"sample_1_16" NA NA NA NA NA NA NA
"sample_1_17" "sample_1_17" 0 0 2 1 0 1
"sample_1_18" NA NA NA NA NA NA NA
"sample_1_19" NA NA NA NA NA NA NA

showing top 20 rows

After merging in the pheno table, we see that we’ve added a number of fields to Column Fields. This includes fid, sex_code, case_control_status and the columns we derived from them.

# View structure of MT after annotating with pheno Table

phenogeno_mt.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'fid': str
    'father_iid': int32
    'mother_iid': int32
    'sex_code': int32
    'case_control_status': int32
    'ccs': int32
    'sc': int32
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        BaseQRankSum: float64, 
        ClippingRankSum: float64, 
        DB: bool, 
        DP: int32, 
        FS: float64, 
        InbreedingCoeff: float64, 
        MQ: float64, 
        MQRankSum: float64, 
        QD: float64, 
        ReadPosRankSum: float64, 
        SOR: float64, 
        VQSLOD: float64, 
        VQSR_culprit: str, 
        VQSR_NEGATIVE_TRAIN_SITE: bool, 
        VQSR_POSITIVE_TRAIN_SITE: bool, 
        GQ_HIST_ALT: array<str>, 
        DP_HIST_ALT: array<str>, 
        AB_HIST_ALT: array<str>, 
        GQ_HIST_ALL: str, 
        DP_HIST_ALL: str, 
        AB_HIST_ALL: str, 
        AC_AFR: array<int32>, 
        AC_AMR: array<int32>, 
        AC_ASJ: array<int32>, 
        AC_EAS: array<int32>, 
        AC_FIN: array<int32>, 
        AC_NFE: array<int32>, 
        AC_OTH: array<int32>, 
        AC_SAS: array<int32>, 
        AC_Male: array<int32>, 
        AC_Female: array<int32>, 
        AN_AFR: int32, 
        AN_AMR: int32, 
        AN_ASJ: int32, 
        AN_EAS: int32, 
        AN_FIN: int32, 
        AN_NFE: int32, 
        AN_OTH: int32, 
        AN_SAS: int32, 
        AN_Male: int32, 
        AN_Female: int32, 
        AF_AFR: array<float64>, 
        AF_AMR: array<float64>, 
        AF_ASJ: array<float64>, 
        AF_EAS: array<float64>, 
        AF_FIN: array<float64>, 
        AF_NFE: array<float64>, 
        AF_OTH: array<float64>, 
        AF_SAS: array<float64>, 
        AF_Male: array<float64>, 
        AF_Female: array<float64>, 
        GC_AFR: array<int32>, 
        GC_AMR: array<int32>, 
        GC_ASJ: array<int32>, 
        GC_EAS: array<int32>, 
        GC_FIN: array<int32>, 
        GC_NFE: array<int32>, 
        GC_OTH: array<int32>, 
        GC_SAS: array<int32>, 
        GC_Male: array<int32>, 
        GC_Female: array<int32>, 
        AC_raw: array<int32>, 
        AN_raw: int32, 
        AF_raw: array<float64>, 
        GC_raw: array<int32>, 
        GC: array<int32>, 
        Hom_AFR: array<int32>, 
        Hom_AMR: array<int32>, 
        Hom_ASJ: array<int32>, 
        Hom_EAS: array<int32>, 
        Hom_FIN: array<int32>, 
        Hom_NFE: array<int32>, 
        Hom_OTH: array<int32>, 
        Hom_SAS: array<int32>, 
        Hom_Male: array<int32>, 
        Hom_Female: array<int32>, 
        Hom_raw: array<int32>, 
        Hom: array<int32>, 
        STAR_AC: int32, 
        STAR_AC_raw: int32, 
        STAR_Hom: int32, 
        POPMAX: array<str>, 
        AC_POPMAX: array<int32>, 
        AN_POPMAX: array<int32>, 
        AF_POPMAX: array<float64>, 
        DP_MEDIAN: array<int32>, 
        DREF_MEDIAN: array<float64>, 
        GQ_MEDIAN: array<int32>, 
        AB_MEDIAN: array<float64>, 
        AS_RF: array<float64>, 
        AS_FilterStatus: array<str>, 
        AS_RF_POSITIVE_TRAIN: array<int32>, 
        AS_RF_NEGATIVE_TRAIN: array<int32>, 
        AC_AFR_Male: array<int32>, 
        AC_AMR_Male: array<int32>, 
        AC_ASJ_Male: array<int32>, 
        AC_EAS_Male: array<int32>, 
        AC_FIN_Male: array<int32>, 
        AC_NFE_Male: array<int32>, 
        AC_OTH_Male: array<int32>, 
        AC_SAS_Male: array<int32>, 
        AC_AFR_Female: array<int32>, 
        AC_AMR_Female: array<int32>, 
        AC_ASJ_Female: array<int32>, 
        AC_EAS_Female: array<int32>, 
        AC_FIN_Female: array<int32>, 
        AC_NFE_Female: array<int32>, 
        AC_OTH_Female: array<int32>, 
        AC_SAS_Female: array<int32>, 
        AN_AFR_Male: int32, 
        AN_AMR_Male: int32, 
        AN_ASJ_Male: int32, 
        AN_EAS_Male: int32, 
        AN_FIN_Male: int32, 
        AN_NFE_Male: int32, 
        AN_OTH_Male: int32, 
        AN_SAS_Male: int32, 
        AN_AFR_Female: int32, 
        AN_AMR_Female: int32, 
        AN_ASJ_Female: int32, 
        AN_EAS_Female: int32, 
        AN_FIN_Female: int32, 
        AN_NFE_Female: int32, 
        AN_OTH_Female: int32, 
        AN_SAS_Female: int32, 
        AF_AFR_Male: array<float64>, 
        AF_AMR_Male: array<float64>, 
        AF_ASJ_Male: array<float64>, 
        AF_EAS_Male: array<float64>, 
        AF_FIN_Male: array<float64>, 
        AF_NFE_Male: array<float64>, 
        AF_OTH_Male: array<float64>, 
        AF_SAS_Male: array<float64>, 
        AF_AFR_Female: array<float64>, 
        AF_AMR_Female: array<float64>, 
        AF_ASJ_Female: array<float64>, 
        AF_EAS_Female: array<float64>, 
        AF_FIN_Female: array<float64>, 
        AF_NFE_Female: array<float64>, 
        AF_OTH_Female: array<float64>, 
        AF_SAS_Female: array<float64>, 
        GC_AFR_Male: array<int32>, 
        GC_AMR_Male: array<int32>, 
        GC_ASJ_Male: array<int32>, 
        GC_EAS_Male: array<int32>, 
        GC_FIN_Male: array<int32>, 
        GC_NFE_Male: array<int32>, 
        GC_OTH_Male: array<int32>, 
        GC_SAS_Male: array<int32>, 
        GC_AFR_Female: array<int32>, 
        GC_AMR_Female: array<int32>, 
        GC_ASJ_Female: array<int32>, 
        GC_EAS_Female: array<int32>, 
        GC_FIN_Female: array<int32>, 
        GC_NFE_Female: array<int32>, 
        GC_OTH_Female: array<int32>, 
        GC_SAS_Female: array<int32>, 
        Hemi_AFR: array<int32>, 
        Hemi_AMR: array<int32>, 
        Hemi_ASJ: array<int32>, 
        Hemi_EAS: array<int32>, 
        Hemi_FIN: array<int32>, 
        Hemi_NFE: array<int32>, 
        Hemi_OTH: array<int32>, 
        Hemi_SAS: array<int32>, 
        Hemi: array<int32>, 
        Hemi_raw: array<int32>, 
        STAR_Hemi: int32
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

We see that the pheno traits have been added in the column fields of the MT

7.6 Filter MT using Sample QC Tables

image.png

7.6.1 Filter sample QC Table

The above figure illustrates the basic process for filtering on Sample QC metrics. We first need to build the sample_qc table called pre_sample_qc_tb. Then we will extract the row fields using the .row() accessor and write it to DNAX using pre_sample_qc_table.write(). (see https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/sample_qc.ipynb for more info)

This table was built using:

pre_sample_qc_tb = hl.sample_qc(mt).cols()
# Define sample QC Table url

sample_qc_url = "dnax://database-GQYV2Bj04bPg9X3KfFyj2jyY/sample_qc.ht"
# Read sample QC Table

pre_sample_qc_tb = hl.read_table(sample_qc_url)
# View structure of sample QC Table

pre_sample_qc_tb.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    's': str 
    'sample_qc': struct {
        dp_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_hom_ref: int64, 
        n_het: int64, 
        n_hom_var: int64, 
        n_non_ref: int64, 
        n_singleton: int64, 
        n_snp: int64, 
        n_insertion: int64, 
        n_deletion: int64, 
        n_transition: int64, 
        n_transversion: int64, 
        n_star: int64, 
        r_ti_tv: float64, 
        r_het_hom_var: float64, 
        r_insertion_deletion: float64
    } 
----------------------------------------
Key: ['s']
----------------------------------------

Let’s plot the call rate across the samples. This will help us decide what our cutoff should be.

from bokeh.io import output_notebook, show
output_notebook()

p = hl.plot.histogram(pre_sample_qc_tb["sample_qc"]["call_rate"])
show(p)
Loading BokehJS ...

Let’s filter for samples that have a call rate greater or equal to 0.99

# Filter sample QC Table using expressions
# Note: Viewing the structure of the sample QC table in from the cell above 
# shows us that the "call_rate" field is within the "sample_qc" struct field

sample_qc_tb = pre_sample_qc_tb.filter(
    pre_sample_qc_tb["sample_qc"]["call_rate"] >= 0.99) 

If we do a count before and after filtering, we’ll see we dropped roughly 50000 participants using this filtering step.

# View number of samples in QC Table before and after filtering
#
# Note: running this cell can be computationally expensive and take
# longer for bigger datasets (this cell can be commented out).

print(f"Num samples before filtering: {pre_sample_qc_tb.count()}")
print(f"Num samples after filtering: {sample_qc_tb.count()}")
Num samples before filtering: 100000
Num samples after filtering: 50062

7.6.2 Filter MT with Sample QC Table

Now we can use .semi_join_cols() to filter our phenogeno_mt with sample_qc_tb.

# Filter the MT using the sample QC Table

qc_mt = phenogeno_mt.semi_join_cols(sample_qc_tb)

7.7 Build Variant QC table from MT, use to filter MT

Similar to the Sample QC, we can calculate a variant QC table for our MatrixTable, and extract it as a separate table and store it in DNAX. Then we can filter on this table, and use .semi_join_rows() to filter our MatrixTable.

7.7.1 4a) Extract Locus QC table

image.png

This table was built using

pre_locus_qc_tb = hl.variant_qc(mt).rows() 

(see https://github.com/dnanexus/OpenBio/blob/master/hail_tutorial/locus_qc.ipynb for more info)

# Define locus QC Table url

locus_qc_url = "dnax://database-GQYV2Bj04bPg9X3KfFyj2jyY/variant_qc.ht"
# Read locus QC Table

pre_locus_qc_tb = hl.read_table(locus_qc_url)
# View structure of locus QC Table

pre_locus_qc_tb.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'locus': locus<GRCh38> 
    'alleles': array<str> 
    'rsid': str 
    'qual': float64 
    'filters': set<str> 
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        BaseQRankSum: float64, 
        ClippingRankSum: float64, 
        DB: bool, 
        DP: int32, 
        FS: float64, 
        InbreedingCoeff: float64, 
        MQ: float64, 
        MQRankSum: float64, 
        QD: float64, 
        ReadPosRankSum: float64, 
        SOR: float64, 
        VQSLOD: float64, 
        VQSR_culprit: str, 
        VQSR_NEGATIVE_TRAIN_SITE: bool, 
        VQSR_POSITIVE_TRAIN_SITE: bool, 
        GQ_HIST_ALT: array<str>, 
        DP_HIST_ALT: array<str>, 
        AB_HIST_ALT: array<str>, 
        GQ_HIST_ALL: str, 
        DP_HIST_ALL: str, 
        AB_HIST_ALL: str, 
        AC_AFR: array<int32>, 
        AC_AMR: array<int32>, 
        AC_ASJ: array<int32>, 
        AC_EAS: array<int32>, 
        AC_FIN: array<int32>, 
        AC_NFE: array<int32>, 
        AC_OTH: array<int32>, 
        AC_SAS: array<int32>, 
        AC_Male: array<int32>, 
        AC_Female: array<int32>, 
        AN_AFR: int32, 
        AN_AMR: int32, 
        AN_ASJ: int32, 
        AN_EAS: int32, 
        AN_FIN: int32, 
        AN_NFE: int32, 
        AN_OTH: int32, 
        AN_SAS: int32, 
        AN_Male: int32, 
        AN_Female: int32, 
        AF_AFR: array<float64>, 
        AF_AMR: array<float64>, 
        AF_ASJ: array<float64>, 
        AF_EAS: array<float64>, 
        AF_FIN: array<float64>, 
        AF_NFE: array<float64>, 
        AF_OTH: array<float64>, 
        AF_SAS: array<float64>, 
        AF_Male: array<float64>, 
        AF_Female: array<float64>, 
        GC_AFR: array<int32>, 
        GC_AMR: array<int32>, 
        GC_ASJ: array<int32>, 
        GC_EAS: array<int32>, 
        GC_FIN: array<int32>, 
        GC_NFE: array<int32>, 
        GC_OTH: array<int32>, 
        GC_SAS: array<int32>, 
        GC_Male: array<int32>, 
        GC_Female: array<int32>, 
        AC_raw: array<int32>, 
        AN_raw: int32, 
        AF_raw: array<float64>, 
        GC_raw: array<int32>, 
        GC: array<int32>, 
        Hom_AFR: array<int32>, 
        Hom_AMR: array<int32>, 
        Hom_ASJ: array<int32>, 
        Hom_EAS: array<int32>, 
        Hom_FIN: array<int32>, 
        Hom_NFE: array<int32>, 
        Hom_OTH: array<int32>, 
        Hom_SAS: array<int32>, 
        Hom_Male: array<int32>, 
        Hom_Female: array<int32>, 
        Hom_raw: array<int32>, 
        Hom: array<int32>, 
        STAR_AC: int32, 
        STAR_AC_raw: int32, 
        STAR_Hom: int32, 
        POPMAX: array<str>, 
        AC_POPMAX: array<int32>, 
        AN_POPMAX: array<int32>, 
        AF_POPMAX: array<float64>, 
        DP_MEDIAN: array<int32>, 
        DREF_MEDIAN: array<float64>, 
        GQ_MEDIAN: array<int32>, 
        AB_MEDIAN: array<float64>, 
        AS_RF: array<float64>, 
        AS_FilterStatus: array<str>, 
        AS_RF_POSITIVE_TRAIN: array<int32>, 
        AS_RF_NEGATIVE_TRAIN: array<int32>, 
        AC_AFR_Male: array<int32>, 
        AC_AMR_Male: array<int32>, 
        AC_ASJ_Male: array<int32>, 
        AC_EAS_Male: array<int32>, 
        AC_FIN_Male: array<int32>, 
        AC_NFE_Male: array<int32>, 
        AC_OTH_Male: array<int32>, 
        AC_SAS_Male: array<int32>, 
        AC_AFR_Female: array<int32>, 
        AC_AMR_Female: array<int32>, 
        AC_ASJ_Female: array<int32>, 
        AC_EAS_Female: array<int32>, 
        AC_FIN_Female: array<int32>, 
        AC_NFE_Female: array<int32>, 
        AC_OTH_Female: array<int32>, 
        AC_SAS_Female: array<int32>, 
        AN_AFR_Male: int32, 
        AN_AMR_Male: int32, 
        AN_ASJ_Male: int32, 
        AN_EAS_Male: int32, 
        AN_FIN_Male: int32, 
        AN_NFE_Male: int32, 
        AN_OTH_Male: int32, 
        AN_SAS_Male: int32, 
        AN_AFR_Female: int32, 
        AN_AMR_Female: int32, 
        AN_ASJ_Female: int32, 
        AN_EAS_Female: int32, 
        AN_FIN_Female: int32, 
        AN_NFE_Female: int32, 
        AN_OTH_Female: int32, 
        AN_SAS_Female: int32, 
        AF_AFR_Male: array<float64>, 
        AF_AMR_Male: array<float64>, 
        AF_ASJ_Male: array<float64>, 
        AF_EAS_Male: array<float64>, 
        AF_FIN_Male: array<float64>, 
        AF_NFE_Male: array<float64>, 
        AF_OTH_Male: array<float64>, 
        AF_SAS_Male: array<float64>, 
        AF_AFR_Female: array<float64>, 
        AF_AMR_Female: array<float64>, 
        AF_ASJ_Female: array<float64>, 
        AF_EAS_Female: array<float64>, 
        AF_FIN_Female: array<float64>, 
        AF_NFE_Female: array<float64>, 
        AF_OTH_Female: array<float64>, 
        AF_SAS_Female: array<float64>, 
        GC_AFR_Male: array<int32>, 
        GC_AMR_Male: array<int32>, 
        GC_ASJ_Male: array<int32>, 
        GC_EAS_Male: array<int32>, 
        GC_FIN_Male: array<int32>, 
        GC_NFE_Male: array<int32>, 
        GC_OTH_Male: array<int32>, 
        GC_SAS_Male: array<int32>, 
        GC_AFR_Female: array<int32>, 
        GC_AMR_Female: array<int32>, 
        GC_ASJ_Female: array<int32>, 
        GC_EAS_Female: array<int32>, 
        GC_FIN_Female: array<int32>, 
        GC_NFE_Female: array<int32>, 
        GC_OTH_Female: array<int32>, 
        GC_SAS_Female: array<int32>, 
        Hemi_AFR: array<int32>, 
        Hemi_AMR: array<int32>, 
        Hemi_ASJ: array<int32>, 
        Hemi_EAS: array<int32>, 
        Hemi_FIN: array<int32>, 
        Hemi_NFE: array<int32>, 
        Hemi_OTH: array<int32>, 
        Hemi_SAS: array<int32>, 
        Hemi: array<int32>, 
        Hemi_raw: array<int32>, 
        STAR_Hemi: int32
    } 
    'variant_qc': struct {
        dp_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        homozygote_count: array<int32>, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_het: int64, 
        n_non_ref: int64, 
        het_freq_hwe: float64, 
        p_value_hwe: float64
    } 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------

Let’s filter for loci that have: - an allele frequency (AF) value between 0.001-0.999, - a Hardy-Weinberg Equilibrium p-value (p_value_hwe) greater or equal to 1e-10, - a call rate (call_rate) greater or equal to 0.9

# Filter QC Table using expressions
# Note: Viewing the structure of the locus QC table in from the cell above 
# shows us that the "AF", "p_value_hwe", and "call_rate" fields are within
# the "variant_qc" struct field.

locus_qc_tb = pre_locus_qc_tb.filter(
    (pre_locus_qc_tb["variant_qc"]["AF"][0] >= 0.001) & 
    (pre_locus_qc_tb["variant_qc"]["AF"][0] <= 0.999) & 
    (pre_locus_qc_tb["variant_qc"]["p_value_hwe"] >= 1e-10) & 
    (pre_locus_qc_tb["variant_qc"]["call_rate"] >= 0.9)
)
# DON'T RUN in class
# View number of loci in QC Table before and after filtering
#
# Note: running this cell can be computationally expensive and take
# longer for bigger datasets (this cell can be commented out).

print(f"Num loci before filtering: {pre_locus_qc_tb.count()}")
print(f"Num loci after filtering: {locus_qc_tb.count()}")
Num loci before filtering: 3403513

Num loci after filtering: 39952

7.7.1.1 4b) Filter MT with variant QC Tables

Now we can use .semi_join_rows() to filter our phenogeno_mt with locus_qc_tb.

# Filter the MT using the locus QC Table

qc_mt = qc_mt.semi_join_rows(locus_qc_tb)

After filtereing, we have reduced the number of loci in our MatrixTable to be 39952 rows.

# DON'T RUN in class
# View MT after QC filters
# 
# Note: running 'mt.rows().count()' or 'mt.cols().count()' can be computationally 
# expensive and take longer for bigger datasets (these lines can be commented out).

print(f"Num partitions: {qc_mt.n_partitions()}")
print(f"Num loci: {qc_mt.rows().count()}")
print(f"Num samples: {qc_mt.cols().count()}")
qc_mt.describe()
Num partitions: 228
Num loci: 39952
2023-04-19 22:33:57 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
Num samples: 50062
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'fid': str
    'father_iid': int32
    'mother_iid': int32
    'sex_code': int32
    'case_control_status': int32
    'ccs': int32
    'sc': int32
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        BaseQRankSum: float64, 
        ClippingRankSum: float64, 
        DB: bool, 
        DP: int32, 
        FS: float64, 
        InbreedingCoeff: float64, 
        MQ: float64, 
        MQRankSum: float64, 
        QD: float64, 
        ReadPosRankSum: float64, 
        SOR: float64, 
        VQSLOD: float64, 
        VQSR_culprit: str, 
        VQSR_NEGATIVE_TRAIN_SITE: bool, 
        VQSR_POSITIVE_TRAIN_SITE: bool, 
        GQ_HIST_ALT: array<str>, 
        DP_HIST_ALT: array<str>, 
        AB_HIST_ALT: array<str>, 
        GQ_HIST_ALL: str, 
        DP_HIST_ALL: str, 
        AB_HIST_ALL: str, 
        AC_AFR: array<int32>, 
        AC_AMR: array<int32>, 
        AC_ASJ: array<int32>, 
        AC_EAS: array<int32>, 
        AC_FIN: array<int32>, 
        AC_NFE: array<int32>, 
        AC_OTH: array<int32>, 
        AC_SAS: array<int32>, 
        AC_Male: array<int32>, 
        AC_Female: array<int32>, 
        AN_AFR: int32, 
        AN_AMR: int32, 
        AN_ASJ: int32, 
        AN_EAS: int32, 
        AN_FIN: int32, 
        AN_NFE: int32, 
        AN_OTH: int32, 
        AN_SAS: int32, 
        AN_Male: int32, 
        AN_Female: int32, 
        AF_AFR: array<float64>, 
        AF_AMR: array<float64>, 
        AF_ASJ: array<float64>, 
        AF_EAS: array<float64>, 
        AF_FIN: array<float64>, 
        AF_NFE: array<float64>, 
        AF_OTH: array<float64>, 
        AF_SAS: array<float64>, 
        AF_Male: array<float64>, 
        AF_Female: array<float64>, 
        GC_AFR: array<int32>, 
        GC_AMR: array<int32>, 
        GC_ASJ: array<int32>, 
        GC_EAS: array<int32>, 
        GC_FIN: array<int32>, 
        GC_NFE: array<int32>, 
        GC_OTH: array<int32>, 
        GC_SAS: array<int32>, 
        GC_Male: array<int32>, 
        GC_Female: array<int32>, 
        AC_raw: array<int32>, 
        AN_raw: int32, 
        AF_raw: array<float64>, 
        GC_raw: array<int32>, 
        GC: array<int32>, 
        Hom_AFR: array<int32>, 
        Hom_AMR: array<int32>, 
        Hom_ASJ: array<int32>, 
        Hom_EAS: array<int32>, 
        Hom_FIN: array<int32>, 
        Hom_NFE: array<int32>, 
        Hom_OTH: array<int32>, 
        Hom_SAS: array<int32>, 
        Hom_Male: array<int32>, 
        Hom_Female: array<int32>, 
        Hom_raw: array<int32>, 
        Hom: array<int32>, 
        STAR_AC: int32, 
        STAR_AC_raw: int32, 
        STAR_Hom: int32, 
        POPMAX: array<str>, 
        AC_POPMAX: array<int32>, 
        AN_POPMAX: array<int32>, 
        AF_POPMAX: array<float64>, 
        DP_MEDIAN: array<int32>, 
        DREF_MEDIAN: array<float64>, 
        GQ_MEDIAN: array<int32>, 
        AB_MEDIAN: array<float64>, 
        AS_RF: array<float64>, 
        AS_FilterStatus: array<str>, 
        AS_RF_POSITIVE_TRAIN: array<int32>, 
        AS_RF_NEGATIVE_TRAIN: array<int32>, 
        AC_AFR_Male: array<int32>, 
        AC_AMR_Male: array<int32>, 
        AC_ASJ_Male: array<int32>, 
        AC_EAS_Male: array<int32>, 
        AC_FIN_Male: array<int32>, 
        AC_NFE_Male: array<int32>, 
        AC_OTH_Male: array<int32>, 
        AC_SAS_Male: array<int32>, 
        AC_AFR_Female: array<int32>, 
        AC_AMR_Female: array<int32>, 
        AC_ASJ_Female: array<int32>, 
        AC_EAS_Female: array<int32>, 
        AC_FIN_Female: array<int32>, 
        AC_NFE_Female: array<int32>, 
        AC_OTH_Female: array<int32>, 
        AC_SAS_Female: array<int32>, 
        AN_AFR_Male: int32, 
        AN_AMR_Male: int32, 
        AN_ASJ_Male: int32, 
        AN_EAS_Male: int32, 
        AN_FIN_Male: int32, 
        AN_NFE_Male: int32, 
        AN_OTH_Male: int32, 
        AN_SAS_Male: int32, 
        AN_AFR_Female: int32, 
        AN_AMR_Female: int32, 
        AN_ASJ_Female: int32, 
        AN_EAS_Female: int32, 
        AN_FIN_Female: int32, 
        AN_NFE_Female: int32, 
        AN_OTH_Female: int32, 
        AN_SAS_Female: int32, 
        AF_AFR_Male: array<float64>, 
        AF_AMR_Male: array<float64>, 
        AF_ASJ_Male: array<float64>, 
        AF_EAS_Male: array<float64>, 
        AF_FIN_Male: array<float64>, 
        AF_NFE_Male: array<float64>, 
        AF_OTH_Male: array<float64>, 
        AF_SAS_Male: array<float64>, 
        AF_AFR_Female: array<float64>, 
        AF_AMR_Female: array<float64>, 
        AF_ASJ_Female: array<float64>, 
        AF_EAS_Female: array<float64>, 
        AF_FIN_Female: array<float64>, 
        AF_NFE_Female: array<float64>, 
        AF_OTH_Female: array<float64>, 
        AF_SAS_Female: array<float64>, 
        GC_AFR_Male: array<int32>, 
        GC_AMR_Male: array<int32>, 
        GC_ASJ_Male: array<int32>, 
        GC_EAS_Male: array<int32>, 
        GC_FIN_Male: array<int32>, 
        GC_NFE_Male: array<int32>, 
        GC_OTH_Male: array<int32>, 
        GC_SAS_Male: array<int32>, 
        GC_AFR_Female: array<int32>, 
        GC_AMR_Female: array<int32>, 
        GC_ASJ_Female: array<int32>, 
        GC_EAS_Female: array<int32>, 
        GC_FIN_Female: array<int32>, 
        GC_NFE_Female: array<int32>, 
        GC_OTH_Female: array<int32>, 
        GC_SAS_Female: array<int32>, 
        Hemi_AFR: array<int32>, 
        Hemi_AMR: array<int32>, 
        Hemi_ASJ: array<int32>, 
        Hemi_EAS: array<int32>, 
        Hemi_FIN: array<int32>, 
        Hemi_NFE: array<int32>, 
        Hemi_OTH: array<int32>, 
        Hemi_SAS: array<int32>, 
        Hemi: array<int32>, 
        Hemi_raw: array<int32>, 
        STAR_Hemi: int32
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

7.8 Run GWAS

Now that we’ve done our filtering, we can now run our GWAS on our filtered MatrixTable.

Quick reminder that our pheno file has a lot of missing values, so we’re going to be dropping a bunch of samples when we do our analysis.

qc_mt.ccs.show()
s
ccs
str int32
"sample_1_1" NA
"sample_1_3" 0
"sample_1_6" NA
"sample_1_9" NA
"sample_1_10" 1
"sample_1_11" NA
"sample_1_12" NA
"sample_1_13" NA
"sample_1_14" NA
"sample_1_15" NA

showing top 10 rows

Additional documentation: https://hail.is/docs/0.2/methods/stats.html#hail.methods.logistic_regression_rows

# Run Hail's logistic regression method

gwas = hl.logistic_regression_rows(test="firth",
                                   y=qc_mt.col.ccs, # the column field of the pheno trait we are looking at ('ccs')
                                   x=qc_mt.GT.n_alt_alleles(), # n_alt_alleles() returns the count of non-reference alleles
                                   covariates=[1, qc_mt.col.sc])
2023-04-19 22:34:55 Hail: WARN: 40104 of 50062 samples have a missing phenotype or covariate.
2023-04-19 22:34:55 Hail: INFO: logistic_regression_rows: running firth on 9958 samples for response variable y,
    with input variable x, and 2 additional covariates...
# View structure of GWAS results Table

gwas.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'locus': locus<GRCh38> 
    'alleles': array<str> 
    'beta': float64 
    'chi_sq_stat': float64 
    'p_value': float64 
    'fit': struct {
        n_iterations: int32, 
        converged: bool, 
        exploded: bool
    } 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------
gwas.show()
fit
locus
alleles
beta
chi_sq_stat
p_value
n_iterations
converged
exploded
locus<GRCh38> array<str> float64 float64 float64 int32 bool bool
chr1:12719 ["G","C"] 1.96e-02 1.35e-02 9.08e-01 3 True False
chr1:13380 ["C","G"] -3.00e-01 9.08e-01 3.41e-01 5 True False
chr1:13494 ["A","G"] -3.33e-01 6.83e-01 4.09e-01 5 True False
chr1:13504 ["G","A"] 1.50e-01 2.84e-01 5.94e-01 5 True False
chr1:17358 ["ACTT","A"] 4.46e-01 1.30e+00 2.54e-01 5 True False
chr1:17373 ["A","G"] -3.88e-02 4.94e-02 8.24e-01 4 True False
chr1:17379 ["G","A"] 4.65e-01 2.92e+00 8.72e-02 4 True False
chr1:17403 ["A","G"] 1.51e-01 1.41e-01 7.07e-01 5 True False
chr1:17408 ["C","G"] -8.03e-02 1.80e-01 6.71e-01 4 True False
chr1:69735 ["A","G"] 7.29e-02 3.18e-02 8.59e-01 5 True False

showing top 10 rows


pandas_gwas = gwas.to_pandas()
pandas_gwas.to_csv("gwas_results.csv")

7.9 Visualize GWAS results

Bokeh is a Python library that is included in this JupyterLab environment- which makes it easy for us to import.

We’ll need the output_notebook and show modules to make our plots.

from bokeh.io import output_notebook, show
output_notebook()
Loading BokehJS ...
qq_plot = hl.plot.qq(gwas.p_value)
show(qq_plot)
2023-04-19 22:12:04 Hail: INFO: Ordering unsorted dataset with network shuffle

We can generate a Manhattan Plot on our p-values using hl.plot.manhattan().

manhattan_plot = hl.plot.manhattan(gwas_tb.p_value)
show(manhattan_plot)

7.10 Annotate GWAS results

Now that we have our GWAS results, we’ll annotate our file using Hail’s experimental DB, which is hosted on Open Data on AWS.

Note that this resource is not available in the RAP region. Here we’re going to add GENCODE annotations to our matrix. There are a lot of other annotations available in the DB as well.

There is the query builder if you have access to the experimental annotations. This will help you write your db.annotate_rows_db() statement.

https://hail.is/docs/0.2/annotation_database_ui.html#database-query

db = hl.experimental.DB(region='us', cloud='aws')
ann_gwas_tb = db.annotate_rows_db(gwas, 'gencode')
ann_gwas_tb.show()
fit
locus
alleles
beta
chi_sq_stat
p_value
n_iterations
converged
exploded
gencode
locus<GRCh38> array<str> float64 float64 float64 int32 bool bool array<struct{ID: str, gene_id: str, gene_name: str, gene_type: str, level: str, type: str, gene_score: str, gene_strand: str, gene_phase: str}>
chr1:12719 ["G","C"] 1.96e-02 1.35e-02 9.08e-01 3 True False [("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")]
chr1:13380 ["C","G"] -3.00e-01 9.08e-01 3.41e-01 5 True False [("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")]
chr1:13494 ["A","G"] -3.33e-01 6.83e-01 4.09e-01 5 True False [("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")]
chr1:13504 ["G","A"] 1.50e-01 2.84e-01 5.94e-01 5 True False [("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")]
chr1:17358 ["ACTT","A"] 4.46e-01 1.30e+00 2.54e-01 5 True False [("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]
chr1:17373 ["A","G"] -3.88e-02 4.94e-02 8.24e-01 4 True False [("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]
chr1:17379 ["G","A"] 4.65e-01 2.92e+00 8.72e-02 4 True False [("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]
chr1:17403 ["A","G"] 1.51e-01 1.41e-01 7.07e-01 5 True False [("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]
chr1:17408 ["C","G"] -8.03e-02 1.80e-01 6.71e-01 4 True False [("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")]
chr1:69735 ["A","G"] 7.29e-02 3.18e-02 8.59e-01 5 True False [("ENSG00000186092.6","ENSG00000186092.6","OR4F5","protein_coding","2","gene",".","+",".")]

showing top 10 rows

On RAP, we do not have access to this resource, as it isn’t available in the UKB RAP region, so we must use the Hail/VEP configuration (note we started with the Hail/VEP features).

hl.vep() requires a config file in JSON format. I have used the one available here: https://documentation.dnanexus.com/user/jupyter-notebooks/dxjupyterlab-spark-cluster#using-vep-with-hail

ann_gwas_tb2 = hl.vep(ann_gwas_tb, "file:///mnt/project/notebooks/config.json")
ann_gwas_tb2.show()
fit
vep
vep_proc_id
locus
alleles
beta
chi_sq_stat
p_value
n_iterations
converged
exploded
gencode
assembly_name
allele_string
ancestral
colocated_variants
context
end
id
input
intergenic_consequences
most_severe_consequence
motif_feature_consequences
regulatory_feature_consequences
seq_region_name
start
strand
transcript_consequences
variant_class
part_idx
block_idx
locus<GRCh38> array<str> float64 float64 float64 int32 bool bool array<struct{ID: str, gene_id: str, gene_name: str, gene_type: str, level: str, type: str, gene_score: str, gene_strand: str, gene_phase: str}> str str str array<struct{aa_allele: str, aa_maf: float64, afr_allele: str, afr_maf: float64, allele_string: str, amr_allele: str, amr_maf: float64, clin_sig: array<str>, end: int32, eas_allele: str, eas_maf: float64, ea_allele: str, ea_maf: float64, eur_allele: str, eur_maf: float64, exac_adj_allele: str, exac_adj_maf: float64, exac_allele: str, exac_afr_allele: str, exac_afr_maf: float64, exac_amr_allele: str, exac_amr_maf: float64, exac_eas_allele: str, exac_eas_maf: float64, exac_fin_allele: str, exac_fin_maf: float64, exac_maf: float64, exac_nfe_allele: str, exac_nfe_maf: float64, exac_oth_allele: str, exac_oth_maf: float64, exac_sas_allele: str, exac_sas_maf: float64, id: str, minor_allele: str, minor_allele_freq: float64, phenotype_or_disease: int32, pubmed: array<int32>, sas_allele: str, sas_maf: float64, somatic: int32, start: int32, strand: int32}> str int32 str str array<struct{allele_num: int32, consequence_terms: array<str>, impact: str, minimised: int32, variant_allele: str}> str array<struct{allele_num: int32, consequence_terms: array<str>, high_inf_pos: str, impact: str, minimised: int32, motif_feature_id: str, motif_name: str, motif_pos: int32, motif_score_change: float64, strand: int32, variant_allele: str}> array<struct{allele_num: int32, biotype: str, consequence_terms: array<str>, impact: str, minimised: int32, regulatory_feature_id: str, variant_allele: str}> str int32 int32 array<struct{allele_num: int32, amino_acids: str, appris: str, biotype: str, canonical: int32, ccds: str, cdna_start: int32, cdna_end: int32, cds_end: int32, cds_start: int32, codons: str, consequence_terms: array<str>, distance: int32, domains: array<struct{db: str, name: str}>, exon: str, gene_id: str, gene_pheno: int32, gene_symbol: str, gene_symbol_source: str, hgnc_id: str, hgvsc: str, hgvsp: str, hgvs_offset: int32, impact: str, intron: str, lof: str, lof_flags: str, lof_filter: str, lof_info: str, minimised: int32, polyphen_prediction: str, polyphen_score: float64, protein_end: int32, protein_start: int32, protein_id: str, sift_prediction: str, sift_score: float64, strand: int32, swissprot: str, transcript_id: str, trembl: str, tsl: int32, uniparc: str, variant_allele: str}> str int32 int32
chr1:12719 ["G","C"] 1.96e-02 1.35e-02 9.08e-01 3 True False [("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")] "GRCh38" "G/C" NA [(NA,NA,NA,NA,"G/C",NA,NA,NA,12719,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs1410641955",NA,NA,NA,NA,NA,NA,NA,12719,1)] NA 12719 "." "chr1 12719 . G C . . GT" NA "splice_region_variant" NA NA "chr1" 12719 1 [(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["intron_variant","non_coding_transcript_variant"],NA,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000450305.2:n.182+22G>C",NA,NA,"MODIFIER","3/5",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"C"),(1,NA,NA,"processed_transcript",1,NA,466,466,NA,NA,NA,["splice_region_variant","non_coding_transcript_exon_variant"],NA,NA,"2/3","ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000456328.2:n.466G>C",NA,NA,"LOW",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"C"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],1685,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"C"),(1,NA,NA,"miRNA",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],4650,NA,NA,"ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"C")] "SNV" 0 0
chr1:13380 ["C","G"] -3.00e-01 9.08e-01 3.41e-01 5 True False [("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")] "GRCh38" "C/G" NA [(NA,NA,NA,NA,"C/G",NA,NA,NA,13380,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs571093408","G",8.20e-03,NA,NA,NA,NA,NA,13380,1)] NA 13380 "." "chr1 13380 . C G . . GT" NA "splice_region_variant" [(1,["TF_binding_site_variant"],"N","MODIFIER",NA,"ENSM00080464946","ENSPFM0257",19,-2.10e-02,-1,"G")] [(1,"promoter_flanking_region",["regulatory_region_variant"],"MODIFIER",NA,"ENSR00001164745","G")] "chr1" 13380 1 [(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["splice_region_variant","intron_variant","non_coding_transcript_variant"],NA,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000450305.2:n.414+6C>G",NA,NA,"LOW","5/5",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"G"),(1,NA,NA,"processed_transcript",1,NA,628,628,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"3/3","ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000456328.2:n.628C>G",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"G"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],1024,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"G"),(1,NA,NA,"miRNA",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3989,NA,NA,"ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"G")] "SNV" 0 0
chr1:13494 ["A","G"] -3.33e-01 6.83e-01 4.09e-01 5 True False [("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")] "GRCh38" "A/G" NA [(NA,NA,NA,NA,"A/G",NA,NA,NA,13494,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs574697788","G",1.40e-03,NA,NA,NA,NA,NA,13494,1)] NA 13494 "." "chr1 13494 . A G . . GT" NA "non_coding_transcript_exon_variant" NA [(1,"CTCF_binding_site",["regulatory_region_variant"],"MODIFIER",NA,"ENSR00000344265","G"),(1,"promoter_flanking_region",["regulatory_region_variant"],"MODIFIER",NA,"ENSR00001164745","G")] "chr1" 13494 1 [(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,456,456,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"6/6","ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000450305.2:n.456A>G",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"G"),(1,NA,NA,"processed_transcript",1,NA,742,742,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"3/3","ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000456328.2:n.742A>G",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"G"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],910,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"G"),(1,NA,NA,"miRNA",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3875,NA,NA,"ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"G")] "SNV" 0 0
chr1:13504 ["G","A"] 1.50e-01 2.84e-01 5.94e-01 5 True False [("ENSG00000223972.5","ENSG00000223972.5","DDX11L1","transcribed_unprocessed_pseudogene","2","gene",".","+",".")] "GRCh38" "G/A" NA [(NA,NA,NA,NA,"G/A",NA,NA,NA,13504,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs199896944",NA,NA,NA,NA,NA,NA,NA,13504,1)] NA 13504 "." "chr1 13504 . G A . . GT" NA "non_coding_transcript_exon_variant" NA [(1,"CTCF_binding_site",["regulatory_region_variant"],"MODIFIER",NA,"ENSR00000344265","A"),(1,"promoter_flanking_region",["regulatory_region_variant"],"MODIFIER",NA,"ENSR00001164745","A")] "chr1" 13504 1 [(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,466,466,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"6/6","ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000450305.2:n.466G>A",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"A"),(1,NA,NA,"processed_transcript",1,NA,752,752,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"3/3","ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102","ENST00000456328.2:n.752G>A",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"A"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],900,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"A"),(1,NA,NA,"miRNA",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3865,NA,NA,"ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"A")] "SNV" 0 0
chr1:17358 ["ACTT","A"] 4.46e-01 1.30e+00 2.54e-01 5 True False [("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")] "GRCh38" "CTT/-" NA [(NA,NA,NA,NA,"CTTCTTCT/CTTCT",NA,NA,NA,17366,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs749387668",NA,NA,NA,NA,NA,NA,NA,17359,1)] NA 17361 "." "chr1 17358 . ACTT A . . GT" NA "non_coding_transcript_exon_variant" NA NA "chr1" 17359 1 [(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3689,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"-"),(1,NA,NA,"processed_transcript",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],2950,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"-"),(1,NA,NA,"unprocessed_pseudogene",1,NA,582,584,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"6/11","ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034","ENST00000488147.1:n.582_584del",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"-"),(1,NA,NA,"miRNA",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],8,NA,NA,"ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"-")] "deletion" 0 0
chr1:17373 ["A","G"] -3.88e-02 4.94e-02 8.24e-01 4 True False [("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")] "GRCh38" "A/G" NA [(NA,NA,NA,NA,"A/G",NA,NA,NA,17373,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs750111615",NA,NA,NA,NA,NA,NA,NA,17373,1)] NA 17373 "." "chr1 17373 . A G . . GT" NA "splice_region_variant" NA NA "chr1" 17373 1 [(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3703,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"G"),(1,NA,NA,"processed_transcript",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],2964,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"G"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["splice_region_variant","intron_variant","non_coding_transcript_variant"],NA,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034","ENST00000488147.1:n.575-5T>C",NA,NA,"LOW","5/10",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"G"),(1,NA,NA,"miRNA",1,NA,64,64,NA,NA,NA,["mature_miRNA_variant"],NA,NA,"1/1","ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039","ENST00000619216.1:n.64T>C",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"G")] "SNV" 0 0
chr1:17379 ["G","A"] 4.65e-01 2.92e+00 8.72e-02 4 True False [("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")] "GRCh38" "G/A" NA [(NA,NA,NA,NA,"G/A",NA,NA,NA,17379,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs754322362",NA,NA,NA,NA,NA,NA,NA,17379,1)] NA 17379 "." "chr1 17379 . G A . . GT" NA "mature_miRNA_variant" NA NA "chr1" 17379 1 [(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3709,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"A"),(1,NA,NA,"processed_transcript",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],2970,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"A"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["intron_variant","non_coding_transcript_variant"],NA,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034","ENST00000488147.1:n.575-11C>T",NA,NA,"MODIFIER","5/10",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"A"),(1,NA,NA,"miRNA",1,NA,58,58,NA,NA,NA,["mature_miRNA_variant"],NA,NA,"1/1","ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039","ENST00000619216.1:n.58C>T",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"A")] "SNV" 0 0
chr1:17403 ["A","G"] 1.51e-01 1.41e-01 7.07e-01 5 True False [("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")] "GRCh38" "A/G" NA [(NA,NA,NA,NA,"A/G",NA,NA,NA,17403,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs111588939",NA,NA,NA,NA,NA,NA,NA,17403,1)] NA 17403 "." "chr1 17403 . A G . . GT" NA "non_coding_transcript_exon_variant" NA NA "chr1" 17403 1 [(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3733,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"G"),(1,NA,NA,"processed_transcript",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],2994,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"G"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["intron_variant","non_coding_transcript_variant"],NA,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034","ENST00000488147.1:n.575-35T>C",NA,NA,"MODIFIER","5/10",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"G"),(1,NA,NA,"miRNA",1,NA,34,34,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"1/1","ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039","ENST00000619216.1:n.34T>C",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"G")] "SNV" 0 0
chr1:17408 ["C","G"] -8.03e-02 1.80e-01 6.71e-01 4 True False [("ENSG00000278267.1","ENSG00000278267.1","MIR6859-1","miRNA","3","gene",".","-","."),("ENSG00000227232.5","ENSG00000227232.5","WASH7P","unprocessed_pseudogene","2","gene",".","-",".")] "GRCh38" "C/G" NA [(NA,NA,NA,NA,"C/G",NA,NA,NA,17408,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs747093451",NA,NA,NA,NA,NA,NA,NA,17408,1)] NA 17408 "." "chr1 17408 . C G . . GT" NA "non_coding_transcript_exon_variant" NA NA "chr1" 17408 1 [(1,NA,NA,"transcribed_unprocessed_pseudogene",NA,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],3738,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000450305",NA,NA,NA,"G"),(1,NA,NA,"processed_transcript",1,NA,NA,NA,NA,NA,NA,["downstream_gene_variant"],2999,NA,NA,"ENSG00000223972",NA,"DDX11L1","HGNC","HGNC:37102",NA,NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,"ENST00000456328",NA,1,NA,"G"),(1,NA,NA,"unprocessed_pseudogene",1,NA,NA,NA,NA,NA,NA,["intron_variant","non_coding_transcript_variant"],NA,NA,NA,"ENSG00000227232",NA,"WASH7P","HGNC","HGNC:38034","ENST00000488147.1:n.575-40G>C",NA,NA,"MODIFIER","5/10",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000488147",NA,NA,NA,"G"),(1,NA,NA,"miRNA",1,NA,29,29,NA,NA,NA,["non_coding_transcript_exon_variant"],NA,NA,"1/1","ENSG00000278267",NA,"MIR6859-1","HGNC","HGNC:50039","ENST00000619216.1:n.29G>C",NA,NA,"MODIFIER",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,-1,NA,"ENST00000619216",NA,NA,NA,"G")] "SNV" 0 0
chr1:69735 ["A","G"] 7.29e-02 3.18e-02 8.59e-01 5 True False [("ENSG00000186092.6","ENSG00000186092.6","OR4F5","protein_coding","2","gene",".","+",".")] "GRCh38" "A/G" NA [(NA,NA,NA,NA,"A/G",NA,NA,NA,69735,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"rs1245478362",NA,NA,NA,NA,NA,NA,NA,69735,1)] NA 69735 "." "chr1 69735 . A G . . GT" NA "synonymous_variant" NA NA "chr1" 69735 1 [(1,"L","P2","protein_coding",NA,"CCDS30547.1",681,681,645,645,"ctA/ctG",["synonymous_variant"],NA,[("Gene3D","1"),("Pfam","PF13853"),("PROSITE_profiles","PS50262"),("PANTHER","PTHR26451"),("PANTHER","PTHR26451"),("Superfamily","SSF81321"),("Transmembrane_helices","TMhelix"),("CDD","cd15226")],"1/1","ENSG00000186092",NA,"OR4F5","HGNC","HGNC:14825","ENST00000335137.4:c.645A>G","ENSP00000334393.3:p.Leu215=",NA,"LOW",NA,NA,NA,NA,NA,NA,NA,NA,215,215,"ENSP00000334393",NA,NA,1,NA,"ENST00000335137",NA,NA,NA,"G"),(1,"L","A2","protein_coding",1,NA,768,768,708,708,"ctA/ctG",["synonymous_variant"],NA,[("Gene3D","1"),("PROSITE_profiles","PS50262"),("Transmembrane_helices","TMhelix"),("Superfamily","SSF81321"),("PANTHER","PTHR26451"),("PANTHER","PTHR26451"),("Pfam","PF13853"),("CDD","cd15226")],"3/3","ENSG00000186092",NA,"OR4F5","HGNC","HGNC:14825","ENST00000641515.2:c.708A>G","ENSP00000493376.2:p.Leu236=",NA,"LOW",NA,NA,NA,NA,NA,NA,NA,NA,236,236,"ENSP00000493376",NA,NA,1,NA,"ENST00000641515",NA,NA,NA,"G")] "SNV" 0 0

showing top 10 rows

7.11 Export GWAS Table, Export MT to BGEN

We can convert our Hail GWAS table and bring it into memory using the .to_pandas() method. Then we can write it to HDFS using .to_csv().

# Convert to Pandas DataFrame
ann_gwas_pd = ann_gwas_tb.to_pandas()
# Save as CSV file to Hadoop File System
ann_gwas_pd.to_csv("gwas-results.csv")

Our gwas-results.csv file is now in the Hadoop File System. Then we can run hdfs get to pull it onto the disk of the Driver node. Then we can use dx upload to get it into project storage.

%%bash

#Fetch results file from Hadoop File System and save to Driver Node Storage
hdfs dfs -get gwas-results.csv
#Upload resutls file to Project Storage
dx upload gwas-results.csv --destination "/users/tladeras/"

If we want to output BGEN files, there is no direct way for the notebook to write data into the project, so we will first write into HDFS (see https://documentation.dnanexus.com/developer/apps/developing-spark-apps#spark-cluster-management-software).

After writing out the BGEN files to HDFS, we can then move the data to the project in the next step as above.

Note that we can also do this in a single step by using dxFUSE in -limitedWrite mode.

Additional documentation: https://hail.is/docs/0.2/methods/impex.html#hail.methods.export_bgen

# Create a set of unique chromosomes found in MT

chr_set = mt.aggregate_rows(hl.agg.collect_as_set(mt.locus.contig))
# Filter MT to a single chromosome and write out as BGEN file to HDFS as a single file for each chromosome in the MT
for chrom in chr_set:
    filtered_mt = hl.filter_intervals(mt, [hl.parse_locus_interval(chrom, reference_genome="GRCh38"),])
    hl.export_bgen(filtered_mt, f"{chrom}")
%%bash
# Copy BGEN files from HDFS to the JupyterLab execution environment file system

hdfs dfs -get ./*.bgen .
%%bash
# Copy SAMPLE files from HDFS to the JupyterLab execution environment file system

hdfs dfs -get ./*.sample .
%%bash
# Upload BGEN and SAMPLE files to project

dx upload *.bgen
dx upload *.sample

Finally, we can save our filtered MatrixTable to dnax so we can use it later.

# Store Table in DNAXc

import dxpy
db_name = "db_mt_test2"
tb_name = "filtered.mt"

# find database ID of newly created database using a dxpy methodc
db_uri = dxpy.find_one_data_object(name=f"{db_name}", classname="database")['id']
url = f"dnax://{db_uri}/{tb_name}"

# Before this step, the Hail Table is just an object in memory. To persist it and be able to access 
# it later, the notebook needs to write it into a persistent filesystem (in this case DNAX).
# See https://hail.is/docs/0.2/hail.Table.html#hail.Table.write for additional documentation.
qc_tb.write(url) # Note: output should describe size of Table (i.e. number of rows, partitions)