Note: This functionality is currently experimental. I’m actively working to get it running on the UK Biobank RAP.

Using xvhelper to download and decode pheno datasets

See the Jupyter Notebook version here

If we are running JupyterLab or RStudio on the RAP/DNAnexus platform, we can use xvhelper to automate downloading and decoding the pheno data.

Install xvhelper if necessary

install.packages("remotes")
remotes::install_github("laderast/xvhelper")

UKB RAP RStudio only

The RStudio version on UKB RAP needs an updated dx-toolkit to use this functionality. You can run the code below to update it and install pandas.

pip3 install dxpy==0.354.0
pip3 install pandas
reticulate::use_python("/usr/bin/python3")

Starting Out

Start with the Datasets

We can first use find_all_datasets() to return a data.frame of all datasets available in our project.

datasets <- find_all_datasets()
datasets
##                                id                          name
## 1 record-G406j8j0x8kzxv3G08k64gVV apollo_ukrap_synth_pheno_100k
##                            project
## 1 project-GY19Qz00Yq34kBPz8jj0XKg0

On UKB RAP, you’ll see that the dataset name follows this convention:

{application_id}_{date_dispensed}.dataset
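As an illustration, a name following this convention can be split back into its parts. The dataset name below is made up for this example; it is not from the project shown here.

```r
# Hypothetical dataset name following the convention above
ds_name <- "app12345_20230401.dataset"

# Drop the ".dataset" suffix, then split on the underscore
parts <- strsplit(sub("\\.dataset$", "", ds_name), "_")[[1]]
application_id <- parts[1]  # "app12345"
date_dispensed <- parts[2]  # "20230401"
```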

We will use the latest dataset, which is the top row (here there is only one dataset in our project, but if you do multiple dispensals in your project you will have multiple datasets). We can also find this by using find_dataset_id(), which will give us the last dataset dispensed:

ds_id <- find_dataset_id()
ds_id
## project-GY19Qz00Yq34kBPz8jj0XKg0:record-G406j8j0x8kzxv3G08k64gVV

Now that we have our project/dataset ID, we can use it to grab metadata. We’ll first fetch the dictionaries for our particular dataset.

## → running dx extract_dataset project-GY19Qz00Yq34kBPz8jj0XKg0:record-G406j8j0x8kzxv3G08k64gVV --dump-dataset-dictionary
##  Data dictionary is downloaded as /Users/tladeras/Code/xvhelper/vignettes/articles/apollo_ukrap_synth_pheno_100k.dataset.data_dictionary.csv
##  Coding dictionary is downloaded as /Users/tladeras/Code/xvhelper/vignettes/articles/apollo_ukrap_synth_pheno_100k.dataset.codings.csv
##  Entity dictionary is downloaded as /Users/tladeras/Code/xvhelper/vignettes/articles/apollo_ukrap_synth_pheno_100k.entity_dictionary.csv

Now that the dictionary files are in our JupyterLab/RStudio storage, we can extract the coding/data dictionary, which we’ll use in our decoding.

codings <- get_coding_table(ds_id)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
head(codings)
## # A tibble: 6 × 9
##   title        ent_field entity name  coding_name code  meaning is_sparse_coding
##   <chr>        <glue>    <chr>  <chr> <chr>       <chr> <chr>   <chr>           
## 1 Workplace v… particip… parti… p226… data_codin… -121  Do not… NA              
## 2 Workplace v… particip… parti… p226… data_codin… -131  Someti… NA              
## 3 Workplace v… particip… parti… p226… data_codin… -141  Often   NA              
## 4 Workplace v… particip… parti… p226… data_codin… 0     Rarely… NA              
## 5 Ever taken … particip… parti… p278… data_codin… 1     Yes     NA              
## 6 Ever taken … particip… parti… p278… data_codin… -1    Do not… NA              
## # ℹ 1 more variable: is_multi_select <chr>

In the next step, we’ll need a list of fields. We can explore the available fields as a searchable table using explore_field_list():

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

Extracting Data

Now that we have the dataset ID, we can extract the data into our RStudio Project. By default, extract_data() will save the data as a file in our current working directory.

fields <- c("participant.eid", "participant.p31", "participant.p41202")
extract_data(ds_id, fields)
## → running dx extract_dataset project-GY19Qz00Yq34kBPz8jj0XKg0:record-G406j8j0x8kzxv3G08k64gVV --fields participant.eid,participant.p31,participant.p41202 -o apollo_ukrap_synth_pheno_100k.data.csv
##  data is now extracted to /Users/tladeras/Code/xvhelper/vignettes/articles/apollo_ukrap_synth_pheno_100k.data.csv

Let’s read the data file in.

ds_name <- get_name_from_full_id(ds_id)

ds_file <- glue::glue("{ds_name}.data.csv")

data <- readr::read_csv(ds_file, show_col_types = FALSE)
head(data)
## # A tibble: 6 × 3
##   participant.eid participant.p31 participant.p41202                            
##   <chr>                     <dbl> <chr>                                         
## 1 sample_100_101                1 "[\"K297\",\"I802\",\"K29\",\"Block K20-K31\"…
## 2 sample_100_11                 0 "[\"I251\",\"K409\",\"I25\",\"Block I20-I25\"…
## 3 sample_100_110                1 "[\"M5456\",\"K635\",\"S7200\",\"K083\",\"M54…
## 4 sample_100_116                0 "[\"R073\",\"Z099\",\"I839\",\"Z305\",\"R07\"…
## 5 sample_100_124                0 "[\"K298\",\"M2557\",\"K29\",\"Block K20-K31\…
## 6 sample_100_126                0 "[\"H259\",\"G473\",\"I839\",\"Z035\",\"K529\…
decoded_data <- data[1:5000,] |>
  decode_single(codings) |>
  decode_multi_large_df(codings) |>
  decode_column_names(codings) 
head(decoded_data)
## # A tibble: 6 × 3
##   participant_id sex    diagnoses_main_icd10                                    
##   <chr>          <chr>  <chr>                                                   
## 1 sample_100_101 Male   K29.7 Gastritis, unspecified|I80.2 Phlebitis and thromb…
## 2 sample_100_11  Female I25.1 Atherosclerotic heart disease|K40.9 Unilateral or…
## 3 sample_100_110 Male   M54.56 Low back pain (Lumbar region)|K63.5 Polyp of col…
## 4 sample_100_116 Female R07.3 Other chest pain|Z09.9 Follow-up examination afte…
## 5 sample_100_124 Female K29.8 Duodenitis|M25.57 Pain in joint (Ankle and foot)|…
## 6 sample_100_126 Female H25.9 Senile cataract, unspecified|G47.3 Sleep apnoea|I…

Reading in Cohort Information

Working with cohorts is very similar to working with the entire dataset. Let’s list the cohorts in our project:

cohorts <- find_all_cohorts()
cohorts
##                                id               name
## 1 record-G5Ky4f008KQZZ6bx0yYz44fB female_control_3.0
## 2 record-G5Ky4Gj08KQYQ4P810fJ8qPp  female_coffee_3.0
##                            project
## 1 project-GY19Qz00Yq34kBPz8jj0XKg0
## 2 project-GY19Qz00Yq34kBPz8jj0XKg0
##                                                     project_record
## 1 project-GY19Qz00Yq34kBPz8jj0XKg0:record-G5Ky4f008KQZZ6bx0yYz44fB
## 2 project-GY19Qz00Yq34kBPz8jj0XKg0:record-G5Ky4Gj08KQYQ4P810fJ8qPp

Once we have the cohort record IDs, we can use extract_data() to extract the cohort data into our project.

fields <- c("participant.eid", "participant.p31", "participant.p41202")
cohort_id <- cohorts$project_record[1]
extract_data(cohort_id, fields)
## → running dx extract_dataset project-GY19Qz00Yq34kBPz8jj0XKg0:record-G5Ky4f008KQZZ6bx0yYz44fB --fields participant.eid,participant.p31,participant.p41202 -o female_control_3.0.data.csv
##  data is now extracted to /Users/tladeras/Code/xvhelper/vignettes/articles/female_control_3.0.data.csv
cohort1 <- readr::read_csv("female_control_3.0.data.csv")
## Rows: 37206 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): participant.eid, participant.p41202
## dbl (1): participant.p31
## 
##  Use `spec()` to retrieve the full column specification for this data.
##  Specify the column types or set `show_col_types = FALSE` to quiet this message.
knitr::kable(cohort1[1:10,])
| participant.eid | participant.p31 | participant.p41202 |
|---|---|---|
| sample_100_11 | 0 | ["I251","K409","I25","Block I20-I25","Chapter IX","K40","Block K40-K46","Chapter XI"] |
| sample_100_124 | 0 | ["K298","M2557","K29","Block K20-K31","Chapter XI","M255","M25","Block M20-M25","Chapter XIII"] |
| sample_100_126 | 0 | ["H259","G473","I839","Z035","K529","H25","Block H25-H28","Chapter VII","G47","Block G40-G47","Chapter VI","I83","Block I80-I89","Chapter IX","Z03","Block Z00-Z13","Chapter XXI","K52","Block K50-K52","Chapter XI"] |
| sample_100_127 | 0 | ["C773","C77","Block C76-C80","Chapter II"] |
| sample_100_138 | 0 | ["I635","I63","Block I60-I69","Chapter IX"] |
| sample_100_141 | 0 | ["R11","Block R10-R19","Chapter XVIII"] |
| sample_100_150 | 0 | ["I849","A090","H021","R103","K529","I84","Block I80-I89","Chapter IX","A09","Block A00-A09","Chapter I","H02","Block H00-H06","Chapter VII","R10","Block R10-R19","Chapter XVIII","K52","Block K50-K52","Chapter XI"] |
| sample_100_155 | 0 | ["B349","H358","K253","Z018","B34","Block B25-B34","Chapter I","H35","Block H30-H36","Chapter VII","K25","Block K20-K31","Chapter XI","Z01","Block Z00-Z13","Chapter XXI"] |
| sample_100_170 | 0 | ["M2556","T840","K083","M255","M25","Block M20-M25","Chapter XIII","T84","Block T80-T88","Chapter XIX","K08","Block K00-K14","Chapter XI"] |
| sample_100_203 | 0 | ["D170","Z038","Z035","J320","K801","N319","N40","D17","Block D10-D36","Chapter II","Z03","Block Z00-Z13","Chapter XXI","J32","Block J30-J39","Chapter X","K80","Block K80-K87","Chapter XI","N31","Block N30-N39","Chapter XIV","Block N40-N51"] |

We can decode our cohort in the same way:

cohort1[1:10,] |>
  decode_single(codings) |>
  decode_multi_purrr(codings) |>
  decode_column_names(codings)
## # A tibble: 10 × 3
##    participant_id sex    diagnoses_main_icd10                                   
##    <chr>          <chr>  <chr>                                                  
##  1 sample_100_11  Female I25.1 Atherosclerotic heart disease|K40.9 Unilateral o…
##  2 sample_100_124 Female K29.8 Duodenitis|M25.57 Pain in joint (Ankle and foot)…
##  3 sample_100_126 Female H25.9 Senile cataract, unspecified|G47.3 Sleep apnoea|…
##  4 sample_100_127 Female C77.3 Axillary and upper limb lymph nodes|C77 Secondar…
##  5 sample_100_138 Female I63.5 Cerebral infarction due to unspecified occlusion…
##  6 sample_100_141 Female R11 Nausea and vomiting|R10-R19 Symptoms and signs inv…
##  7 sample_100_150 Female I84.9 Unspecified haemorrhoids without complication|A0…
##  8 sample_100_155 Female B34.9 Viral infection, unspecified|H35.8 Other specifi…
##  9 sample_100_170 Female M25.56 Pain in joint (Lower leg)|T84.0 Mechanical comp…
## 10 sample_100_203 Female D17.0 Benign lipomatous neoplasm of skin and subcutane…

Limitations

extract_data() is based on dx extract_dataset from the dx-toolkit. There are some limitations with queries, since it uses a shared resource called the Thrift server to extract data. You may get an error if you try to extract more than ~15 fields at once.
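One possible workaround, until then, is to split a long field list into chunks under the limit and call extract_data() once per chunk. This is only a sketch: the batch size of 10 and the field names below are assumptions for illustration, not part of xvhelper, and it needs a RAP session where extract_data() and ds_id are available.

```r
# Sketch: extract a long field list in batches of 10 fields
# (batch size and field names are illustrative assumptions)
all_fields <- c("participant.p31", "participant.p21001", "participant.p41202")

# Group the fields into chunks of at most 10
batches <- split(all_fields, ceiling(seq_along(all_fields) / 10))

for (batch in batches) {
  # Always include the participant ID so the output files can be joined later
  extract_data(ds_id, c("participant.eid", batch))
}
```

Each call produces a separate CSV, which you can then read in and join on participant_id after decoding.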

I am working on functions that run Table Exporter for much larger sets of fields.