vignettes/articles/downloading-pheno-data.Rmd
Note: This functionality is currently experimental. I am actively working to support it on the UK Biobank RAP.
# Using `xvhelper` to download and decode pheno datasets
See the Jupyter Notebook version here
If we are running JupyterLab or RStudio on the RAP/DNAnexus platform, we can use `xvhelper` to automate downloading and decoding the pheno data.
Install `xvhelper` if necessary:

```r
install.packages("remotes")
remotes::install_github("laderast/xvhelper")
```
The RStudio version on UKB RAP needs an updated dx-toolkit to use this functionality. You can run the code below to update it and install pandas.
```r
reticulate::use_python("/usr/bin/python3")
```
We can first use `find_all_datasets()` to return a `data.frame` of all datasets available in our project.
```r
datasets <- find_all_datasets()
datasets
##                                id                          name
## 1 record-G406j8j0x8kzxv3G08k64gVV apollo_ukrap_synth_pheno_100k
##                            project
## 1 project-GY19Qz00Yq34kBPz8jj0XKg0
```
On UKB RAP, you'll see that the dataset name follows this convention:

```
{application_id}_{date_dispensed}.dataset
```
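For illustration only (the name `12345_20230401.dataset` below is made up to match that convention), the two parts can be pulled apart with base R:

```r
# Hypothetical dataset name following {application_id}_{date_dispensed}.dataset
ds_name <- "12345_20230401.dataset"
strsplit(sub("\\.dataset$", "", ds_name), "_", fixed = TRUE)[[1]]
## [1] "12345"    "20230401"
```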
We will use the latest dataset, which is the top row (well, there is only one dataset in our project, but if you do multiple dispensals in your project you will have multiple datasets). We can also find this by using `find_dataset_id()`, which will give us the last dataset dispensed:
```r
ds_id <- find_dataset_id()
ds_id
## project-GY19Qz00Yq34kBPz8jj0XKg0:record-G406j8j0x8kzxv3G08k64gVV
```
Now that we have our project/dataset ID, we can use it to grab metadata. We'll first fetch the dictionaries for our particular dataset.
```r
get_dictionaries(ds_id)
## → running dx extract_dataset project-GY19Qz00Yq34kBPz8jj0XKg0:record-G406j8j0x8kzxv3G08k64gVV --dump-dataset-dictionary
## ✔ Data dictionary is downloaded as /Users/tladeras/Code/xvhelper/vignettes/articles/apollo_ukrap_synth_pheno_100k.dataset.data_dictionary.csv
## ✔ Coding dictionary is downloaded as /Users/tladeras/Code/xvhelper/vignettes/articles/apollo_ukrap_synth_pheno_100k.dataset.codings.csv
## ✔ Entity dictionary is downloaded as /Users/tladeras/Code/xvhelper/vignettes/articles/apollo_ukrap_synth_pheno_100k.entity_dictionary.csv
```
Now that we have the dictionary files in our JupyterLab/RStudio storage, we can extract the coding/data dictionary, which we'll use in our decoding.
```r
codings <- get_coding_table(ds_id)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
```
```r
head(codings)
## # A tibble: 6 × 9
##   title        ent_field entity name  coding_name code  meaning is_sparse_coding
##   <chr>        <glue>    <chr>  <chr> <chr>       <chr> <chr>   <chr>
## 1 Workplace v… particip… parti… p226… data_codin… -121  Do not… NA
## 2 Workplace v… particip… parti… p226… data_codin… -131  Someti… NA
## 3 Workplace v… particip… parti… p226… data_codin… -141  Often   NA
## 4 Workplace v… particip… parti… p226… data_codin… 0     Rarely… NA
## 5 Ever taken … particip… parti… p278… data_codin… 1     Yes     NA
## 6 Ever taken … particip… parti… p278… data_codin… -1    Do not… NA
## # ℹ 1 more variable: is_multi_select <chr>
```
In the next step, we'll need a list of fields. We can explore the list of fields as a searchable table using `explore_field_list()`:
```r
explore_field_list(ds_id)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
```
Now that we have the dataset ID, we can extract the data into our RStudio project. By default, `extract_data()` will save the data as a file in our current working directory.
```r
fields <- c("participant.eid", "participant.p31", "participant.p41202")
extract_data(ds_id, fields)
## → running dx extract_dataset project-GY19Qz00Yq34kBPz8jj0XKg0:record-G406j8j0x8kzxv3G08k64gVV --fields participant.eid,participant.p31,participant.p41202 -o apollo_ukrap_synth_pheno_100k.data.csv
## ✔ data is now extracted to /Users/tladeras/Code/xvhelper/vignettes/articles/apollo_ukrap_synth_pheno_100k.data.csv
```
Let's read the data file in.
```r
ds_name <- get_name_from_full_id(ds_id)
ds_file <- glue::glue("{ds_name}.data.csv")

data <- readr::read_csv(ds_file, show_col_types = FALSE)
head(data)
## # A tibble: 6 × 3
##   participant.eid participant.p31 participant.p41202
##   <chr>                     <dbl> <chr>
## 1 sample_100_101                1 "[\"K297\",\"I802\",\"K29\",\"Block K20-K31\"…
## 2 sample_100_11                 0 "[\"I251\",\"K409\",\"I25\",\"Block I20-I25\"…
## 3 sample_100_110                1 "[\"M5456\",\"K635\",\"S7200\",\"K083\",\"M54…
## 4 sample_100_116                0 "[\"R073\",\"Z099\",\"I839\",\"Z305\",\"R07\"…
## 5 sample_100_124                0 "[\"K298\",\"M2557\",\"K29\",\"Block K20-K31\…
## 6 sample_100_126                0 "[\"H259\",\"G473\",\"I839\",\"Z035\",\"K529\…
```
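Notice that `participant.p41202` stores several ICD10 codes in a single bracketed string. As a minimal illustration of what the multi-value decoding has to deal with (this is not `xvhelper`'s implementation, and the value below is truncated for display), one such string can be split into individual codes with base R:

```r
# Strip the brackets and quotes, then split on commas to get a
# character vector of raw ICD10 codes.
raw_value <- "[\"K297\",\"I802\",\"K29\"]"
strsplit(gsub("[][\"]", "", raw_value), ",")[[1]]
## [1] "K297" "I802" "K29"
```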
```r
decoded_data <- data[1:5000,] |>
  decode_single(codings) |>
  decode_multi_large_df(codings) |>
  decode_column_names(codings)

head(decoded_data)
## # A tibble: 6 × 3
##   participant_id sex    diagnoses_main_icd10
##   <chr>          <chr>  <chr>
## 1 sample_100_101 Male   K29.7 Gastritis, unspecified|I80.2 Phlebitis and thromb…
## 2 sample_100_11  Female I25.1 Atherosclerotic heart disease|K40.9 Unilateral or…
## 3 sample_100_110 Male   M54.56 Low back pain (Lumbar region)|K63.5 Polyp of col…
## 4 sample_100_116 Female R07.3 Other chest pain|Z09.9 Follow-up examination afte…
## 5 sample_100_124 Female K29.8 Duodenitis|M25.57 Pain in joint (Ankle and foot)|…
## 6 sample_100_126 Female H25.9 Senile cataract, unspecified|G47.3 Sleep apnoea|I…
```
Working with cohorts is very similar to working with the entire dataset. Let’s list the cohorts in our project:
```r
cohorts <- find_all_cohorts()
cohorts
##                                id               name
## 1 record-G5Ky4f008KQZZ6bx0yYz44fB female_control_3.0
## 2 record-G5Ky4Gj08KQYQ4P810fJ8qPp  female_coffee_3.0
##                            project
## 1 project-GY19Qz00Yq34kBPz8jj0XKg0
## 2 project-GY19Qz00Yq34kBPz8jj0XKg0
##                                                     project_record
## 1 project-GY19Qz00Yq34kBPz8jj0XKg0:record-G5Ky4f008KQZZ6bx0yYz44fB
## 2 project-GY19Qz00Yq34kBPz8jj0XKg0:record-G5Ky4Gj08KQYQ4P810fJ8qPp
```
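If you would rather not depend on row order, the record can also be picked out by name, using ordinary data.frame subsetting on the `cohorts` object above (a small convenience sketch, not an `xvhelper` function):

```r
# Select a cohort's project/record ID by its name rather than by row index
cohort_id <- cohorts$project_record[cohorts$name == "female_control_3.0"]
```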
Once we have the cohort record IDs, we can use `extract_data()` to extract the cohorts to our project.
```r
fields <- c("participant.eid", "participant.p31", "participant.p41202")
cohort_id <- cohorts$project_record[1]

extract_data(cohort_id, fields)
## → running dx extract_dataset project-GY19Qz00Yq34kBPz8jj0XKg0:record-G5Ky4f008KQZZ6bx0yYz44fB --fields participant.eid,participant.p31,participant.p41202 -o female_control_3.0.data.csv
## ✔ data is now extracted to /Users/tladeras/Code/xvhelper/vignettes/articles/female_control_3.0.data.csv
```
```r
cohort1 <- readr::read_csv("female_control_3.0.data.csv")
## Rows: 37206 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): participant.eid, participant.p41202
## dbl (1): participant.p31
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```
```r
knitr::kable(cohort1[1:10,])
```

| participant.eid | participant.p31 | participant.p41202 |
|---|---|---|
| sample_100_11 | 0 | ["I251","K409","I25","Block I20-I25","Chapter IX","K40","Block K40-K46","Chapter XI"] |
| sample_100_124 | 0 | ["K298","M2557","K29","Block K20-K31","Chapter XI","M255","M25","Block M20-M25","Chapter XIII"] |
| sample_100_126 | 0 | ["H259","G473","I839","Z035","K529","H25","Block H25-H28","Chapter VII","G47","Block G40-G47","Chapter VI","I83","Block I80-I89","Chapter IX","Z03","Block Z00-Z13","Chapter XXI","K52","Block K50-K52","Chapter XI"] |
| sample_100_127 | 0 | ["C773","C77","Block C76-C80","Chapter II"] |
| sample_100_138 | 0 | ["I635","I63","Block I60-I69","Chapter IX"] |
| sample_100_141 | 0 | ["R11","Block R10-R19","Chapter XVIII"] |
| sample_100_150 | 0 | ["I849","A090","H021","R103","K529","I84","Block I80-I89","Chapter IX","A09","Block A00-A09","Chapter I","H02","Block H00-H06","Chapter VII","R10","Block R10-R19","Chapter XVIII","K52","Block K50-K52","Chapter XI"] |
| sample_100_155 | 0 | ["B349","H358","K253","Z018","B34","Block B25-B34","Chapter I","H35","Block H30-H36","Chapter VII","K25","Block K20-K31","Chapter XI","Z01","Block Z00-Z13","Chapter XXI"] |
| sample_100_170 | 0 | ["M2556","T840","K083","M255","M25","Block M20-M25","Chapter XIII","T84","Block T80-T88","Chapter XIX","K08","Block K00-K14","Chapter XI"] |
| sample_100_203 | 0 | ["D170","Z038","Z035","J320","K801","N319","N40","D17","Block D10-D36","Chapter II","Z03","Block Z00-Z13","Chapter XXI","J32","Block J30-J39","Chapter X","K80","Block K80-K87","Chapter XI","N31","Block N30-N39","Chapter XIV","Block N40-N51"] |
We can decode our cohort in the same way:
```r
cohort1[1:10,] |>
  decode_single(codings) |>
  decode_multi_purrr(codings) |>
  decode_column_names(codings)
## # A tibble: 10 × 3
##    participant_id sex    diagnoses_main_icd10
##    <chr>          <chr>  <chr>
##  1 sample_100_11  Female I25.1 Atherosclerotic heart disease|K40.9 Unilateral o…
##  2 sample_100_124 Female K29.8 Duodenitis|M25.57 Pain in joint (Ankle and foot)…
##  3 sample_100_126 Female H25.9 Senile cataract, unspecified|G47.3 Sleep apnoea|…
##  4 sample_100_127 Female C77.3 Axillary and upper limb lymph nodes|C77 Secondar…
##  5 sample_100_138 Female I63.5 Cerebral infarction due to unspecified occlusion…
##  6 sample_100_141 Female R11 Nausea and vomiting|R10-R19 Symptoms and signs inv…
##  7 sample_100_150 Female I84.9 Unspecified haemorrhoids without complication|A0…
##  8 sample_100_155 Female B34.9 Viral infection, unspecified|H35.8 Other specifi…
##  9 sample_100_170 Female M25.56 Pain in joint (Lower leg)|T84.0 Mechanical comp…
## 10 sample_100_203 Female D17.0 Benign lipomatous neoplasm of skin and subcutane…
```
`extract_data()` is based on `dx extract_dataset` from the dx-toolkit. There are some limitations with queries, since it uses a shared resource called the Thrift server to extract data. You may get an error if you try to extract more than ~15 fields at once.
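One possible workaround, sketched below (the field IDs are made up, and the extraction calls are commented out because they only run on the platform), is to split a long field vector into chunks that stay under the limit, call `extract_data()` once per chunk, and join the resulting files on `participant.eid`:

```r
# Hypothetical long list of fields, split into chunks of at most 14,
# leaving room for participant.eid in each extraction call.
all_fields <- paste0("participant.p", c(31, 41202, 21001, 21002))
chunks <- split(all_fields, ceiling(seq_along(all_fields) / 14))

# for (ch in chunks) {
#   extract_data(ds_id, c("participant.eid", ch))
# }
# The resulting CSVs could then be read in and merged by participant.eid,
# e.g. with Reduce(function(x, y) merge(x, y, by = "participant.eid"), dfs).
```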
I am working on functions that run Table Exporter for much larger sets of fields.