5 dx extract_dataset in R
As-Is Software Disclaimer
The content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.
This notebook demonstrates usage of the dx command extract_dataset for:
- Retrieval of Apollo-stored data, as referenced within entities and fields of a Dataset or Cohort object on the platform
- Retrieval of the underlying data dictionary files used to generate a Dataset object on the platform
MIT License applies to this notebook.
5.1 Preparing your environment
5.1.1 Launch spec:
- App name: JupyterLab with Python, R, Stata, ML
- Kernel: R
- Instance type: Spark Cluster - mem1_ssd1_v2_x2, 4 nodes
- Snapshot:
/.Notebook_snapshots/jupyter_snapshot.gz
- Cost: < $0.2
- Runtime: ~10 min
- Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID
5.1.2 Install dxpy
extract_dataset requires dxpy version >= 0.329.0. If running the command from your local environment (i.e., off of the DNAnexus platform), you may also need to install pandas, for example: pip3 install -U dxpy[pandas]
We can confirm which version of dxpy is installed from within the notebook:
system("pip3 show dxpy", intern = TRUE)
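If dxpy turns out to be missing or out of date, one way to upgrade it from within this R session is sketched below (this assumes pip3 is on the PATH of the environment, as it is in the JupyterLab app):

# Hedged sketch: upgrade dxpy (with the pandas extra) via the shell
system("pip3 install -U 'dxpy[pandas]'")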
5.1.3 Install tidyverse for data processing
A quick note: you will need to read the licenses for the tidyverse packages to decide whether you and your group are comfortable with the licensing terms.
If you loaded the snapshot in this project, all of these packages and dependencies are also installed, so you don’t need to install them again.
install.packages(c("readr", "stringr", "dplyr", "glue", "reactable", "janitor", "remotes"))
remotes::install_github("laderast/xvhelper")
Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘bit’, ‘bit64’, ‘progress’, ‘hms’, ‘vroom’, ‘tzdb’, ‘vctrs’, ‘reactR’, ‘snakecase’
Downloading GitHub repo laderast/xvhelper@HEAD
Rcpp (1.0.9 -> 1.0.10) [CRAN]
utf8 (1.2.2 -> 1.2.3 ) [CRAN]
fansi (1.0.3 -> 1.0.4 ) [CRAN]
stringi (1.7.8 -> 1.7.12) [CRAN]
tibble (3.1.8 -> 3.2.0 ) [CRAN]
purrr (0.3.5 -> 1.0.1 ) [CRAN]
cli (3.4.1 -> 3.6.0 ) [CRAN]
timechange (0.1.1 -> 0.2.0 ) [CRAN]
png (NA -> 0.1-8 ) [CRAN]
here (NA -> 1.0.1 ) [CRAN]
RcppTOML (NA -> 0.2.2 ) [CRAN]
tidyr (1.2.1 -> 1.3.0 ) [CRAN]
lubridate (1.9.0 -> 1.9.2 ) [CRAN]
reticulate (NA -> 1.28 ) [CRAN]
Installing 14 packages: Rcpp, utf8, fansi, stringi, tibble, purrr, cli, timechange, png, here, RcppTOML, tidyr, lubridate, reticulate
Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
── R CMD build ─────────────────────────────────────────────────────────────────
* checking for file ‘/tmp/RtmpJskvRy/remotes734410c4e0/laderast-xvhelper-6e0873a/DESCRIPTION’ ... OK
* preparing ‘xvhelper’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘xvhelper_0.0.100.tar.gz’
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
5.1.4 Import packages
library(dplyr)
library(readr)
library(stringr)
library(glue)
library(reactable)
library(xvhelper)
5.1.5 1. Assign environment variables
# The referenced Dataset is private and provided only to demonstrate an example input. The user will need to supply a permissible and valid record-id
# Assign project-id of dataset
# Assign dataset record-id
rid <- 'record-G5Ky4Gj08KQYQ4P810fJ8qPp'
# Assign joint dataset project-id:record-id
dataset <- rid
5.1.6 2. Call “dx extract_dataset” using a supplied dataset
We’ll use the {glue} package to put our bash commands together for dx extract_dataset, and use system() to execute our bash code. glue::glue() has the advantage of not needing to paste() together strings. The string substitution is cleaner.
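For comparison, here is the same shell command assembled with base R’s paste0() (an illustrative aside, not part of the original workflow; note the manual spacing inside each string fragment):

# paste0() requires breaking the string around the variable
paste0("dx extract_dataset ", dataset, " -ddd")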
cmd <- glue::glue("dx extract_dataset {dataset} -ddd")
cmd
Let’s execute our command using system() and then list the files that result using list.files(). We generate three files in the directory in JupyterLab storage:
- dataset_name.codings.csv
- dataset_name.data_dictionary.csv
- dataset_name.entity_dictionary.csv
system(cmd)
list.files()
5.1.6.1 Preview data in the three dictionary (*.csv) files
#codings_file <- system("ls *.codings.csv", intern = TRUE)
codings_file <- list.files(pattern="*.codings.csv")
codings_df <- read_csv(codings_file, show_col_types = FALSE)
head(codings_df)
entity_dict_file <- system("ls *.entity_dictionary.csv", intern=TRUE)
entity_dict_df <- read_csv(entity_dict_file, show_col_types = FALSE)
head(entity_dict_df)
5.1.7 Understanding the Data Dictionary File
The data dictionary is the glue for the entire dataset. It maps:
- Entity to Fields
- Fields to Codings
- Entity to Entity
We’ll use the data dictionary to understand how to build our list of fields, and later, we’ll join it to the codings file to build a list of fields and their coded values.
There are more columns to the data dictionary, but let’s first see the entity, name, title, and type columns:
#data_dict_file <- system("ls *.data_dictionary.csv", intern=TRUE)
data_dict_file <- list.files(pattern="*.data_dictionary.csv")
data_dict_df <- read_csv(data_dict_file, show_col_types = FALSE)

data_dict_df %>%
  select(entity, name, title, type) %>%
  head()
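To make the “Fields to Codings” mapping concrete, here is a rough sketch of that join (for illustration only; the notebook later uses xvhelper::merge_coding_data_dict() for the real merge, and this assumes codings_df and data_dict_df share the coding_name column shown in the previews above):

# Illustrative join: attach each field's coded values via coding_name
data_dict_df %>%
  select(entity, name, title, coding_name) %>%
  left_join(codings_df, by = "coding_name") %>%
  head()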
5.1.8 3. Parse returned metadata and extract entity/field names
Let’s search for some fields. We want the following fields:
- Coffee intake | Instance 0
- Sex (Gender)
- Smoked cigarette or pipe within last hour | Instance 0
We can use the {reactable} package to make a searchable table of the data dictionary. This will help in finding fields.
Note the search box in the top right of the table - when we have many fields, we can use the search box to find fields of interest. Try searching for Coffee intake and see what fields pop up.
data_dict_df <- data_dict_df %>%
  relocate(name, title) %>%
  mutate(ent_field = glue::glue("{entity}.{name}"))

basic_data_dict <- data_dict_df |>
  select(title, name, entity, ent_field, coding_name, is_multi_select, is_sparse_coding)

reactable::reactable(basic_data_dict, searchable = TRUE)
Another strategy for searching fields: we can use grepl within dplyr::filter() to search for fields that match our criteria.
Note we’re chaining the grepl statements with an OR (|).
We’re also concatenating entity and name into a new variable, ent_field, which we’ll use when we specify our list of fields.
filtered_dict <- data_dict_df %>%
  filter(grepl("Coffee type", title) |
           grepl("Sex", title) |
           grepl("Smoked", title) |
           grepl("Age at recruitment", title) |
           grepl("main ICD10", title) |
           grepl("Types of transport", title)
  ) %>%
  arrange(title)

filtered_dict %>%
  select(name, title, ent_field)
Let’s use this subset of fields - we’ll pull the ent_field column, and paste it together into a single comma-delimited string using paste():
field_list <- filtered_dict %>%
  pull(ent_field)

#field_list <- field_list[200:210]
field_list <- paste(field_list, collapse = ",")
field_list
5.1.9 4. Use extracted entity and field names as input to the called function, “dx extract_dataset” and extract data
Again, we’ll use glue() here for cleaner string substitution.
We’ll extract the cohort information to a file called cohort_data.csv and work with this file for the rest of the notebook.
cohort <- "record-G5Ky4Gj08KQYQ4P810fJ8qPp"
cohort_template <- "dx extract_dataset {cohort} --fields {field_list} -o cohort_data.csv"
cmd <- glue::glue(cohort_template)
cmd
system(cmd)
5.1.9.1 Preview data in the retrieved data file
We’ll see that the retrieved data contains the integer and character codes. These must be decoded (see below):
data_df <- read_csv("cohort_data.csv", show_col_types = FALSE)
head(data_df)
5.2 Decoding columns with xvhelper
xvhelper is a small R package that translates the coded values in the returned data into their actual values.
To use it, you build a coded_col_df using merge_coding_data_dict(), translate the categorical columns to values using decode_single() and decode_multi_purrr(), and then change the column names to R-friendly clean ones using decode_column_names().
Note that we need to decode the values before we decode the column names.
#install via remotes::install_github()
#install.packages("remotes")
remotes::install_github("laderast/xvhelper")

library(xvhelper)
coded_col_df <- xvhelper::merge_coding_data_dict(coding_dict = codings_df, data_dict = data_dict_df)

decoded <- data_df %>%
  xvhelper::decode_single(coded_col_df) |>
  xvhelper::decode_multi_purrr(coded_col_df) |>
  xvhelper::decode_column_names(coded_col_df, r_clean_names = FALSE)

head(decoded)
write.csv(decoded, file="cohort_decoded.csv")
5.2.1 Save Output to Project
system("dx upload *.csv --destination /users/tladeras/")
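To confirm the upload worked (a hedged check; it assumes the /users/tladeras/ folder exists in the current project and that you have permission to read it), you can list the destination folder with dx ls:

# List files at the upload destination in the project
system("dx ls /users/tladeras/", intern = TRUE)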