5  dx extract_dataset in R


As-Is Software Disclaimer

The content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.


This notebook demonstrates usage of the dx command extract_dataset for:

  • Retrieval of Apollo-stored data, as referenced within entities and fields of a Dataset or Cohort object on the platform
  • Retrieval of the underlying data dictionary files used to generate a Dataset object on the platform

MIT License applies to this notebook.

5.1 Preparing your environment

5.1.1 Launch spec:

  • App name: JupyterLab with Python, R, Stata, ML
  • Kernel: R
  • Instance type: Spark Cluster - mem1_ssd1_v2_x2, 4 nodes
  • Snapshot: /.Notebook_snapshots/jupyter_snapshot.gz
  • Cost: < $0.2
  • Runtime: ~10 min
  • Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID

5.1.2 Install dxpy

extract_dataset requires dxpy version >= 0.329.0. If running the command from your local environment (i.e., off of the DNAnexus platform), you may also need to install pandas, for example: pip3 install -U dxpy[pandas]

system("pip3 show dxpy", intern = TRUE)

5.1.3 Install tidyverse for data processing

Quick note: you will need to read the tidyverse package licenses to determine whether you and your group are comfortable with their terms.

If you loaded the snapshot in this project, all of these packages and dependencies are also installed, so you don’t need to install them again.

install.packages(c("readr", "stringr", "dplyr", "glue", "reactable", "janitor", "remotes"))
remotes::install_github("laderast/xvhelper")
Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘bit’, ‘bit64’, ‘progress’, ‘hms’, ‘vroom’, ‘tzdb’, ‘vctrs’, ‘reactR’, ‘snakecase’


Downloading GitHub repo laderast/xvhelper@HEAD
Rcpp       (1.0.9 -> 1.0.10) [CRAN]
utf8       (1.2.2 -> 1.2.3 ) [CRAN]
fansi      (1.0.3 -> 1.0.4 ) [CRAN]
stringi    (1.7.8 -> 1.7.12) [CRAN]
tibble     (3.1.8 -> 3.2.0 ) [CRAN]
purrr      (0.3.5 -> 1.0.1 ) [CRAN]
cli        (3.4.1 -> 3.6.0 ) [CRAN]
timechange (0.1.1 -> 0.2.0 ) [CRAN]
png        (NA    -> 0.1-8 ) [CRAN]
here       (NA    -> 1.0.1 ) [CRAN]
RcppTOML   (NA    -> 0.2.2 ) [CRAN]
tidyr      (1.2.1 -> 1.3.0 ) [CRAN]
lubridate  (1.9.0 -> 1.9.2 ) [CRAN]
reticulate (NA    -> 1.28  ) [CRAN]
Installing 14 packages: Rcpp, utf8, fansi, stringi, tibble, purrr, cli, timechange, png, here, RcppTOML, tidyr, lubridate, reticulate

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
── R CMD build ─────────────────────────────────────────────────────────────────
* checking for file ‘/tmp/RtmpJskvRy/remotes734410c4e0/laderast-xvhelper-6e0873a/DESCRIPTION’ ... OK
* preparing ‘xvhelper’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘xvhelper_0.0.100.tar.gz’
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

5.1.4 Import packages

library(dplyr)
library(readr)
library(stringr)
library(glue)
library(reactable)
library(xvhelper)

5.1.5 1. Assign input variables

# The referenced Dataset is private and provided only to demonstrate an
# example input. You will need to supply a permissible and valid record-id.
# Assign dataset record-id
rid <- 'record-G5Ky4Gj08KQYQ4P810fJ8qPp'
# The dataset may be referenced by record-id alone or by the fully
# qualified "project-id:record-id" form (see the sketch below)
dataset <- rid
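If the record lives in another project, the fully qualified form can be assembled with glue(); the project-id below is a hypothetical placeholder:

# Hypothetical placeholder; substitute your own project-id before running
# pid <- "project-XXXXXXXXXXXXXXXXXXXXXXXX"
# dataset <- glue::glue("{pid}:{rid}")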

5.1.6 2. Call “dx extract_dataset” using a supplied dataset

We’ll use the {glue} package to assemble our shell commands for dx extract_dataset, and system() to execute them.

glue::glue() has the advantage of not needing to paste() together strings. The string substitution is cleaner.
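For comparison, here is the same command string assembled both ways; the glue() version is the one we use below:

# Same command assembled with paste0() and with glue()
cmd_paste <- paste0("dx extract_dataset ", dataset, " -ddd")
cmd_glue <- glue::glue("dx extract_dataset {dataset} -ddd")
identical(cmd_paste, as.character(cmd_glue)) # TRUE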

cmd <- glue::glue("dx extract_dataset {dataset} -ddd")

cmd

Let’s execute our command using system() and then list the resulting files using list.files(). The command generates three files in the current directory in JupyterLab storage:

  • dataset_name.codings.csv
  • dataset_name.data_dictionary.csv
  • dataset_name.entity_dictionary.csv

system(cmd)
list.files()
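As a quick sanity check (a sketch assuming the default output naming above), we can confirm all three dictionary files were written:

# Confirm the three dictionary files exist in the working directory
dict_files <- list.files(pattern = "\\.(codings|data_dictionary|entity_dictionary)\\.csv$")
length(dict_files) == 3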

5.1.6.1 Preview data in the three dictionary (*.csv) files

codings_file <- list.files(pattern = "\\.codings\\.csv$")
codings_df <- read_csv(codings_file, show_col_types = FALSE)
head(codings_df)
entity_dict_file <- list.files(pattern = "\\.entity_dictionary\\.csv$")
entity_dict_df <- read_csv(entity_dict_file, show_col_types = FALSE)
head(entity_dict_df)

5.1.7 Understanding the Data Dictionary File

The data dictionary is the glue for the entire dataset. It maps:

  • Entity to Fields
  • Fields to Codings
  • Entity to Entity

We’ll use the data dictionary to understand how to build our list of fields, and later we’ll join it to the codings file to build a list of fields and their coded values.

There are more columns in the data dictionary, but let’s first look at the entity, name, title, and type columns:

data_dict_file <- list.files(pattern = "\\.data_dictionary\\.csv$")
data_dict_df <- read_csv(data_dict_file, show_col_types = FALSE)

data_dict_df %>%
        select(entity, name, title, type) %>%
        head()
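As a preview of that join (a sketch using plain dplyr, assuming the codings file carries code and meaning columns alongside coding_name; the xvhelper package used below does this for us via merge_coding_data_dict()):

# Sketch: link each coded field to its code/meaning pairs via coding_name
data_dict_df %>%
    inner_join(codings_df, by = "coding_name") %>%
    select(entity, name, title, coding_name, code, meaning) %>%
    head()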

5.1.8 3. Parse returned metadata and extract entity/field names

Let’s search for some fields. We want the following fields:

  • Coffee intake | Instance 0
  • Sex (Gender)
  • Smoked cigarette or pipe within last hour | Instance 0

We can use the {reactable} package to make a searchable table of the data dictionary. This will help in finding fields.

Note the search box at the top right of the table; when we have many fields, we can use it to find fields of interest. Try searching for Coffee intake and see what fields pop up.

data_dict_df <- data_dict_df %>%
    relocate(name, title) %>%
    mutate(ent_field = glue::glue("{entity}.{name}"))

basic_data_dict <- data_dict_df |>
                    select(title, name, entity, ent_field, coding_name, is_multi_select, is_sparse_coding)

reactable::reactable(basic_data_dict, searchable = TRUE)

Another strategy for searching fields: we can use grepl within dplyr::filter() to search for fields that match our criteria.

Note we’re chaining the grepl statements with an OR |.

We’re also concatenating entity and name to a new variable, ent_field, which we’ll use when we specify our list of fields.

filtered_dict <- data_dict_df %>%
    filter(grepl("Coffee type", title) | 
           grepl("Sex", title) | 
           grepl("Smoked", title) | 
           grepl("Age at recruitment", title) |
           grepl("main ICD10", title) |
           grepl("Types of transport", title)
          ) %>%
    arrange(title) 

filtered_dict %>%
    select(name, title, ent_field)
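The same filter can be written more compactly with stringr (loaded above), collapsing the search terms into a single regular expression:

# Equivalent filter: one str_detect() over an alternation of the terms
terms <- c("Coffee type", "Sex", "Smoked", "Age at recruitment",
           "main ICD10", "Types of transport")
data_dict_df %>%
    filter(str_detect(title, paste(terms, collapse = "|"))) %>%
    arrange(title) %>%
    select(name, title, ent_field)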

Let’s use this subset of fields: we’ll pull the ent_field column and paste it together into a single comma-delimited string using paste():

field_list <- filtered_dict %>%
    pull(ent_field)

field_list <- paste(field_list, collapse = ",")
field_list

5.1.9 4. Use extracted entity and field names as input to dx extract_dataset and extract data

Again, we’ll use glue() here for cleaner string substitution.

We’ll extract the cohort information to a file called cohort_data.csv and work with this file for the rest of the notebook.

cohort <- "record-G5Ky4Gj08KQYQ4P810fJ8qPp"
cohort_template <- "dx extract_dataset {cohort} --fields {field_list} -o cohort_data.csv"
cmd <- glue::glue(cohort_template)

cmd

system(cmd)
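A quick check that the extraction wrote the expected file:

# Confirm the extracted cohort data file was created
file.exists("cohort_data.csv")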

5.1.9.1 Preview data in the retrieved data file

We’ll see that the retrieved data contains integer and character codes rather than human-readable values. These must be decoded (see below):

data_df <- read_csv("cohort_data.csv", show_col_types = FALSE)
head(data_df)

5.2 Decoding columns with xvhelper

xvhelper is a small R package that decodes the extracted data into its actual values.

To use it, you build a coded_col_df using merge_coding_data_dict(), translate the categorical columns to values using decode_single() and decode_multi_purrr(), and then change the column names to readable ones using decode_column_names() (here with r_clean_names = FALSE to keep the original field titles).

Note that the decoding steps need to run before decode_column_names(), as in the pipeline below.

# xvhelper was installed earlier via remotes::install_github(); the call is
# repeated here in case you are running this section on its own
# install.packages("remotes")
remotes::install_github("laderast/xvhelper")

library(xvhelper)
coded_col_df <- xvhelper::merge_coding_data_dict(coding_dict = codings_df, data_dict = data_dict_df)
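It can help to peek at the merged dictionary before decoding, to see how the fields and codings were combined:

# Inspect the merged coding/data dictionary
head(coded_col_df)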

decoded <- data_df %>%
    xvhelper::decode_single(coded_col_df) %>%
    xvhelper::decode_multi_purrr(coded_col_df) %>%
    xvhelper::decode_column_names(coded_col_df, r_clean_names = FALSE)
    
head(decoded)
write.csv(decoded, file="cohort_decoded.csv")

5.2.1 Save Output to Project

system("dx upload *.csv --destination /users/tladeras/")