{"metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5"}}, "nbformat_minor": 5, "nbformat": 4, "cells": [{"cell_type": "markdown", "source": "# \u201cdx extract_dataset\u201d in Python\n
\n***As-Is Software Disclaimer***\n\nThis content in this repository is delivered \u201cAs-Is\u201d. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.\n\n
\n\nThis notebook demonstrates usage of the dx command `extract_dataset` for:\n* Retrieval of Apollo-stored data, as referenced within entities and fields of a Dataset or Cohort object on the platform\n* Retrieval of the underlying data dictionary files used to generate a Dataset object on the platform\n\nMIT License applies to this notebook.", "metadata": {}}, {"cell_type": "markdown", "source": "## Preparing your environment\n### Launch spec:\n\n* App name: JupyterLab with Python, R, Stata, ML ()\n* Kernel: Python\n* Instance type: mem1_ssd1_v2_x2\n* Cost: < $0.2\n* Runtime: =~ 10 min\n* Data description: Input for this notebook is a v3.0 Dataset or Cohort object ID", "metadata": {}}, {"cell_type": "markdown", "source": "### dxpy version\nextract_dataset requires dxpy version >= 0.329.0. If running the command from your local environment (i.e. off of the DNAnexus platform), it may be required to also install pandas. For example, pip3 install -U dxpy[pandas]", "metadata": {}}, {"cell_type": "code", "source": "import subprocess\nimport dxpy\nimport pandas as pd\nimport os\nimport glob\npd.set_option('display.max_columns', None)", "metadata": {"trusted": true}, "execution_count": 3, "outputs": []}, {"cell_type": "code", "source": "dxpy.__version__", "metadata": {"trusted": true}, "execution_count": 2, "outputs": [{"execution_count": 2, "output_type": "execute_result", "data": {"text/plain": "'0.330.0'"}, "metadata": {}}]}, {"cell_type": "markdown", "source": "### 1. Assign environment variables", "metadata": {}}, {"cell_type": "code", "source": "# The referenced Dataset is private and provided only to demonstrate an example input. The user will need to supply a permissible and valid record-id\n# Assign project-id of dataset\n# Assign dataset record-id\nrid = 'record-G5Ky4Gj08KQYQ4P810fJ8qPp'\n# Assign joint dataset project-id:record-id\n#dataset = (':').join([pid, rid])\n\ndataset = rid", "metadata": {"trusted": true}, "execution_count": 4, "outputs": []}, {"cell_type": "markdown", "source": "### 2. Call \u201cdx extract_dataset\u201d using a supplied dataset", "metadata": {}}, {"cell_type": "code", "source": "cmd = [\"dx\", \"extract_dataset\", dataset, \"-ddd\", \"--delimiter\", \",\"]\nsubprocess.check_call(cmd)", "metadata": {"trusted": true}, "execution_count": 5, "outputs": [{"execution_count": 5, "output_type": "execute_result", "data": {"text/plain": "0"}, "metadata": {}}]}, {"cell_type": "markdown", "source": "#### Preview data in the three dictionary (*.csv) files", "metadata": {}}, {"cell_type": "code", "source": "path = os.getcwd()", "metadata": {"trusted": true}, "execution_count": 6, "outputs": []}, {"cell_type": "code", "source": "data_dict_csv = glob.glob(os.path.join(path, \"*.data_dictionary.csv\"))[0]\ndata_dict_df = pd.read_csv(data_dict_csv)\ndata_dict_df.head()", "metadata": {"trusted": true}, "execution_count": 7, "outputs": [{"execution_count": 7, "output_type": "execute_result", "data": {"text/plain": " entity name type primary_key_type coding_name \\\n0 participant p22608_a24 integer NaN data_coding_493 \n1 participant p2784_i1 integer NaN data_coding_100349 \n2 participant p102780_i4 integer NaN data_coding_100001 \n3 participant p41217 integer NaN data_coding_228 \n4 participant p22704_a7 integer NaN NaN \n\n concept description folder_path \\\n0 NaN NaN Online follow-up > Work environment > Employme... \n1 NaN NaN UK Biobank Assessment Centre > Touchscreen > S... \n2 NaN NaN Online follow-up > Diet by 24-hour recall > Br... \n3 NaN NaN Health-related outcomes > Hospital inpatient >... \n4 NaN NaN Additional exposures > Local environment > Hom... \n\n is_multi_select is_sparse_coding \\\n0 NaN NaN \n1 NaN NaN \n2 NaN NaN \n3 yes NaN \n4 NaN NaN \n\n linkout longitudinal_axis_type \\\n0 http://biobank.ctsu.ox.ac.uk/crystal/field.cgi... NaN \n1 http://biobank.ctsu.ox.ac.uk/crystal/field.cgi... NaN \n2 http://biobank.ctsu.ox.ac.uk/crystal/field.cgi... NaN \n3 http://biobank.ctsu.ox.ac.uk/crystal/field.cgi... NaN \n4 http://biobank.ctsu.ox.ac.uk/crystal/field.cgi... NaN \n\n referenced_entity_field relationship \\\n0 NaN NaN \n1 NaN NaN \n2 NaN NaN \n3 NaN NaN \n4 NaN NaN \n\n title units \n0 Workplace very hot | Array 24 NaN \n1 Ever taken oral contraceptive pill | Instance 1 NaN \n2 Other grain intake | Instance 4 Serving \n3 Mental categories NaN \n4 Home location - north co-ordinate (rounded) | ... metre-grid ", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
entitynametypeprimary_key_typecoding_nameconceptdescriptionfolder_pathis_multi_selectis_sparse_codinglinkoutlongitudinal_axis_typereferenced_entity_fieldrelationshiptitleunits
0participantp22608_a24integerNaNdata_coding_493NaNNaNOnline follow-up > Work environment > Employme...NaNNaNhttp://biobank.ctsu.ox.ac.uk/crystal/field.cgi...NaNNaNNaNWorkplace very hot | Array 24NaN
1participantp2784_i1integerNaNdata_coding_100349NaNNaNUK Biobank Assessment Centre > Touchscreen > S...NaNNaNhttp://biobank.ctsu.ox.ac.uk/crystal/field.cgi...NaNNaNNaNEver taken oral contraceptive pill | Instance 1NaN
2participantp102780_i4integerNaNdata_coding_100001NaNNaNOnline follow-up > Diet by 24-hour recall > Br...NaNNaNhttp://biobank.ctsu.ox.ac.uk/crystal/field.cgi...NaNNaNNaNOther grain intake | Instance 4Serving
3participantp41217integerNaNdata_coding_228NaNNaNHealth-related outcomes > Hospital inpatient >...yesNaNhttp://biobank.ctsu.ox.ac.uk/crystal/field.cgi...NaNNaNNaNMental categoriesNaN
4participantp22704_a7integerNaNNaNNaNNaNAdditional exposures > Local environment > Hom...NaNNaNhttp://biobank.ctsu.ox.ac.uk/crystal/field.cgi...NaNNaNNaNHome location - north co-ordinate (rounded) | ...metre-grid
\n
"}, "metadata": {}}]}, {"cell_type": "code", "source": "codings_csv = glob.glob(os.path.join(path, \"*.codings.csv\"))[0]\ncodings_df = pd.read_csv(codings_csv)\ncodings_df.head()", "metadata": {"trusted": true}, "execution_count": 8, "outputs": [{"execution_count": 8, "output_type": "execute_result", "data": {"text/plain": " coding_name code meaning concept display_order parent_code\n0 data_coding_493 -121 Do not know NaN 1 NaN\n1 data_coding_493 -131 Sometimes NaN 2 NaN\n2 data_coding_493 -141 Often NaN 3 NaN\n3 data_coding_493 0 Rarely/never NaN 4 NaN\n4 data_coding_100349 1 Yes NaN 1 NaN", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
coding_namecodemeaningconceptdisplay_orderparent_code
0data_coding_493-121Do not knowNaN1NaN
1data_coding_493-131SometimesNaN2NaN
2data_coding_493-141OftenNaN3NaN
3data_coding_4930Rarely/neverNaN4NaN
4data_coding_1003491YesNaN1NaN
\n
"}, "metadata": {}}]}, {"cell_type": "code", "source": "entity_dict_csv = glob.glob(os.path.join(path, \"*.entity_dictionary.csv\"))[0]\nentity_dict_df = pd.read_csv(entity_dict_csv)\nentity_dict_df.head()", "metadata": {"trusted": true}, "execution_count": 9, "outputs": [{"execution_count": 9, "output_type": "execute_result", "data": {"text/plain": " entity entity_title entity_description \\\n0 participant Participant Participant data \n1 pheno_geno_sample_ids pheno_geno_sample_ids pheno_geno_sample_ids \n\n entity_label_plural entity_label_singular \n0 Participants Participant \n1 pheno_geno_sample_ids pheno_geno_sample_ids ", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
entityentity_titleentity_descriptionentity_label_pluralentity_label_singular
0participantParticipantParticipant dataParticipantsParticipant
1pheno_geno_sample_idspheno_geno_sample_idspheno_geno_sample_idspheno_geno_sample_idspheno_geno_sample_ids
\n
"}, "metadata": {}}]}, {"cell_type": "markdown", "source": "### 3. Parse returned metadata and extract entity/field names", "metadata": {}}, {"cell_type": "code", "source": "data_dict_df['ent_field'] = data_dict_df['entity'].astype(str) + \\\n '.' + data_dict_df['name'].astype(str)\n \nentity_field = data_dict_df.ent_field.values.tolist()\nentity_field2 = entity_field[500:530]\nentity_field2.append('participant.eid')\nentity_field2", "metadata": {"trusted": true}, "execution_count": 17, "outputs": [{"execution_count": 17, "output_type": "execute_result", "data": {"text/plain": "['participant.p5983_i0_a93',\n 'participant.p20008_i2_a19',\n 'participant.p10147_i0_a0',\n 'participant.p25329_i2',\n 'participant.p20014_i2_a5',\n 'participant.p5078_i0_a0',\n 'participant.p5877_i2',\n 'participant.p102830_i1',\n 'participant.p2624_i1',\n 'participant.p2754_i2',\n 'participant.p4244_i1_a12',\n 'participant.p25686_i2',\n 'participant.p25836_i2',\n 'participant.p4260_i2_a9',\n 'participant.p4245_i0_a15',\n 'participant.p20229_a113',\n 'participant.p30001_i0',\n 'participant.p1249_i1',\n 'participant.p1970_i1',\n 'participant.p5141_i1_a2',\n 'participant.p1448_i2',\n 'participant.p22602_a13',\n 'participant.p22140',\n 'participant.p25586_i2',\n 'participant.p20010_i2_a1',\n 'participant.p100320_i4',\n 'participant.p401_i1_a2',\n 'participant.p25005_i2',\n 'participant.p5077_i0_a1',\n 'participant.p20117_i0',\n 'participant.eid']"}, "metadata": {}}]}, {"cell_type": "markdown", "source": "### 4. Use extracted entity and field names as input to the called function, \u201cdx extract_dataset\u201d and extract data", "metadata": {}}, {"cell_type": "code", "source": "cmd = [\"dx\", \"extract_dataset\", dataset, \"--fields\", ','.join(entity_field2), \n \"-o\", \"cohort_data.csv\"]\n\nsubprocess.check_call(cmd)", "metadata": {"trusted": true}, "execution_count": 18, "outputs": [{"execution_count": 18, "output_type": "execute_result", "data": {"text/plain": "0"}, "metadata": {}}]}, {"cell_type": "markdown", "source": "#### Print data in the retrieved data file", "metadata": {}}, {"cell_type": "code", "source": "fields_file_df = pd.read_csv(\"cohort_data.csv\", float_precision='round_trip')\nfields_file_df.head()", "metadata": {"trusted": true}, "execution_count": 19, "outputs": [{"execution_count": 19, "output_type": "execute_result", "data": {"text/plain": " participant.p5983_i0_a93 participant.p20008_i2_a19 \\\n0 NaN NaN \n1 NaN NaN \n2 NaN NaN \n3 NaN NaN \n4 NaN NaN \n\n participant.p10147_i0_a0 participant.p25329_i2 participant.p20014_i2_a5 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 NaN NaN NaN \n4 NaN NaN NaN \n\n participant.p5078_i0_a0 participant.p5877_i2 participant.p102830_i1 \\\n0 NaN NaN NaN \n1 0.0 NaN NaN \n2 NaN NaN NaN \n3 -0.1 NaN NaN \n4 NaN NaN NaN \n\n participant.p2624_i1 participant.p2754_i2 participant.p4244_i1_a12 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 NaN NaN NaN \n4 NaN NaN NaN \n\n participant.p25686_i2 participant.p25836_i2 participant.p4260_i2_a9 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN 6419.67 NaN \n3 NaN NaN NaN \n4 NaN NaN NaN \n\n participant.p4245_i0_a15 participant.p20229_a113 participant.p30001_i0 \\\n0 24.0 NaN 0.0 \n1 703.0 NaN 0.0 \n2 NaN NaN 0.0 \n3 NaN NaN 0.0 \n4 32.0 NaN 0.0 \n\n participant.p1249_i1 participant.p1970_i1 participant.p5141_i1_a2 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 NaN 0.0 NaN \n4 NaN NaN NaN \n\n participant.p1448_i2 participant.p22602_a13 participant.p22140 \\\n0 NaN NaN 0.0 \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 NaN NaN NaN \n4 NaN NaN 0.0 \n\n participant.p25586_i2 participant.p20010_i2_a1 participant.p100320_i4 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 NaN NaN NaN \n4 NaN NaN NaN \n\n participant.p401_i1_a2 participant.p25005_i2 participant.p5077_i0_a1 \\\n0 NaN NaN NaN \n1 NaN NaN 1.0 \n2 NaN NaN NaN \n3 NaN NaN NaN \n4 NaN NaN NaN \n\n participant.p20117_i0 participant.eid \n0 2.0 sample_100_116 \n1 2.0 sample_100_142 \n2 2.0 sample_100_285 \n3 2.0 sample_100_290 \n4 2.0 sample_100_304 ", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
participant.p5983_i0_a93participant.p20008_i2_a19participant.p10147_i0_a0participant.p25329_i2participant.p20014_i2_a5participant.p5078_i0_a0participant.p5877_i2participant.p102830_i1participant.p2624_i1participant.p2754_i2participant.p4244_i1_a12participant.p25686_i2participant.p25836_i2participant.p4260_i2_a9participant.p4245_i0_a15participant.p20229_a113participant.p30001_i0participant.p1249_i1participant.p1970_i1participant.p5141_i1_a2participant.p1448_i2participant.p22602_a13participant.p22140participant.p25586_i2participant.p20010_i2_a1participant.p100320_i4participant.p401_i1_a2participant.p25005_i2participant.p5077_i0_a1participant.p20117_i0participant.eid
0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN24.0NaN0.0NaNNaNNaNNaNNaN0.0NaNNaNNaNNaNNaNNaN2.0sample_100_116
1NaNNaNNaNNaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaN703.0NaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN1.02.0sample_100_142
2NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN6419.67NaNNaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN2.0sample_100_285
3NaNNaNNaNNaNNaN-0.1NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.0NaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN2.0sample_100_290
4NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN32.0NaN0.0NaNNaNNaNNaNNaN0.0NaNNaNNaNNaNNaNNaN2.0sample_100_304
\n
"}, "metadata": {}}]}, {"cell_type": "markdown", "source": "### 5. Replace any coded column values of extracted data with the coded meaning", "metadata": {}}, {"cell_type": "code", "source": "data_dict_df['coding_value_type'] = data_dict_df['is_multi_select'].apply(\n lambda x: \"list\" if x == \"yes\" else \"string\")", "metadata": {"trusted": true}, "execution_count": 20, "outputs": []}, {"cell_type": "code", "source": "fields_file_df_decoded = fields_file_df.copy(deep=True)\n\ndef get_meaning(code_name, code):\n if isinstance(code, int):\n code = str(code)\n elif isinstance(code, float):\n code = str(code)\n # If field type is float, and an integer sparse code is used for the field \n # (example `1`), the retrieved data represents it as a float (`1.0`).\n # Strip the `.0` suffix and search for the code in codings dataframe\n if codings_df.loc[(codings_df[\"coding_name\"]== code_name) & \n (codings_df[\"code\"]== code), \"meaning\"].empty:\n if code.endswith('.0'):\n code = code[:-2]\n return(codings_df.loc[(codings_df[\"coding_name\"]== code_name) & \n (codings_df[\"code\"]== code), \"meaning\"])\n\nfor (columnName, columnData) in fields_file_df_decoded.iteritems():\n code_name, data_type= data_dict_df[(data_dict_df[\"ent_field\"]==columnName)][\n [\"coding_name\", \"coding_value_type\"]].values[0]\n if not pd.isna(code_name):\n set_of_values = set(columnData.dropna())\n for val in set_of_values:\n if data_type == \"list\":\n new_val = []\n list_val = eval(val)\n for i in list_val:\n meaning = get_meaning(code_name, i)\n if not meaning.empty:\n new_val.append(meaning.values.item())\n else:\n new_val.append(i)\n fields_file_df_decoded.loc[fields_file_df_decoded[columnName] == val, \n columnName] = str(new_val)\n continue\n elif data_type == \"string\":\n meaning = get_meaning(code_name, val)\n if not meaning.empty:\n fields_file_df_decoded.loc[fields_file_df_decoded[columnName] == val, \n columnName] = meaning.values.item()\nfields_file_df_decoded", "metadata": {"trusted": true}, "execution_count": 21, "outputs": [{"execution_count": 21, "output_type": "execute_result", "data": {"text/plain": " participant.p5983_i0_a93 participant.p20008_i2_a19 \\\n0 NaN NaN \n1 NaN NaN \n2 NaN NaN \n3 NaN NaN \n4 NaN NaN \n... ... ... \n17012 NaN NaN \n17013 NaN NaN \n17014 NaN NaN \n17015 NaN NaN \n17016 NaN NaN \n\n participant.p10147_i0_a0 participant.p25329_i2 \\\n0 NaN NaN \n1 NaN NaN \n2 NaN NaN \n3 NaN NaN \n4 NaN NaN \n... ... ... \n17012 NaN NaN \n17013 NaN NaN \n17014 NaN NaN \n17015 NaN NaN \n17016 NaN NaN \n\n participant.p20014_i2_a5 participant.p5078_i0_a0 participant.p5877_i2 \\\n0 NaN NaN NaN \n1 NaN 0.0 NaN \n2 NaN NaN NaN \n3 NaN -0.1 NaN \n4 NaN NaN NaN \n... ... ... ... \n17012 NaN -0.1 NaN \n17013 NaN NaN NaN \n17014 NaN NaN NaN \n17015 NaN 0.2 NaN \n17016 NaN NaN NaN \n\n participant.p102830_i1 participant.p2624_i1 participant.p2754_i2 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 NaN NaN NaN \n4 NaN NaN NaN \n... ... ... ... \n17012 NaN NaN NaN \n17013 NaN NaN NaN \n17014 NaN NaN NaN \n17015 NaN NaN NaN \n17016 NaN NaN NaN \n\n participant.p4244_i1_a12 participant.p25686_i2 participant.p25836_i2 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN NaN 6419.67 \n3 NaN NaN NaN \n4 NaN NaN NaN \n... ... ... ... \n17012 NaN NaN NaN \n17013 NaN NaN NaN \n17014 NaN NaN NaN \n17015 NaN NaN NaN \n17016 NaN NaN NaN \n\n participant.p4260_i2_a9 participant.p4245_i0_a15 \\\n0 NaN 24.0 \n1 NaN 703.0 \n2 NaN NaN \n3 NaN NaN \n4 NaN 32.0 \n... ... ... \n17012 NaN 20.0 \n17013 NaN 17.0 \n17014 NaN 15.0 \n17015 NaN 11.0 \n17016 NaN 20.0 \n\n participant.p20229_a113 participant.p30001_i0 participant.p1249_i1 \\\n0 NaN 0.0 NaN \n1 NaN 0.0 NaN \n2 NaN 0.0 NaN \n3 NaN 0.0 NaN \n4 NaN 0.0 NaN \n... ... ... ... \n17012 NaN 0.0 NaN \n17013 NaN 0.0 NaN \n17014 NaN 0.0 NaN \n17015 NaN 0.0 NaN \n17016 NaN 0.0 NaN \n\n participant.p1970_i1 participant.p5141_i1_a2 participant.p1448_i2 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 No NaN NaN \n4 NaN NaN NaN \n... ... ... ... \n17012 NaN NaN NaN \n17013 NaN NaN NaN \n17014 NaN NaN NaN \n17015 NaN NaN NaN \n17016 NaN NaN NaN \n\n participant.p22602_a13 participant.p22140 participant.p25586_i2 \\\n0 NaN No NaN \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 NaN NaN NaN \n4 NaN No NaN \n... ... ... ... \n17012 NaN NaN NaN \n17013 NaN NaN NaN \n17014 NaN NaN NaN \n17015 NaN NaN NaN \n17016 NaN No NaN \n\n participant.p20010_i2_a1 participant.p100320_i4 participant.p401_i1_a2 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 NaN NaN NaN \n4 NaN NaN NaN \n... ... ... ... \n17012 NaN NaN NaN \n17013 NaN NaN NaN \n17014 NaN NaN NaN \n17015 NaN NaN NaN \n17016 NaN NaN NaN \n\n participant.p25005_i2 participant.p5077_i0_a1 participant.p20117_i0 \\\n0 NaN NaN Current \n1 NaN 1.0 Current \n2 NaN NaN Current \n3 NaN NaN Current \n4 NaN NaN Current \n... ... ... ... \n17012 NaN NaN Current \n17013 NaN 5.0 Current \n17014 NaN 0.0 Current \n17015 NaN 2.0 Current \n17016 NaN NaN Current \n\n participant.eid \n0 sample_100_116 \n1 sample_100_142 \n2 sample_100_285 \n3 sample_100_290 \n4 sample_100_304 \n... ... \n17012 sample_9_290 \n17013 sample_9_312 \n17014 sample_9_320 \n17015 sample_9_394 \n17016 sample_9_82 \n\n[17017 rows x 31 columns]", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
participant.p5983_i0_a93participant.p20008_i2_a19participant.p10147_i0_a0participant.p25329_i2participant.p20014_i2_a5participant.p5078_i0_a0participant.p5877_i2participant.p102830_i1participant.p2624_i1participant.p2754_i2participant.p4244_i1_a12participant.p25686_i2participant.p25836_i2participant.p4260_i2_a9participant.p4245_i0_a15participant.p20229_a113participant.p30001_i0participant.p1249_i1participant.p1970_i1participant.p5141_i1_a2participant.p1448_i2participant.p22602_a13participant.p22140participant.p25586_i2participant.p20010_i2_a1participant.p100320_i4participant.p401_i1_a2participant.p25005_i2participant.p5077_i0_a1participant.p20117_i0participant.eid
0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN24.0NaN0.0NaNNaNNaNNaNNaNNoNaNNaNNaNNaNNaNNaNCurrentsample_100_116
1NaNNaNNaNNaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaN703.0NaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN1.0Currentsample_100_142
2NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN6419.67NaNNaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNCurrentsample_100_285
3NaNNaNNaNNaNNaN-0.1NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.0NaNNoNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNCurrentsample_100_290
4NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN32.0NaN0.0NaNNaNNaNNaNNaNNoNaNNaNNaNNaNNaNNaNCurrentsample_100_304
................................................................................................
17012NaNNaNNaNNaNNaN-0.1NaNNaNNaNNaNNaNNaNNaNNaN20.0NaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNCurrentsample_9_290
17013NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN17.0NaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN5.0Currentsample_9_312
17014NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN15.0NaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.0Currentsample_9_320
17015NaNNaNNaNNaNNaN0.2NaNNaNNaNNaNNaNNaNNaNNaN11.0NaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN2.0Currentsample_9_394
17016NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN20.0NaN0.0NaNNaNNaNNaNNaNNoNaNNaNNaNNaNNaNNaNCurrentsample_9_82
\n

17017 rows \u00d7 31 columns

\n
"}, "metadata": {}}]}, {"cell_type": "code", "source": "fields_file_df_decoded.to_csv(\"extracted_data_with_code_meanings.csv\", index=False)", "metadata": {"trusted": true}, "execution_count": 22, "outputs": []}, {"cell_type": "markdown", "source": "### 6. Drop sparsely coded values", "metadata": {}}, {"cell_type": "code", "source": "fields_sparse_code = fields_file_df.copy(deep=True)", "metadata": {"trusted": true}, "execution_count": 23, "outputs": []}, {"cell_type": "code", "source": "for (columnName, columnData) in fields_sparse_code.iteritems():\n code_name, data_type, is_sparse_coding= data_dict_df[\n (data_dict_df[\"ent_field\"]==columnName)][\n [\"coding_name\", \"coding_value_type\", \"is_sparse_coding\"]].values[0]\n if not (pd.isna(code_name) and pd.isna(is_sparse_coding)) and \\\n is_sparse_coding=='yes':\n set_of_values = set(columnData.dropna())\n for val in set_of_values:\n if data_type == \"list\":\n new_val = []\n list_val = eval(val)\n for i in list_val:\n meaning = get_meaning(code_name, i)\n if meaning.empty:\n new_val.append(i)\n fields_sparse_code.loc[fields_sparse_code[columnName] == val, \n columnName] = str(new_val)\n continue\n elif data_type == \"string\":\n meaning = get_meaning(code_name, val)\n if not meaning.empty:\n fields_sparse_code.loc[fields_sparse_code[columnName] == val, \n columnName] = None\nfields_sparse_code.head()", "metadata": {"trusted": true}, "execution_count": 24, "outputs": [{"execution_count": 24, "output_type": "execute_result", "data": {"text/plain": " participant.p5983_i0_a93 participant.p20008_i2_a19 \\\n0 NaN NaN \n1 NaN NaN \n2 NaN NaN \n3 NaN NaN \n4 NaN NaN \n\n participant.p10147_i0_a0 participant.p25329_i2 participant.p20014_i2_a5 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 NaN NaN NaN \n4 NaN NaN NaN \n\n participant.p5078_i0_a0 participant.p5877_i2 participant.p102830_i1 \\\n0 NaN NaN NaN \n1 0.0 NaN NaN \n2 NaN NaN NaN \n3 -0.1 NaN NaN \n4 NaN NaN NaN \n\n participant.p2624_i1 participant.p2754_i2 participant.p4244_i1_a12 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 NaN NaN NaN \n4 NaN NaN NaN \n\n participant.p25686_i2 participant.p25836_i2 participant.p4260_i2_a9 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN 6419.67 NaN \n3 NaN NaN NaN \n4 NaN NaN NaN \n\n participant.p4245_i0_a15 participant.p20229_a113 participant.p30001_i0 \\\n0 24.0 NaN 0.0 \n1 703.0 NaN 0.0 \n2 NaN NaN 0.0 \n3 NaN NaN 0.0 \n4 32.0 NaN 0.0 \n\n participant.p1249_i1 participant.p1970_i1 participant.p5141_i1_a2 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 NaN 0.0 NaN \n4 NaN NaN NaN \n\n participant.p1448_i2 participant.p22602_a13 participant.p22140 \\\n0 NaN NaN 0.0 \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 NaN NaN NaN \n4 NaN NaN 0.0 \n\n participant.p25586_i2 participant.p20010_i2_a1 participant.p100320_i4 \\\n0 NaN NaN NaN \n1 NaN NaN NaN \n2 NaN NaN NaN \n3 NaN NaN NaN \n4 NaN NaN NaN \n\n participant.p401_i1_a2 participant.p25005_i2 participant.p5077_i0_a1 \\\n0 NaN NaN NaN \n1 NaN NaN 1.0 \n2 NaN NaN NaN \n3 NaN NaN NaN \n4 NaN NaN NaN \n\n participant.p20117_i0 participant.eid \n0 2.0 sample_100_116 \n1 2.0 sample_100_142 \n2 2.0 sample_100_285 \n3 2.0 sample_100_290 \n4 2.0 sample_100_304 ", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
participant.p5983_i0_a93participant.p20008_i2_a19participant.p10147_i0_a0participant.p25329_i2participant.p20014_i2_a5participant.p5078_i0_a0participant.p5877_i2participant.p102830_i1participant.p2624_i1participant.p2754_i2participant.p4244_i1_a12participant.p25686_i2participant.p25836_i2participant.p4260_i2_a9participant.p4245_i0_a15participant.p20229_a113participant.p30001_i0participant.p1249_i1participant.p1970_i1participant.p5141_i1_a2participant.p1448_i2participant.p22602_a13participant.p22140participant.p25586_i2participant.p20010_i2_a1participant.p100320_i4participant.p401_i1_a2participant.p25005_i2participant.p5077_i0_a1participant.p20117_i0participant.eid
0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN24.0NaN0.0NaNNaNNaNNaNNaN0.0NaNNaNNaNNaNNaNNaN2.0sample_100_116
1NaNNaNNaNNaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaN703.0NaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN1.02.0sample_100_142
2NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN6419.67NaNNaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN2.0sample_100_285
3NaNNaNNaNNaNNaN-0.1NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.0NaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN2.0sample_100_290
4NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN32.0NaN0.0NaNNaNNaNNaNNaN0.0NaNNaNNaNNaNNaNNaN2.0sample_100_304
\n
"}, "metadata": {}}]}, {"cell_type": "code", "source": "fields_sparse_code.to_csv(\"extracted_data_with_sparse_code_drop.csv\", index=False)", "metadata": {"trusted": true}, "execution_count": 25, "outputs": []}, {"cell_type": "markdown", "source": "### 7. Replace the column titles (field names) of extracted data with the field titles", "metadata": {}}, {"cell_type": "code", "source": "current_columns = list(fields_file_df.columns)", "metadata": {"trusted": true}, "execution_count": 26, "outputs": []}, {"cell_type": "code", "source": "new_columns = {}\ntitles = []\nduplicate_titles = []\nfor val in current_columns:\n meaning = data_dict_df.loc[data_dict_df[\"ent_field\"]==val, \n \"title\"].values.item()\n if meaning not in titles:\n titles.append(meaning)\n elif meaning not in duplicate_titles:\n duplicate_titles.append(meaning)\nfor val in current_columns:\n meaning = data_dict_df.loc[data_dict_df[\"ent_field\"]==val, \n \"title\"].values.item()\n if meaning not in duplicate_titles:\n new_columns[val] = meaning\n else:\n new_columns[val] = val.replace(\".\", \"-\")", "metadata": {"trusted": true}, "execution_count": 27, "outputs": []}, {"cell_type": "code", "source": "fields_file_df.rename(columns = new_columns, inplace = True)\nfields_file_df.head()", "metadata": {"trusted": true}, "execution_count": 28, "outputs": [{"execution_count": 28, "output_type": "execute_result", "data": {"text/plain": " ECG, heart rate | Instance 0 | Array 93 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Interpolated Year when non-cancer illness first diagnosed | Instance 2 | Array 19 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Duration to first press of snap-button in each round (pilot) | Instance 0 | Array 0 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Mean L3 in external capsule on FA skeleton (left) | Instance 2 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Method of recording time when operation occurred | Instance 2 | Array 5 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n logMAR in round (left) | Instance 0 | Array 0 \\\n0 NaN \n1 0.0 \n2 NaN \n3 -0.1 \n4 NaN \n\n Which eye(s) affected by other eye condition | Instance 2 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Soft cheese intake | Instance 1 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Frequency of heavy DIY in last 4 weeks | Instance 1 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Age at first live birth | Instance 2 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Mean signal-to-noise ratio (SNR), (right) | Instance 1 | Array 12 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Weighted-mean OD in tract corticospinal tract (right) | Instance 2 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Volume of grey matter in Paracingulate Gyrus (left) | Instance 2 \\\n0 NaN \n1 NaN \n2 6419.67 \n3 NaN \n4 NaN \n\n Round of numeric memory test | Instance 2 | Array 9 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Time to press first digit (right) | Instance 0 | Array 15 \\\n0 24.0 \n1 703.0 \n2 NaN \n3 NaN \n4 32.0 \n\n Values entered | Array 113 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n White blood cell (leukocyte) count freeze-thaw cycles | Instance 0 \\\n0 0.0 \n1 0.0 \n2 0.0 \n3 0.0 \n4 0.0 \n\n Past tobacco smoking | Instance 1 Nervous feelings | Instance 1 \\\n0 NaN NaN \n1 NaN NaN \n2 NaN NaN \n3 NaN 0.0 \n4 NaN NaN \n\n 3mm asymmetry index unreliable (left) | Instance 1 | Array 2 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Bread type | Instance 2 Year job started | Array 13 \\\n0 NaN NaN \n1 NaN NaN \n2 NaN NaN \n3 NaN NaN \n4 NaN NaN \n\n Doctor diagnosed lung cancer (not mesothelioma) \\\n0 0.0 \n1 NaN \n2 NaN \n3 NaN \n4 0.0 \n\n Weighted-mean L1 in tract medial lemniscus (left) | Instance 2 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Interpolated Year when operation took place | Instance 2 | Array 1 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Added milk to espresso | Instance 4 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Index for card A in round | Instance 1 | Array 2 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Volume of grey matter (normalised for head size) | Instance 2 \\\n0 NaN \n1 NaN \n2 NaN \n3 NaN \n4 NaN \n\n Number of letters correct in round (left) | Instance 0 | Array 1 \\\n0 NaN \n1 1.0 \n2 NaN \n3 NaN \n4 NaN \n\n Alcohol drinker status | Instance 0 Participant ID \n0 2.0 sample_100_116 \n1 2.0 sample_100_142 \n2 2.0 sample_100_285 \n3 2.0 sample_100_290 \n4 2.0 sample_100_304 ", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
ECG, heart rate | Instance 0 | Array 93Interpolated Year when non-cancer illness first diagnosed | Instance 2 | Array 19Duration to first press of snap-button in each round (pilot) | Instance 0 | Array 0Mean L3 in external capsule on FA skeleton (left) | Instance 2Method of recording time when operation occurred | Instance 2 | Array 5logMAR in round (left) | Instance 0 | Array 0Which eye(s) affected by other eye condition | Instance 2Soft cheese intake | Instance 1Frequency of heavy DIY in last 4 weeks | Instance 1Age at first live birth | Instance 2Mean signal-to-noise ratio (SNR), (right) | Instance 1 | Array 12Weighted-mean OD in tract corticospinal tract (right) | Instance 2Volume of grey matter in Paracingulate Gyrus (left) | Instance 2Round of numeric memory test | Instance 2 | Array 9Time to press first digit (right) | Instance 0 | Array 15Values entered | Array 113White blood cell (leukocyte) count freeze-thaw cycles | Instance 0Past tobacco smoking | Instance 1Nervous feelings | Instance 13mm asymmetry index unreliable (left) | Instance 1 | Array 2Bread type | Instance 2Year job started | Array 13Doctor diagnosed lung cancer (not mesothelioma)Weighted-mean L1 in tract medial lemniscus (left) | Instance 2Interpolated Year when operation took place | Instance 2 | Array 1Added milk to espresso | Instance 4Index for card A in round | Instance 1 | Array 2Volume of grey matter (normalised for head size) | Instance 2Number of letters correct in round (left) | Instance 0 | Array 1Alcohol drinker status | Instance 0Participant ID
0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN24.0NaN0.0NaNNaNNaNNaNNaN0.0NaNNaNNaNNaNNaNNaN2.0sample_100_116
1NaNNaNNaNNaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaN703.0NaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN1.02.0sample_100_142
2NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN6419.67NaNNaNNaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN2.0sample_100_285
3NaNNaNNaNNaNNaN-0.1NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN0.0NaN0.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN2.0sample_100_290
4NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN32.0NaN0.0NaNNaNNaNNaNNaN0.0NaNNaNNaNNaNNaNNaN2.0sample_100_304
\n
"}, "metadata": {}}]}, {"cell_type": "code", "source": "fields_file_df.to_csv(\"extracted_data_with_updated_titles.csv\", index=False)", "metadata": {"trusted": true}, "execution_count": 29, "outputs": []}, {"cell_type": "markdown", "source": "### 8. Upload extracted dictionaries and data back to the project", "metadata": {}}, {"cell_type": "code", "source": "cmd = \"dx upload *.csv --destination /users/tladeras/\"\nsubprocess.check_call(cmd, shell=True)", "metadata": {"trusted": true}, "execution_count": null, "outputs": []}]}