Name: MIMIC-III - SequenceExamples for TensorFlow modeling
Published: Sept. 29, 2020
License: https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

Database Credentialed Access

Jonas Kemp, Kun Zhang, Andrew Dai

Published: Sept. 29, 2020. Version: 1.0.0

When using this resource, please cite:
Kemp, J., Zhang, K., & Dai, A. (2020). MIMIC-III - SequenceExamples for TensorFlow modeling (version 1.0.0). PhysioNet. https://doi.org/10.13026/n2v5-5b32.

MLA	Kemp, Jonas, et al. "MIMIC-III - SequenceExamples for TensorFlow modeling" (version 1.0.0). PhysioNet (2020), https://doi.org/10.13026/n2v5-5b32.
Chicago	Kemp, Jonas, Zhang, Kun, and Andrew Dai. "MIMIC-III - SequenceExamples for TensorFlow modeling" (version 1.0.0). PhysioNet (2020). https://doi.org/10.13026/n2v5-5b32.
Harvard	Kemp, J., Zhang, K., and Dai, A. (2020) 'MIMIC-III - SequenceExamples for TensorFlow modeling' (version 1.0.0), PhysioNet. Available at: https://doi.org/10.13026/n2v5-5b32.
Vancouver	Kemp J, Zhang K, Dai A. MIMIC-III - SequenceExamples for TensorFlow modeling (version 1.0.0). PhysioNet. 2020. Available from: https://doi.org/10.13026/n2v5-5b32.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

MLA	Goldberger, A., et al. "PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220." (2000).
CHICAGO	Goldberger, A., L. Amaral, L. Glass, J. Hausdorff, P. C. Ivanov, R. Mark, J. E. Mietus, G. B. Moody, C. K. Peng, and H. E. Stanley. "PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220." (2000).
HARVARD	Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P.C., Mark, R., Mietus, J.E., Moody, G.B., Peng, C.K. and Stanley, H.E., 2000. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
VANCOUVER	Goldberger A, Amaral L, Glass L, Hausdorff J, Ivanov PC, Mark R, Mietus JE, Moody GB, Peng CK, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

This dataset contains TensorFlow SequenceExamples derived from patient records in MIMIC-III, a freely available set of deidentified medical records from critical care patients at Beth Israel Deaconess Medical Center. Each SequenceExample converts data from an individual patient encounter and any previous encounters into a set of timestamped “feature lists” describing the patient history up to a certain time, beyond which predictions can be made. These data are suitable for direct input into TensorFlow modeling pipelines, and include labels for inpatient mortality and discharge diagnosis codes for each encounter. The intent of this release is to provide a preprocessed, ready-to-use version of MIMIC-III to support and enable reproducible machine learning research for electronic health records.

Background

Since its release, MIMIC-III has become a common benchmark in machine learning and predictive modeling using electronic health records [1]. Because different analyses apply different preprocessing methods, precisely reproducing results across studies may be challenging. Some reusable open-source pipelines are now available [2, 3], but may not fit every use case. This version of the data was used to produce the results published in Kemp et al. [4].

Methods

These data were generated by 1) mapping MIMIC-III to the HL7 FHIR standard, and 2) converting the FHIR-formatted data to SequenceExamples using an open-source pipeline. FHIR is an open standard for representing and exchanging healthcare data, designed to promote interoperability. This methodology was first described in Rajkomar et al. [5]. Currently, the tools for mapping MIMIC-III to FHIR are not publicly available.

Data Description

The data are provided in TFRecord format, which can be ingested into TensorFlow models using the tf.data API. Individual records are encoded as tf.train.SequenceExample protocol buffers (referred to here as SequenceExamples). SequenceExamples contain a set of sequential features representing the patient state over time (e.g. a patient’s blood pressure or medication orders during a hospital visit). In the SequenceExample protocol buffer, these are encoded as “feature_lists”, consisting of key-value pairs with the feature name as the key and the list of feature values over time as the value. A feature_list may have many or no values at any given timestep. The SequenceExample also includes a set of “context” features applying to the whole example (e.g. patient age or gender), including the label for classification (e.g. inpatient mortality).
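The layout described above can be illustrated with a toy record. The following is a minimal sketch, not the pipeline used to build this dataset: the feature values ("female", "heart_rate", "sbp") and the choice of a string dtype are assumptions made for illustration, and only two of the real feature keys are used. It builds one SequenceExample with a context block and a feature_list whose timesteps hold two, zero, and one values, then parses it back.

```python
import tensorflow as tf


def bytes_feature(values):
    """Wrap a list of Python strings as a tf.train.Feature."""
    encoded = [v.encode("utf-8") for v in values]
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=encoded))


def int64_feature(values):
    """Wrap a list of Python ints as a tf.train.Feature."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))


# Toy record; all values are hypothetical.
example = tf.train.SequenceExample(
    # Context features apply to the whole example.
    context=tf.train.Features(feature={
        "Patient.gender": bytes_feature(["female"]),
        "sequenceLength": int64_feature([3]),
    }),
    # Each feature_list holds one Feature per timestep.
    feature_lists=tf.train.FeatureLists(feature_list={
        "Observation.code": tf.train.FeatureList(feature=[
            bytes_feature(["heart_rate", "sbp"]),  # two values at step 0
            bytes_feature([]),                     # no values at step 1
            bytes_feature(["heart_rate"]),         # one value at step 2
        ]),
    }),
)

# Parsing specs: fixed-length scalars for context, variable-length
# (sparse) values per timestep for the sequence feature.
context_spec = {
    "Patient.gender": tf.io.FixedLenFeature([], tf.string),
    "sequenceLength": tf.io.FixedLenFeature([], tf.int64),
}
sequence_spec = {"Observation.code": tf.io.VarLenFeature(tf.string)}

context, sequences = tf.io.parse_single_sequence_example(
    example.SerializeToString(),
    context_features=context_spec,
    sequence_features=sequence_spec,
)

# Densify to [timesteps, max values per timestep], padding with "".
codes = tf.sparse.to_dense(sequences["Observation.code"], default_value="")
```

Using tf.io.VarLenFeature for the sequence feature is what allows a timestep to carry a variable number of values, including none.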

The dataset contains the following folders, each corresponding to a label or prediction task:

in_hospital_death_at_24hrs
in_hospital_death_at_end_of_encounter
inpatient_icd9_at_end_of_encounter
primary_ccs_at_end_of_encounter

Each folder contains a set of TFRecords comprising information on all MIMIC-III hospital admissions of at least 24 hours. (For primary_ccs_at_end_of_encounter, the 1.3% of these encounters whose primary diagnosis corresponded to a non-billable ICD-9 code were excluded.) Each SequenceExample includes data for a given admission up to the time of prediction, as well as all data from past admissions for the same patient. The records are split into approximately 80% training, 10% validation and 10% test sets by patient ID. The versions of the data corresponding to each prediction task differ according to:

- which labels are present in the context features (one of: in-hospital mortality, CCS code for primary discharge diagnosis, or ICD-9 codes for all discharge diagnoses)
- the time of prediction, beyond which any data in the latest admission are dropped (24 hours after admission, or at the end of the encounter)
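Because the splits are made by patient ID, no patient should appear in more than one of the training, validation and test sets. A small sanity check along these lines is sketched below. It assumes the patientId context values have already been collected per split; the function name and the toy IDs are hypothetical.

```python
from itertools import combinations


def find_patient_leakage(split_to_patient_ids):
    """Return {(split_a, split_b): shared_ids} for splits that share patients.

    An empty result means the splits are disjoint at the patient level.
    """
    id_sets = {name: set(ids) for name, ids in split_to_patient_ids.items()}
    leaks = {}
    for split_a, split_b in combinations(sorted(id_sets), 2):
        shared = id_sets[split_a] & id_sets[split_b]
        if shared:
            leaks[(split_a, split_b)] = shared
    return leaks


# Toy IDs: patient 2 was deliberately placed in two splits.
# A patient may repeat within a split (one example per encounter).
toy_splits = {"train": [1, 2, 3, 3], "validation": [4], "test": [5, 2]}
leaks = find_patient_leakage(toy_splits)
```

Here the check reports patient 2 as shared between the test and train splits; on the released data the expectation is an empty result.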

Each folder also includes a set of vocab files (provided as a ZIP archive). The filenames in this directory correspond to feature keys (for the context or feature_lists) in the SequenceExamples, and are derived from the names of the corresponding HL7 FHIR resources, e.g. “Composition.section.text.div.tokenized” for free-text notes. {feature_name}.txt lists all tokens in the vocabulary for the feature, and {feature_name}_freqs.csv lists the corresponding frequency of each token in the data.
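As a sketch of how a vocabulary file might be turned into a token-to-index lookup, the helper below reads {feature_name}.txt from an extracted vocab directory. It assumes one token per line and UTF-8 encoding, neither of which is stated above, and the function name and directory argument are hypothetical. The layout of the _freqs.csv files is not described here, so they are not parsed.

```python
from pathlib import Path


def load_vocab(vocab_dir, feature_name):
    """Map each token in {feature_name}.txt to its line index.

    Assumes one token per line (an assumption, not documented above).
    Blank lines are skipped.
    """
    path = Path(vocab_dir) / f"{feature_name}.txt"
    with path.open(encoding="utf-8") as handle:
        tokens = [line.rstrip("\n") for line in handle if line.strip()]
    return {token: index for index, token in enumerate(tokens)}
```

For example, load_vocab("vocab", "Patient.gender") would read vocab/Patient.gender.txt, where "vocab" stands for wherever the ZIP archive was extracted.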

Usage Notes

See the included Colab notebook for a usage example showing how to load the data and train a deep neural network to predict inpatient mortality. Prior to modeling, the data can be inspected by simply iterating over the Dataset created in the first section of the notebook, and directly extracting the parsed Tensors corresponding to any features of interest for further analysis.

Loading the data requires specifying the list of context and sequence features to parse from each example. Names, vocabularies, and frequencies for all discrete features can be found in the vocabulary directory for each set of sequence examples. In addition to these features, a handful of numeric features are also available for use (note that all time or date features are provided as timestamps in Unix time, in seconds):

Input features:
- Patient.birthDate: the patient birthdate.
- Observation.value.quantity.value: a catch-all feature for a variety of quantitative measurements, e.g. labs or vitals. Should be used jointly with the features Observation.code and Observation.value.quantity.unit, which together identify the type and unit of the corresponding measurement.
- MedicationAdministration.dosage.{dose.value, rate.quantity.value}: features containing information on dosage of administered medications. Similar to above, should be used in conjunction with corresponding MedicationAdministration.dosage features for e.g. route, site, or unit.
- MedicationRequest.dosageInstruction.dose.{quantity.value, range.low.value, range.high.value}: features containing information on dosage for medication orders. Similar to above, should be used in conjunction with corresponding MedicationRequest.dosageInstruction features for e.g. code, route, or unit.
Metadata:
- patientId: unique ID number for the patient in the given example. Patients with multiple encounters may appear in multiple examples.
- currentEncounterId: timestamp for the beginning of the current encounter.
- timestamp: the timestamp of the prediction (i.e. the end time of the example sequence).
- label.*.timestamp_secs: the timestamp of the label.
- eventId: timestamps corresponding to each event in the given example.
- sequenceLength: the number of distinct event times in the given example.
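Since all time and date features are Unix timestamps in seconds, derived quantities such as patient age at prediction time follow from simple arithmetic on Patient.birthDate and timestamp. The sketch below uses hypothetical helper names and a 365.25-day year as an approximation. MIMIC-III shifts dates and obscures birthdates for patients over 89; assuming this dataset inherits that behavior, very large computed ages should be treated as censored rather than literal.

```python
from datetime import datetime, timedelta, timezone

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # approximate year length
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def age_in_years(birth_date_secs, prediction_time_secs):
    """Approximate age at prediction time, both inputs in Unix seconds."""
    return (prediction_time_secs - birth_date_secs) / SECONDS_PER_YEAR


def to_utc_datetime(unix_secs):
    """Convert Unix seconds to a timezone-aware UTC datetime.

    Adding a timedelta to the epoch also handles negative timestamps.
    """
    return UNIX_EPOCH + timedelta(seconds=unix_secs)


# Toy values: born at the epoch, prediction made 40 (approximate) years later.
toy_birth = 0
toy_prediction = int(40 * SECONDS_PER_YEAR)
toy_age = age_in_years(toy_birth, toy_prediction)
```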

The following features should be treated as context features, while all others should be treated as sequence features:

Patient.birthDate
Patient.gender
patientId
currentEncounterId
sequenceLength
timestamp
label.*
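When assembling the parsing specification, the list above can be applied mechanically to sort feature names into the two groups. The following is a minimal sketch with hypothetical helper names. It treats "label.*" as a shell-style wildcard, which is an interpretation of the notation above, and the label key in the toy input is made up.

```python
from fnmatch import fnmatchcase

# Names and patterns taken from the context-feature list above.
CONTEXT_PATTERNS = (
    "Patient.birthDate",
    "Patient.gender",
    "patientId",
    "currentEncounterId",
    "sequenceLength",
    "timestamp",
    "label.*",
)


def is_context_feature(name):
    """True if the name matches any context pattern (case-sensitive)."""
    return any(fnmatchcase(name, pattern) for pattern in CONTEXT_PATTERNS)


def partition_features(names):
    """Split names into (context, sequence) lists, preserving input order."""
    context = [name for name in names if is_context_feature(name)]
    sequence = [name for name in names if not is_context_feature(name)]
    return context, sequence


# Toy input; the label key is hypothetical.
toy_names = [
    "patientId",
    "eventId",
    "Observation.code",
    "label.in_hospital_death.timestamp_secs",
    "timestamp",
]
toy_context, toy_sequence = partition_features(toy_names)
```

In this toy input, eventId and Observation.code fall into the sequence group and the remaining three names into the context group.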

Use of this dataset is subject to the MIMIC access requirements, as documented on the MIMIC website.

Acknowledgements

We would like to acknowledge Greg S. Corrado, Claire Cui, Gerardo Flores, Nick George, Michael D. Howell, Jake Marcus, Alexander Mossin, Eyal Oren, Alvin Rajkomar, Mimi Sun, Tejas Sundaresan, Patrik Sundberg, Justin Tansuwan, De Wang, and Yi Zhang for contributing to the development or otherwise supporting the creation of this resource.

Conflicts of Interest

Google supported the creation of this resource.

References

1. Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L.H., Feng, M., Ghassemi, M., ... & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1), 1-9.
2. Shirly Wang, Matthew B. A. McDermott, Geeticka Chauhan, Michael C. Hughes, Tristan Naumann, and Marzyeh Ghassemi. MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III. arXiv:1907.08322.
3. Michael W. Sjoding, Shengpu Tang, Parmida Davarmanesh, Yanmeng Song, Danai Koutra, and Jenna Wiens. Democratizing EHR Analyses - a Comprehensive, Generalizable Pipeline for Learning from Clinical Data. Presented at MLHC (Machine Learning for Healthcare, Clinical Abstract), 2019.
4. Jonas Kemp, Alvin Rajkomar, and Andrew M. Dai. Improved Hierarchical Patient Classification with Language Model Pretraining over Clinical Notes. arXiv:1909.03039.
5. Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M. Dai, Nissan Hajaj, Michaela Hardt, Peter J. Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, Patrik Sundberg, Hector Yee, Kun Zhang, Yi Zhang, Gerardo Flores, Gavin E. Duggan, Jamie Irvine, Quoc Le, Kurt Litsch, Alexander Mossin, Justin Tansuwan, De Wang, James Wexler, Jimbo Wilson, Dana Ludwig, Samuel L. Volchenboum, Katherine Chou, Michael Pearson, Srinivasan Madabushi, Nigam H. Shah, Atul J. Butte, Michael D. Howell, Claire Cui, Greg S. Corrado & Jeffrey Dean. Scalable and accurate deep learning with electronic health records. npj Digital Med 1, 18 (2018). https://doi.org/10.1038/s41746-018-0029-1.