
MIMIC-III and eICU-CRD: Feature Representation by FIDDLE Preprocessing

Shengpu Tang, Parmida Davarmanesh, Yanmeng Song, Danai Koutra, Michael Sjoding, Jenna Wiens

Published: April 28, 2021. Version: 1.0.0

When using this resource, please cite:
Tang, S., Davarmanesh, P., Song, Y., Koutra, D., Sjoding, M., & Wiens, J. (2021). MIMIC-III and eICU-CRD: Feature Representation by FIDDLE Preprocessing (version 1.0.0). PhysioNet.

Additionally, please cite the original publication:

Tang, S., Davarmanesh, P., Song, Y., Koutra, D., Sjoding, M. W., & Wiens, J. (2020). Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data. Journal of the American Medical Informatics Association, 27(12), 1921-1934.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


This is a preprocessed dataset derived from patient records in MIMIC-III and eICU, two large-scale electronic health record (EHR) databases. It contains features and labels for 5 prediction tasks involving 3 adverse outcomes (prediction times listed in parentheses): in-hospital mortality (48h), acute respiratory failure (4h and 12h), and shock (4h and 12h). We extracted comprehensive, high-dimensional feature representations (up to ~8,000 features) using FIDDLE (FlexIble Data-Driven pipeLinE), an open-source preprocessing pipeline for structured clinical data. These 5 prediction tasks were designed in consultation with a critical care physician for their clinical importance, and were used as part of the proof-of-concept experiments in the original paper to demonstrate FIDDLE's utility in aiding the feature engineering step of machine learning model development. The intent of this release is to share preprocessed MIMIC-III and eICU datasets used in the experiments to support and enable reproducible machine learning research on EHR data. 


To date, researchers have successfully leveraged electronic health record (EHR) data and machine learning (ML) tools to build patient risk stratification models for many adverse outcomes. However, prior to applying ML, substantial effort must be devoted to preprocessing. EHR data are messy, often consisting of high-dimensional, irregularly sampled time series with multiple data types and missing values. Transforming EHR data into feature vectors suitable for ML techniques requires many decisions, such as what input variables to include, how to resample longitudinal data, and how to handle missing data, among many others. Currently, EHR data preprocessing is largely ad hoc and can vary widely between studies. In an effort to speed up and standardize the preprocessing of EHR data, we proposed FIDDLE [1], a tool that systematically transforms structured EHR data into representations that can be used as inputs to ML algorithms. We evaluated FIDDLE through a proof-of-concept experiment in the context of MIMIC-III and eICU [2-5]. This version of the data was used to produce the results published by Tang et al. [1].


From MIMIC-III, we focused on 17,710 patients (23,620 ICU visits) monitored using the iMDsoft MetaVision system (2008–2012). We chose MetaVision over the Philips CareVue system (2001–2008) because it is more recent and therefore reflects more up-to-date clinical practices [2,3]. Each ICU visit is identified by a unique ICUSTAY_ID.

The eICU Collaborative Research Database consists of data from 139,367 patients (200,859 ICU visits) who were admitted to 200 different ICUs located throughout the United States in 2014 and 2015 [4,5]. Each ICU visit is identified by a unique patientunitstayid.

For both databases, we extracted data from structured tables that pertain to patient health:

  • eICU (18 tables): patient, vitalPeriodic, vitalAperiodic, lab, customLab, medication, infusionDrug, intakeOutput, microLab, note, nurseAssessment, nurseCare, nurseCharting, pastHistory, physicalExam, respiratoryCare, respiratoryCharting, treatment

We formatted the data into a table with 4 columns: [ID, t, variable_name, variable_value] and then applied FIDDLE (using the default settings) on the processed data tables for each of the 5 prediction tasks to convert them into feature matrices. A snapshot of the code used for data extraction and preprocessing has been included in the code/ folder, but please refer to the project GitHub repository for the latest version [6].
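As an illustration of the 4-column input format described above, the following minimal sketch reshapes a small wide table into [ID, t, variable_name, variable_value] rows with pandas. The variable names (HR, SpO2), the values, and the unit of t are made up for illustration; this is not the extraction code used for the release, which is in the code/ folder and the GitHub repository [6].

```python
import pandas as pd

# Hypothetical wide-format measurements for two ICU stays (values are made up).
wide = pd.DataFrame({
    "ID": [1, 1, 2],
    "t": [0.5, 2.0, 1.0],        # time of recording; the unit (hours) is an assumption
    "HR": [88.0, 92.0, None],
    "SpO2": [97.0, None, 95.0],
})

# Melt into one row per recorded observation and drop values that were never recorded.
long_df = (
    wide.melt(id_vars=["ID", "t"], var_name="variable_name", value_name="variable_value")
        .dropna(subset=["variable_value"])
        .sort_values(["ID", "t", "variable_name"])
        .reset_index(drop=True)
)

print(long_df)
```

Each remaining row is a single observation of one variable for one ICU stay at one time, which is the shape of table the text says was passed to FIDDLE.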

Data Description

The root folder contains 2 subfolders, FIDDLE_mimic3/ and FIDDLE_eicu/, each of which contains 2 subfolders: population/ and features/.

  • population/ contains 5 files, one per prediction task, each specifying the population of ICU stays for that task. Each file also contains the onset hour (for ARF and shock) and the binary label for the adverse outcome.
    • mortality_48h.csv
    • ARF_4h.csv
    • ARF_12h.csv
    • Shock_4h.csv
    • Shock_12h.csv
  • features/ includes subfolders (named identically to the population files) that contain the time-invariant features s and time-dependent features X for the corresponding prediction task. These features can be used to replicate the main experiments of the paper. Within the subfolder for each task there are:
    • Time-invariant features
      • s.npz: an N × d sparse matrix containing time-invariant features.
      • s.feature_names.json: the string names of the d time-invariant features.
      • s.feature_aliases.json: the alias mapping of time-invariant features.
    • Time-dependent features
      • X.npz: an N × L × D sparse tensor containing time-dependent features.
      • X.feature_names.json: the string names of the D time-dependent features.
      • X.feature_aliases.json: the alias mapping of time-dependent features.
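Because the feature-name files list one name per feature, position j in X.feature_names.json presumably labels slice j along the last axis of X. The sketch below shows one way to use that alignment to pull out a group of features by name prefix. The array is a toy stand-in and the feature names are invented for illustration; the real names should be read from the JSON files, and select_features is a hypothetical helper, not part of FIDDLE.

```python
import numpy as np

# Toy stand-in for a dense X of shape N x L x D (real data comes from X.npz).
N, L, D = 3, 4, 5
X = np.arange(N * L * D).reshape(N, L, D)

# Invented feature names, one per position along the last axis of X.
X_feature_names = ["HR_mask", "HR_value_low", "HR_value_high", "SpO2_mask", "SpO2_value_low"]
assert len(X_feature_names) == X.shape[-1]


def select_features(X, names, prefix):
    """Return the sub-tensor and names of the features whose name starts with `prefix`."""
    idx = np.array([j for j, name in enumerate(names) if name.startswith(prefix)], dtype=int)
    return X[:, :, idx], [names[j] for j in idx]


X_hr, hr_names = select_features(X, X_feature_names, "HR_")
print(X_hr.shape, hr_names)
```

The same pattern applies to s and s.feature_names.json, with indexing on the second axis instead of the third.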

The cohort numbers and dimensionalities of extracted features are summarized below.

MIMIC-III                      N         d     D
In-hospital mortality (48h)    8,577     96    7,307
ARF (4h)                       15,873    98    4,045
ARF (12h)                      14,174    96    4,816
Shock (4h)                     19,342    98    4,522
Shock (12h)                    17,588    97    5,500

eICU                           N         d     D
In-hospital mortality (48h)    77,066    146   2,382
ARF (4h)                       138,840   717   5,854
ARF (12h)                      122,619   119   2,713
Shock (4h)                     164,333   770   6,314
Shock (12h)                    144,725   128   2,946

Usage Notes

See the included Jupyter notebook for an example of loading the features and labels. The project GitHub repository also contains the implementation of the experiments in the original paper, including code to load the features and labels and to train various machine learning models [6].

To load the features, you need Python and the sparse package [7].

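import json
import sparse

# Paths are relative to a dataset folder (FIDDLE_mimic3/ or FIDDLE_eicu/).
task = 'mortality_48h'  # or 'ARF_4h', 'ARF_12h', 'Shock_4h', 'Shock_12h'

# Load the sparse arrays and convert them to dense NumPy arrays.
s = sparse.load_npz(f'features/{task}/s.npz').todense()
X = sparse.load_npz(f'features/{task}/X.npz').todense()

# Load the string names of the time-invariant and time-dependent features.
with open(f'features/{task}/s.feature_names.json', 'r') as f:
    s_feature_names = json.load(f)
with open(f'features/{task}/X.feature_names.json', 'r') as f:
    X_feature_names = json.load(f)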
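Converting X to a dense array can need a lot of memory, so it may help to estimate the size first. The sketch below is a back-of-the-envelope calculation only: N = 8,577 and D = 7,307 are taken from the first row of the summary table, while L = 48 (one time step per hour over 48 hours) and the bytes per element are assumptions that should be checked against the shape and dtype of the loaded array. dense_gib is a hypothetical helper.

```python
def dense_gib(shape, bytes_per_element=1):
    """Approximate size in GiB of a dense array with the given shape."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 2**30


# N and D from the summary table; L = 48 is an assumption, not stated in this document.
shape = (8577, 48, 7307)

print(f"1 byte per element:  {dense_gib(shape, 1):.1f} GiB")
print(f"8 bytes per element: {dense_gib(shape, 8):.1f} GiB")
```

If the estimate exceeds available memory, one option is to skip .todense() on the full tensor and densify only the slice of ICU stays needed at a time.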

To load the labels, use pandas or an alternative csv reader [8]:

import pandas as pd

task = 'mortality_48h'  # same task name as used for the features
df_pop = pd.read_csv(f'population/{task}.csv')

For each task, df_pop, s, and X all have the same length N, the number of ICU stays in the study population of that task; each row corresponds to information pertaining to one ICU stay.
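Because the three objects share the same row order, a single index array can be used to split all of them consistently. The sketch below uses small random arrays in place of the real data and shows one common way to build a 2D feature matrix for models that expect one: flatten the time axis of X and concatenate it with s. The variable y stands in for the label column of df_pop, whose actual column name is not given here; this is an illustrative sketch, not necessarily the pipeline used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for s (N x d), X (N x L x D), and the binary labels from df_pop.
N, L, D, d = 6, 4, 5, 3
s = rng.integers(0, 2, size=(N, d))
X = rng.integers(0, 2, size=(N, L, D))
y = rng.integers(0, 2, size=N)

# All three must describe the same N ICU stays, in the same order.
assert len(s) == len(X) == len(y)

# Flatten the time axis of X and append the time-invariant features.
features = np.concatenate([s, X.reshape(N, L * D)], axis=1)

# Split rows with one shared permutation so features and labels stay aligned.
perm = rng.permutation(N)
train_idx, test_idx = perm[:4], perm[4:]
X_train, y_train = features[train_idx], y[train_idx]
X_test, y_test = features[test_idx], y[test_idx]

print(features.shape, X_train.shape, X_test.shape)
```

Sequence models can instead consume X directly in its N × L × D form, with s passed alongside it.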

Release Notes

Current Version

The current version of this dataset release is v1.0.0.


Initial release. 


This work was supported by the Michigan Institute for Data Science (MIDAS); the National Science Foundation award number IIS-1553146; the National Heart, Lung, and Blood Institute grant number R25HL147207; and the National Library of Medicine grant number R01LM013325. The views and conclusions in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Michigan Institute for Data Science; the National Science Foundation; the National Heart, Lung, and Blood Institute; or the National Library of Medicine.

The authors would also like to thank the members of the MLD3 group at the University of Michigan for helpful discussion regarding this work. 

Conflicts of Interest

The authors have no conflicts of interest to declare. 


  1. Tang, S., Davarmanesh, P., Song, Y., Koutra, D., Sjoding, M. W., & Wiens, J. (2020). Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data. Journal of the American Medical Informatics Association, 27(12), 1921-1934.
  2. Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016).
  3. Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). PhysioNet.
  4. Pollard, T., Johnson, A., Raffa, J., Celi, L. A., Badawi, O., & Mark, R. (2019). eICU Collaborative Research Database (version 2.0). PhysioNet.
  5. Pollard, T., Johnson, A., Raffa, J. et al. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci Data 5, 180178 (2018).
  6. FIDDLE code repository. [Accessed: 20 April 2021]
  7. Sparse package for Python. [Accessed: 20 April 2021]
  8. The Pandas Development Team. pandas-dev/pandas: Pandas. Zenodo. Feb 2020.

Parent Projects
MIMIC-III and eICU-CRD: Feature Representation by FIDDLE Preprocessing was derived from: Please cite them when using this project.

Access Policy:
Only PhysioNet credentialed users who sign the specified DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0
