Database Open Access
MIMIC-IV demo data in the Medical Event Data Standard (MEDS)
Robin Philippus van de Water, Ethan Steinberg, Michael Wornow, Patrick Rockenschaub, Matthew McDermott
Published: Sept. 29, 2025. Version: 0.0.1
When using this resource, please cite:
van de Water, R. P., Steinberg, E., Wornow, M., Rockenschaub, P., & McDermott, M. (2025). MIMIC-IV demo data in the Medical Event Data Standard (MEDS) (version 0.0.1). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/t2y8-ea41
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
This dataset is an automated ETL conversion of the MIMIC-IV Clinical Database Demo into the Medical Event Data Standard (MEDS). MEDS is a data schema for storing streams of medical events such as those sourced from Electronic Health Records or claims records. MEDS is intentionally a minimal standard, designed for maximum interoperability across datasets, existing tools, and model architectures. By providing a simple standardization layer between datasets and model-specific code, MEDS is intended to help make machine learning research for EHR data more reproducible, robust, computationally performant, and collaborative.
Background
The Medical Event Data Standard (MEDS) [1] is a data schema for storing streams of medical events such as those sourced from either Electronic Health Records or claims records. MEDS is intentionally a minimal standard, designed for maximum interoperability across datasets, existing tools, and model architectures. The MEDS schema is simple and scalable and supports a growing suite of tools that can help accelerate the development of machine learning models. Our community is actively expanding the ecosystem to include public and private datasets. Private datasets remain private, but a shared format helps ensure that models trained on private data can be reproduced without friction in other settings, with other machine learning models, and on clinically relevant benchmark tasks. To make it easier to experiment with MEDS and to improve understanding of the format as a whole, we provide an official demo dataset. A comparable conversion of the MIMIC-IV demo to the OMOP (Observational Medical Outcomes Partnership) standard already exists [2, 3].
Methods
We created a pipeline that extracts the MIMIC-IV dataset into the MEDS format. The dataset provided here was generated using this pipeline, specifically v0.0.6 of the MIMIC-IV-MEDS Python package, available on the Python Package Index (PyPI): https://pypi.org/project/MIMIC-IV-MEDS/0.0.6/.
Generating the MEDS demo
The dataset was generated by first downloading the MIMIC-IV demo dataset (v2.2) [4] using the following commands:
pip install MIMIC_IV_MEDS==0.0.6
export DATASET_DOWNLOAD_USERNAME=$PHYSIONET_USERNAME
export DATASET_DOWNLOAD_PASSWORD=$PHYSIONET_PASSWORD
export ROOT_OUTPUT_DIR=path/to/your/desired/directory
The following command was then run to convert the data to the MEDS format:
MEDS_extract-MIMIC_IV root_output_dir=$ROOT_OUTPUT_DIR do_demo=True
The MIMIC-IV-MEDS package uses MEDS-Transforms [6] to manage the data transformation. MEDS-Transforms can be adapted to create ETLs from any dataset to MEDS (see, for example, the list at: https://github.com/Medical-Event-Data-Standard#datasets--benchmarks).
While MEDS supports sharding, the demo dataset comprises a single shard due to the limited size of the dataset. Sharding is useful for larger datasets to enable tasks to be parallelized, but is unnecessary in this case.
Data Description
The MIMIC-IV demo dataset, and this MEDS transform of it, contains routinely collected electronic health record data for 100 critical care patients (referred to as "subjects"). The core MEDS dataset is contained within a data folder, which is split into three subsets:
- data/train (80 subjects)
- data/tuning, also known as validation (10 subjects)
- data/held_out, also known as test (10 subjects)
Within each folder there is a Parquet file containing the event-based patient records. The MEDS data schema is simple, comprising only three required columns (subject_id, time, and code) and two optional columns (numeric_value and text_value). These data are supplemented by three metadata files:
- metadata/codes.parquet: details on the unique codes used in the dataset and their parent codes.
- metadata/dataset.json: general information about the ETL and dataset.
- metadata/subject_splits.parquet: specifies which subjects appear in the training, tuning, and held-out subsets.
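As a rough illustration of the layout and column list described above, the following Python sketch checks a local copy for the expected folders and files and reports which required columns are absent from a given column list. The root folder name and the helper names (expected_paths, missing_required) are hypothetical and not part of any MEDS tool.

```python
from pathlib import Path

# Column names as listed in the description above.
REQUIRED_COLUMNS = ("subject_id", "time", "code")
OPTIONAL_COLUMNS = ("numeric_value", "text_value")

# Split folders and metadata files as listed in the description above.
SPLITS = ("train", "tuning", "held_out")
METADATA_FILES = ("codes.parquet", "dataset.json", "subject_splits.parquet")


def expected_paths(root):
    """Return the split folders and metadata files expected under `root`."""
    root = Path(root)
    split_dirs = [root / "data" / split for split in SPLITS]
    metadata = [root / "metadata" / name for name in METADATA_FILES]
    return split_dirs + metadata


def missing_required(columns):
    """Return the required columns absent from `columns`, in schema order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Hypothetical local root; adjust to wherever the dataset was downloaded.
for path in expected_paths("mimic-iv-demo-meds"):
    print(path, "found" if path.exists() else "missing")

# A column list without `time` fails the check.
print(missing_required(["subject_id", "code", "numeric_value"]))  # ['time']
```

A check of this kind can run before any Parquet library is loaded, which makes it a cheap first test after downloading.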
codes.parquet
This file contains metadata about the code vocabulary featured in the data files. It contains the following three columns:
- code: The code value, of type string.
- description: An optional free-text, human-readable description of the code, of type string.
- parent_codes: An optional list of links to parent codes in this dataset or external ontology nodes associated with this code, of type list[string].
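To show how the three columns relate, here is a small Python sketch that indexes rows shaped like codes.parquet and walks the parent_codes links. The rows and code strings are made up for illustration and do not come from the dataset; in practice the rows would be read from the Parquet file. The function names are likewise hypothetical.

```python
from collections import deque

# Made-up rows shaped like codes.parquet; the code strings are illustrative only.
rows = [
    {"code": "EXAMPLE//C", "description": "Example leaf", "parent_codes": ["EXAMPLE//A"]},
    {"code": "EXAMPLE//A", "description": "Example code A", "parent_codes": ["EXAMPLE//ROOT"]},
    {"code": "EXAMPLE//B", "description": None, "parent_codes": None},
    {"code": "EXAMPLE//ROOT", "description": "Example root", "parent_codes": []},
]


def build_lookup(rows):
    """Index rows by code, replacing missing optional fields with empty values."""
    return {
        row["code"]: {
            "description": row.get("description") or "",
            "parent_codes": list(row.get("parent_codes") or []),
        }
        for row in rows
    }


def ancestors(code, lookup):
    """Collect every parent reachable from `code`, breadth-first, without repeats.

    Parents that are not rows in the lookup (for example external ontology
    nodes) are still reported, but cannot be expanded further.
    """
    seen = []
    queue = deque(lookup.get(code, {}).get("parent_codes", []))
    while queue:
        parent = queue.popleft()
        if parent in seen:
            continue
        seen.append(parent)
        queue.extend(lookup.get(parent, {}).get("parent_codes", []))
    return seen


lookup = build_lookup(rows)
print(ancestors("EXAMPLE//C", lookup))  # ['EXAMPLE//A', 'EXAMPLE//ROOT']
```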
dataset.json
This file contains metadata about the dataset itself, including the following:
- dataset_name: The name of the dataset, of type string.
- dataset_version: The version of the dataset, of type string. Ensuring the version numbers used are meaningful and unique is important for reproducibility, but is ultimately not enforced by the MEDS schema and is left to the dataset creator.
- etl_name: The name of the ETL process used to generate the dataset, of type string.
- etl_version: The version of the ETL process used to generate the dataset, of type string.
- meds_version: The version of the MEDS standard used to generate the dataset, of type string.
- created_at: The timestamp at which the dataset was created, of type string in ISO 8601 format (note that this is not an official timestamp type, but rather a string representation of a timestamp, as this is a JSON file).
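Because created_at is stored as a string, a reader has to convert it explicitly. The sketch below parses a JSON document with the fields listed above using only the Python standard library. The field values are placeholders chosen for illustration, so the actual contents of dataset.json may differ, and load_dataset_metadata is a hypothetical helper name.

```python
import json
from datetime import datetime

# Illustrative contents only; the values in the real dataset.json may differ.
EXAMPLE_JSON = """
{
  "dataset_name": "EXAMPLE_DATASET",
  "dataset_version": "0.0.1",
  "etl_name": "EXAMPLE_ETL",
  "etl_version": "0.0.6",
  "meds_version": "0.3.3",
  "created_at": "2025-05-09T12:00:00+00:00"
}
"""


def load_dataset_metadata(text):
    """Parse dataset.json text and convert created_at to a datetime when present."""
    metadata = json.loads(text)
    created_at = metadata.get("created_at")
    if created_at is not None:
        # ISO 8601 string -> timezone-aware datetime object.
        metadata["created_at"] = datetime.fromisoformat(created_at)
    return metadata


metadata = load_dataset_metadata(EXAMPLE_JSON)
print(metadata["dataset_name"], metadata["dataset_version"])
print(metadata["created_at"].isoformat())
```

For a file on disk, the same function could be applied to the text returned by reading metadata/dataset.json.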
subject_splits.parquet
This file maps subject IDs to pre-defined splits of the dataset, such as training, hyperparameter tuning, and held-out sets. In the MEDS splits file, each row contains a subject_id (int64) column and a split (string) column, where split is the name of the split to which that subject belongs. For the three canonical AI/ML splits, MEDS uses the following split names:
- train: The training split. This data can be used for any purpose during model building; in supervised training, labels over this split will be visible to the model.
- tuning: The hyperparameter tuning split. This split is sometimes called the "dev" or "val" split in other contexts. This data can be used for tuning hyperparameters or for training of the final model, but should not be used for final evaluation of model performance. Users may choose to merge this with the training split and then re-shuffle themselves if they need more splits or a different split ratio. Not all datasets will specify this split, as it is optional.
- held_out: The final evaluation held-out split. This data should not be used for training or tuning, and should only be used for final evaluation of model performance. This split is sometimes called the "test" split in other contexts. No data about these patients should be assumed to be available during data pre-processing, training, or tuning.
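The split semantics above can be sketched in plain Python: count subjects per split, confirm that no subject is assigned to two splits, and route events to the split of their subject. The rows below are made up and only mimic the shape of subject_splits.parquet and the event files. The function names and the "unassigned" bucket are assumptions introduced for this example.

```python
from collections import Counter, defaultdict

# Made-up rows shaped like subject_splits.parquet (subject_id: int, split: str).
subject_splits = [
    {"subject_id": 101, "split": "train"},
    {"subject_id": 102, "split": "train"},
    {"subject_id": 103, "split": "tuning"},
    {"subject_id": 104, "split": "held_out"},
]

# Made-up events; only the columns needed for the example are included.
events = [
    {"subject_id": 101, "code": "EXAMPLE//A"},
    {"subject_id": 103, "code": "EXAMPLE//B"},
    {"subject_id": 104, "code": "EXAMPLE//A"},
    {"subject_id": 101, "code": "EXAMPLE//B"},
]


def split_sizes(rows):
    """Count how many subjects are assigned to each split."""
    return Counter(row["split"] for row in rows)


def subject_to_split(rows):
    """Map subject_id to split, rejecting subjects assigned to two different splits."""
    mapping = {}
    for row in rows:
        subject, split = row["subject_id"], row["split"]
        if mapping.get(subject, split) != split:
            raise ValueError(f"subject {subject} appears in more than one split")
        mapping[subject] = split
    return mapping


def partition_events(events, rows):
    """Group events by the split of their subject, preserving event order."""
    mapping = subject_to_split(rows)
    parts = defaultdict(list)
    for event in events:
        # Events of subjects without a split land in a separate bucket.
        parts[mapping.get(event["subject_id"], "unassigned")].append(event)
    return dict(parts)


print(split_sizes(subject_splits))
print({split: len(items) for split, items in partition_events(events, subject_splits).items()})
```

Keeping the held_out partition untouched until final evaluation follows the guidance given for that split.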
For more information on the MEDS data structure, we suggest referring to the MEDS documentation [7].
Usage Notes
The data is in the .parquet format. Use a compatible library, such as PyArrow or Polars for Python, to read the data. Alternatively, you can inspect the data with any Parquet reader or viewer. In addition, there is a growing set of MEDS-compliant tools, such as:
- MEDS-Reader: A software package for efficient EHR processing [8].
- MEDS-Transforms: A set of functions and scripts for transforming data to and from MEDS [6].
- MEDS-Tab: A software package designed to automatically tabularize and prepare MEDS data [9].
- MEDS-Inspect: A software package to interactively inspect MEDS data [10].
- ACES: A software package and configuration language for reproducible extraction of task cohorts [11].
For more information, tutorials, and compatible tools, see the MEDS documentation [7].
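As a minimal sketch of reading the data with Polars, one of the libraries mentioned above, the helper below builds a glob pattern for one split folder and passes it to polars.read_parquet. The root path is an assumption, and the file names inside each split folder are not specified here, so the pattern matches any Parquet file. The helper names split_glob and read_split are hypothetical.

```python
from pathlib import Path

# Split folder names as described in the Data Description section.
SPLITS = ("train", "tuning", "held_out")


def split_glob(root, split):
    """Return a glob pattern matching every Parquet file in one split folder."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return str(Path(root) / "data" / split / "*.parquet")


def read_split(root, split):
    """Read all Parquet files of one split into a single Polars DataFrame."""
    # Third-party dependency, imported here so that split_glob stays usable
    # in environments where Polars is not installed.
    import polars as pl

    return pl.read_parquet(split_glob(root, split))
```

With Polars installed and the dataset stored under a local folder, read_split("path/to/dataset", "train") would return the training events as a DataFrame whose columns include subject_id, time, and code.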
Release Notes
Version 0.0.1: Initial release. Added dataset in MEDS 0.3.3 with MIMIC-IV ETL 0.0.6.
Ethics
The MIMIC-IV Medical Event Data Standard (MEDS) demo dataset was derived from the MIMIC-IV Clinical Database Demo (V2.2).
Acknowledgements
We acknowledge the work of the MEDS and MEDS-DEV community:
MEDS: Edward Choi, Ethan Steinberg, Jason A. Fries, Jungwoo Oh, Matthew B. A. McDermott, Michael Wornow, Nigam H. Shah, Patrick Rockenschaub, Robin P. van de Water, Tom J. Pollard
MEDS-DEV: Teya S. Bergamaschi, Jeffrey N. Chiang, Edward Choi, Young Sang Choi, Jason A. Fries, Jack Gallifant, Raffaele Giancotti, Xinzhuo Jiang, Hyewon Jeong, Vincent Jeanselme, Shalmali Joshi, Alistair Johnson, Apara Kashyap, Kiril V. Klein, Aleksia Kolo, Yuta Kobayashi, Ryan C. King, Simon A. Lee, Yanwei Li, Matthew B. A. McDermott, Maria E. Montgomery, Mikkel Odgaard, Jungwoo Oh, Nassim Oufattole, Chao Pang, Tom J. Pollard, Pawel Renc, Patrick Rockenschaub, Nigam H. Shah, Martin Sillesen, Ethan Steinberg, Kamilė Stankevičiūtė, Robin P. van de Water, Michael Wornow, Justin Xu, Mads Nielsen.
Conflicts of Interest
No conflicts of interest to report.
References
1. MEDS Working Group: Arnrich, B., Choi, E., Fries, J. A., McDermott, M. B. A., Oh, J., Pollard, T., Shah, N., Steinberg, E., Wornow, M., & van de Water, R. (2024). Medical Event Data Standard (MEDS): Facilitating machine learning for health. In ICLR 2024 Workshop on Learning from Time Series For Health. https://openreview.net/forum?id=IsHy2ebjIG
2. Kallfelz, M., Tsvetkova, A., Pollard, T., Kwong, M., Lipori, G., Huser, V., Osborn, J., Hao, S., & Williams, A. (2021). MIMIC-IV demo data in the OMOP Common Data Model (version 0.9). PhysioNet. https://doi.org/10.13026/p1f5-7x35
3. Hripcsak, G., Duke, J. D., Shah, N. H., et al. (2015). Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud Health Technol Inform. 216:574-8. PMC4815923. https://pubmed.ncbi.nlm.nih.gov/26262116/
4. Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV Clinical Database Demo (version 2.2). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/dp1f-ex47
5. https://github.com/Medical-Event-Data-Standard/MIMIC_IV_MEDS [Accessed 2025-07-08]
6. https://github.com/mmcdermott/MEDS_extract/ [Accessed 2025-07-08]
7. MEDS website: https://medical-event-data-standard.github.io/
8. MEDS Reader: https://meds-reader.readthedocs.io/en/latest/
9. MEDS Tab: https://meds-tab.readthedocs.io/en/latest/
10. MEDS Inspect: https://github.com/rvandewater/MEDS-Inspect
11. Xu, J., Gallifant, J., Johnson, A. E. W., & McDermott, M. B. A. (2025). ACES: Automatic Cohort Extraction System for Event-Stream Datasets. https://arxiv.org/abs/2406.19653
Parent Projects
Access
Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.
License (for files):
Open Data Commons Open Database License v1.0
Discovery
DOI (version 0.0.1):
https://doi.org/10.13026/t2y8-ea41
DOI (latest version):
https://doi.org/10.13026/y9xz-1347
Topics:
ehr
critical care
electronic health record
mimic
machine learning
meds
medical event data standard
Project Website:
https://github.com/Medical-Event-Data-Standard
Files
Access the files
- Download the files using your terminal:
wget -r -N -c -np https://physionet.org/files/mimic-iv-demo-meds/0.0.1/
Name | Size | Modified |
---|---|---|
codes.parquet | 381.0 KB | 2025-05-09 |
dataset.json | 184 B | 2025-05-09 |
subject_splits.parquet | 1.3 KB | 2025-05-09 |