Database Credentialed Access

MIMIC-IV-Ext-Instr: A Dataset of 450K+ EHR-Grounded Instruction-Following Examples

Zhenbang Wu Anant Dadu Mike Nalls Faraz Faghri Jimeng Sun

Published: Sept. 9, 2025. Version: 1.0.0


When using this resource, please cite:
Wu, Z., Dadu, A., Nalls, M., Faghri, F., & Sun, J. (2025). MIMIC-IV-Ext-Instr: A Dataset of 450K+ EHR-Grounded Instruction-Following Examples (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/e5bq-pr14

Additionally, please cite the original publication:

Wu, Z., Dadu, A., Nalls, M., Faghri, F., & Sun, J. (2024). Instruction tuning large language models to understand electronic health records. In The Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. Retrieved from https://openreview.net/forum?id=Dgy5WVgPd2

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

Large language models (LLMs) have shown impressive capabilities in solving a wide range of tasks based on human instructions. However, developing a conversational AI assistant for electronic health record (EHR) data remains challenging due to the lack of large-scale instruction-following datasets. To address this, we present MIMIC-IV-Ext-Instr, a dataset containing over 450K open-ended, instruction-following examples generated using GPT-3.5 on a HIPAA-compliant platform. Derived from the MIMIC-IV EHR database, MIMIC-IV-Ext-Instr spans a wide range of topics and is specifically designed to support instruction-tuning of general-purpose LLMs for diverse clinical applications.


Background

Despite their potential benefits for clinical decision-making and care coordination, EHR systems contribute to physician burnout due to difficulties navigating the user interface, the large volume of data that must be reviewed for each medical decision, and the extra clerical tasks directed to physicians [1, 2, 3]. Advances in LLMs offer an opportunity to streamline EHR processes and ease the load on healthcare providers [4, 5]. However, developing a conversational AI assistant specifically for EHR data remains a significant challenge due to the lack of large-scale instruction-following data.

LLMs are typically fine-tuned on large-scale instruction-following datasets to understand user instructions and perform a variety of tasks [6]. These datasets are created using manually defined templates or with the assistance of LLMs. The construction process requires substantial effort and becomes even more complex when data must be paired with patient EHRs. Thus, most prior work focuses mainly on clinical notes [7, 8, 9], as generating instruction-following data from free text is comparatively straightforward. However, a substantial amount of information exists solely within structured EHR data (e.g., relational tables). Although some question-answering (QA) datasets are based on structured EHR data [10, 11, 12], they mainly focus on factoid extraction and lack alignment with real-world clinical decision-making, which often requires complex reasoning. Moreover, existing datasets are limited in size [13], ranging from thousands to tens of thousands of examples, which is insufficient for effective LLM instruction tuning.


Methods

To address this, we introduce MIMIC-IV-Ext-Instr, a dataset of over 450K EHR-grounded instruction-following examples based on the publicly available MIMIC-IV EHR database [14]. This dataset is divided into the following two parts.

Schema Alignment Subset

A set of 350K QA pairs was constructed from over 100 templates and subsequently paraphrased using GPT-3.5 (we used Azure's HIPAA-compliant platform in accordance with PhysioNet's regulations). These questions query diverse information from the structured EHR data, such as patient demographics, diagnoses, treatment histories, and test results. They are designed to train LLMs to navigate and extract specific information from complex, heterogeneous EHR data.

Specifically, for each type of clinical event, we developed a set of question templates (e.g., “which {measurement_name} performed on the {specimen_name} were abnormal {time_period}?”). These templates query diverse information from patient EHR data in the MIMIC-IV database [14]. Each question template is paired with a manually crafted Python script that extracts the ground-truth answer from the corresponding EHR table. Given a patient's EHR data, we randomly select a template to generate a corresponding question-answer pair (e.g., Q: “Which Blood Gas measurement on the Blood specimen were abnormal at the 650.05 hour?” A: “Calculated Total CO2, pCO2, pO2.”). Because the generated QA pairs all follow fixed templates, which limits their effectiveness for training LLMs to interpret diverse instructions, we leveraged GPT-3.5 to paraphrase the generated QA pairs without altering their meaning (e.g., Q: “Show me the abnormal blood gas measurements at the 650.05 hours?” A: “The calculated total CO2, pCO2, pO2 were abnormal.”). In this way, we generated 350K QA pairs focused on information retrieval. This set of instruction-tuning QA pairs primarily targets the extraction and aggregation of specific factual information from EHR data, serving as a foundational step for enabling LLMs to perform deeper clinical reasoning on EHR data.
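
The sketch below illustrates this template-filling and answer-extraction step under simplifying assumptions: lab events are assumed to be available as a pandas DataFrame with hypothetical columns (category, fluid, label, flag, charttime_hours) resembling a joined view of MIMIC-IV's labevents and d_labitems tables, and only the single abnormal-lab template quoted above is covered. The released code [15] is the authoritative implementation.

```python
# Minimal sketch of one schema-alignment template; the column names are
# hypothetical stand-ins for a joined labevents/d_labitems view, not the released schema.
from typing import Optional, Tuple

import pandas as pd

TEMPLATE = (
    "Which {measurement_name} performed on the {specimen_name} "
    "were abnormal {time_period}?"
)


def abnormal_lab_qa(lab_events: pd.DataFrame, hour: float) -> Optional[Tuple[str, str]]:
    """Fill the template and extract the ground-truth answer from structured lab events."""
    window = lab_events[
        (lab_events["charttime_hours"] == hour) & (lab_events["flag"] == "abnormal")
    ]
    if window.empty:
        return None  # no abnormal results at this time point; another template would be chosen
    question = TEMPLATE.format(
        measurement_name=window["category"].iloc[0],  # e.g., "Blood Gas"
        specimen_name=window["fluid"].iloc[0],        # e.g., "Blood"
        time_period=f"at the {hour} hour",
    )
    answer = ", ".join(window["label"].unique()) + "."  # e.g., "Calculated Total CO2, pCO2, pO2."
    return question, answer
```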

Clinical Reasoning Subset

Another set of 100K QA pairs was generated from discharge summaries with GPT-3.5. Discharge summaries capture the complexities of patient cases and the rationales behind medical decisions. This subset challenges LLMs to go beyond simple fact extraction, engaging in deeper clinical reasoning tasks such as understanding the progression of a patient's condition, predicting possible complications, and suggesting appropriate follow-up actions.

Specifically, we prompted GPT-3.5 to generate questions and answers that resemble those doctors might ask in real-world clinical settings. We also manually created few-shot examples in the prompt to demonstrate how to generate high-quality QA pairs. In this way, we generated another 100K QA pairs to equip the model with clinical reasoning abilities.
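
For illustration only, a minimal sketch of this prompting step is shown below. It assumes the openai Python package (v1.x) pointed at an Azure OpenAI deployment; the endpoint, deployment name, system message, and few-shot example are placeholders rather than the released prompts, which are available in the code repository [15].

```python
# Sketch of few-shot QA generation from a discharge summary via Azure OpenAI.
# Endpoint, deployment name, and prompt text are placeholders, not the released prompts.
from openai import AzureOpenAI  # assumes openai>=1.0

client = AzureOpenAI(
    azure_endpoint="https://<your-hipaa-compliant-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-02-01",
)

SYSTEM = "You are a clinician writing question-answer pairs grounded in the given discharge summary."
FEW_SHOT = [
    {"role": "user", "content": "Discharge summary:\n<example summary>\n\nWrite one clinically meaningful question and its answer."},
    {"role": "assistant", "content": "Q: What likely caused the patient's acute kidney injury?\nA: <example answer>"},
]


def generate_reasoning_qa(discharge_summary: str) -> str:
    """Generate one clinical-reasoning QA pair from a discharge summary."""
    messages = (
        [{"role": "system", "content": SYSTEM}]
        + FEW_SHOT
        + [{"role": "user", "content": f"Discharge summary:\n{discharge_summary}\n\nWrite one clinically meaningful question and its answer."}]
    )
    response = client.chat.completions.create(
        model="gpt-35-turbo",  # Azure deployment name (placeholder)
        messages=messages,
        temperature=0.7,
    )
    return response.choices[0].message.content
```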

Note

The templates, prompts, and code for generating the instruction-following examples are publicly available [15].


Data Description

The MIMIC-IV-Ext-Instr dataset consists of two primary JSONL files containing instruction-following examples (a minimal loading sketch follows the field listings below):

  • qa_event.jsonl:
    • Contains 356,968 question-answer pairs from the schema-alignment subset.
    • Each row is a JSON line comprising:
      • hadm_id: Hospital admission ID
      • q: Question
      • a: Answer
      • event_type: Original MIMIC-IV table from which the question is derived
  • qa_note.jsonl:
    • Contains 113,107 question-answer pairs from the clinical-reasoning subset.
    • Each row is a JSON line comprising:
      • hadm_id: Hospital admission ID
      • q: Question
      • a: Answer
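
The snippet below is a minimal sketch for reading these files; it relies only on the Python standard library and the field names listed above.

```python
import json


def load_jsonl(path: str) -> list[dict]:
    """Read a JSONL file (one JSON object per line) into a list of dicts."""
    with open(path, "r") as f:
        return [json.loads(line) for line in f if line.strip()]


qa_event = load_jsonl("qa_event.jsonl")  # fields: hadm_id, q, a, event_type
qa_note = load_jsonl("qa_note.jsonl")    # fields: hadm_id, q, a

print(qa_event[0]["q"])
print(qa_event[0]["a"])
```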

These files encompass all instruction-following examples used for model training, validation, and testing. The dataset is divided into train, validation, and test splits, provided in the following CSV files (a sketch for assembling a split from these files appears at the end of this section):

  • cohort.csv:
    • Contains the entire patient cohort used for model training, validation, and testing.
    • Columns include:
      • subject_id: Patient ID
      • hadm_id: Hospital admission ID
      • stay_id: ICU stay ID
      • hadm_intime: Hospital admission date and time
      • hadm_outtime: Hospital discharge date and time
      • hadm_los: Length of hospital admission (in hours)
      • stay_intime: ICU admission date and time
      • stay_outtime: ICU discharge date and time
      • stay_los: Length of ICU stay (in hours)
      • len_selected: Total number of events
  • cohort_train.csv:
    • Contains the patient cohort used for model training, with 38,299 patients.
    • It has the same columns as cohort.csv.
  • cohort_val.csv:
    • Contains the patient cohort used for model validation, with 4,782 patients.
    • It has the same columns as cohort.csv.
  • cohort_test.csv:
    • Contains the patient cohort used for model testing, with 4,309 patients.
    • It has the same columns as cohort.csv.

Due to the computational cost of using GPT for model response evaluation, a subset of 100 patients from the test set is provided:

  • The cohort is included in cohort_test_subset.csv.
  • The corresponding QA pairs are included in qa_test_subset.jsonl.
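
The sketch below shows one way to assemble the training split by linking QA pairs to the cohort CSVs. It assumes, based on the file descriptions above, that split membership is determined by the hadm_id values in cohort_train.csv, cohort_val.csv, and cohort_test.csv; please verify this against the released code [15].

```python
# Sketch: link QA pairs to the cohort split files via hadm_id (assumed split key).
import pandas as pd

cohort_train = pd.read_csv("cohort_train.csv")
train_hadm_ids = set(cohort_train["hadm_id"])

qa_event = pd.read_json("qa_event.jsonl", lines=True)
qa_note = pd.read_json("qa_note.jsonl", lines=True)
qa_all = pd.concat([qa_event[["hadm_id", "q", "a"]], qa_note], ignore_index=True)

qa_train = qa_all[qa_all["hadm_id"].isin(train_hadm_ids)]
print(f"{len(qa_train)} training examples across {qa_train['hadm_id'].nunique()} admissions")
```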

Usage Notes

Potential Applications

The MIMIC-IV-Ext-Instr dataset supports a wide range of clinical NLP applications. The schema alignment subset enables training and evaluation of models on structured EHR querying tasks, useful for building EHR copilots and clinical retrieval systems. The clinical reasoning subset challenges models to interpret unstructured discharge summaries, supporting tasks like summarization, complication prediction, and care planning. Together, these subsets provide a foundation for developing instruction-tuned LLMs capable of both precise information extraction and deeper clinical reasoning—advancing the reliability and utility of AI systems in healthcare.

Limitations

The MIMIC-IV-Ext-Instr dataset was generated using hand-crafted templates and augmented with GPT-3.5, which introduces potential sources of error and noise. These inherent inaccuracies in the data generation process could bias model training and distort the understanding of real-world clinical scenarios. As a result, models trained on this dataset must be carefully evaluated for both performance and interpretability before deployment in clinical settings to ensure they support, rather than hinder, medical decision-making.

A significant limitation of this work is the lack of expert validation in both the dataset construction and the evaluation of model outputs. Without domain expert oversight, there is a risk that clinically important nuances may be misrepresented or overlooked. Addressing this in future iterations is essential to ensure the clinical reliability and safety of models trained using instruction-following data.

It is important to emphasize that the primary goal of this work is to expand the scale of instruction-following examples grounded in EHR data, rather than to provide expert-curated annotations. For more rigorous evaluation or safety-critical use cases, one may refer to smaller but expert-validated datasets such as MedAlign [13].

Future work may focus on improving the quality and realism of generated data through more robust prompting strategies, incorporation of expert-in-the-loop processes, and integration of real clinical feedback. Such efforts would help enhance the applicability and trustworthiness of instruction-tuned models in healthcare.

Additional Resources

The accompanying code is publicly available [15]. For more detailed information about MIMIC-IV-Ext-Instr, please refer to our associated publication [16].


Release Notes

1.0.0 - Initial Release


Ethics

We used GPT-3.5 via Azure's HIPAA-compliant platform in accordance with PhysioNet's data usage regulations. All data generation and model training were performed in a secure environment to ensure compliance with privacy standards and to safeguard sensitive health information.


Acknowledgements

This work was supported by NSF awards SCH-2205289, SCH-2014438, and IIS-2034479. We thank the patients and their families who contributed to this research. This research was supported in part by the Intramural Research Program of the National Institute on Aging (NIA) and the National Institute of Neurological Disorders and Stroke (NINDS), both part of the National Institutes of Health, within the Department of Health and Human Services, project number ZIAAG000534.


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Melnick ER, Dyrbye LN, Sinsky CA, Trockel M, West CP, Nedelec L, Tutty MA, Shanafelt T (2020). "The association between perceived electronic health record usability and professional burnout among US physicians". Mayo Clin Proc. 95(3):476–487. PMID: 31735343. doi:10.1016/j.mayocp.2019.09.024.
  2. Tajirian T, Stergiopoulos V, Strudwick G, Sequeira L, Sanches M, Kemp J, Ramamoorthi K, Zhang T, Jankowicz D (2020). "The influence of electronic health record use on physician burnout: Cross-sectional survey". J Med Internet Res. 22(7):e19274. PMID: 32673234. doi:10.2196/19274.
  3. DeChant PF, Acs A, Rhee KB, Boulanger TS, Snowdon JL, Tutty MA, Sinsky CA, Thomas Craig KJ (2019). "Effect of organization-directed workplace interventions on physician burnout: A systematic review". Mayo Clin Proc Innov Qual Outcomes. 3(4):384–408. PMID: 31993558. doi:10.1016/j.mayocpiqo.2019.07.006.
  4. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, et al. (2023). "Large language models encode clinical knowledge". Nature. 620(7972):172–180. PMID: 37438534. doi:10.1038/s41586-023-06291-2.
  5. Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Hou L, Clark K, Pfohl S, Cole-Lewis H, Neal D, et al. (2025). "Toward expert-level medical question answering with large language models". Nat Med. 31:943–950. doi:10.1038/s41591-024-03423-7.
  6. Wei J, Bosma M, Zhao V, Guu K, Yu AW, Lester B, Du N, Dai AM, Le QV (2022). "Finetuned language models are zero-shot learners". In: International Conference on Learning Representations. Available from: https://openreview.net/forum?id=gEZrGCozdqR
  7. Kweon S, Kim J, Kwak H, Cha D, Yoon H, Kim K, Yang J, Won S, Choi E (2024). "EHRNoteQA: An LLM benchmark for real-world clinical practice using discharge summaries". arXiv preprint. arXiv:2402.16040 [cs.CL]. Available from: https://arxiv.org/abs/2402.16040
  8. Lehman E, Lialin V, Legaspi KE, Sy AJ, Pile PT, Alberto NR, Ragasa RR, Puyat CV, Taliño MK, Alberto IR, et al. (2022). "Learning to ask like a physician". In: Proceedings of the 4th Clinical Natural Language Processing Workshop. Seattle, WA: Association for Computational Linguistics; p. 74–86. doi:10.18653/v1/2022.clinicalnlp-1.8. Available from: https://aclanthology.org/2022.clinicalnlp-1.8/
  9. Yue X, Zhang XF, Yao Z, Lin S, Sun H (2021). "CliniQG4QA: Generating diverse questions for domain adaptation of clinical question answering". arXiv preprint. arXiv:2010.16021 [cs.CL]. Available from: https://arxiv.org/abs/2010.16021
  10. Pampari A, Raghavan P, Liang J, Peng J (2018). "emrQA: A large corpus for question answering on electronic medical records". In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics; p. 2357–2368. doi:10.18653/v1/D18-1258. Available from: https://aclanthology.org/D18-1258/
  11. Lee G, Hwang H, Bae S, Kwon Y, Shin W, Yang S, Seo M, Kim JY, Choi E (2022). "EHRSQL: A practical text-to-SQL benchmark for electronic health records". In: Advances in Neural Information Processing Systems. Vol. 35. p. 15589–15601. Available from: https://proceedings.neurips.cc/paper_files/paper/2022/file/643e347250cf9289e5a2a6c1ed5ee42e-Paper-Datasets_and_Benchmarks.pdf
  12. Tang X, Zou A, Zhang Z, Li Z, Zhao Y, Zhang X, Cohan A, Gerstein M (2024). "MedAgents: Large language models as collaborators for zero-shot medical reasoning". In: Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics; p. 599–621. doi:10.18653/v1/2024.findings-acl.33. Available from: https://aclanthology.org/2024.findings-acl.33/
  13. Fleming SL, Lozano A, Haberkorn WJ, Jindal JA, Reis EP, Thapa R, Blankemeier L, Genkins JZ, Steinberg E, Nayak A, et al. (2023). "MedAlign: A clinician-generated dataset for instruction following with electronic medical records". arXiv preprint. arXiv:2308.14089 [cs.CL]. Available from: https://arxiv.org/abs/2308.14089
  14. Johnson AEW, Bulgarelli L, Shen L, et al. (2023). "MIMIC-IV, a freely accessible electronic health record dataset". Sci Data. 10:1. doi:10.1038/s41597-022-01899-x.
  15. Code repository with the templates, prompts, and generation code for MIMIC-IV-Ext-Instr. Available from: https://github.com/zzachw/llemr
  16. Wu Z, Dadu A, Nalls M, Faghri F, Sun J (2024). "Instruction tuning large language models to understand electronic health records". In: Advances in Neural Information Processing Systems. Vol. 37. p. 54772–54786. Available from: https://proceedings.neurips.cc/paper_files/paper/2024/file/62986e0a78780fe5f17b495aeded5bab-Paper-Datasets_and_Benchmarks_Track.pdf

Parent Projects
MIMIC-IV-Ext-Instr: A Dataset of 450K+ EHR-Grounded Instruction-Following Examples was derived from the MIMIC-IV database [14]. Please cite it when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
