Database Credentialed Access
MIMIC-IV-Ext clinical decision support for referral, triage and diagnosis
Published: Oct. 8, 2025. Version: 1.0.2
When using this resource, please cite:
Gaber, F., & Akalin, A. (2025). MIMIC-IV-Ext clinical decision support for referral, triage and diagnosis (version 1.0.2). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/stnm-qx35
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
Accurate medical decision-making is critical for both patients and clinicians. Patients often find it difficult to interpret their symptoms, determine their severity, and select the right specialist to see. At the same time, clinicians face challenges in integrating complex patient data to make timely, accurate diagnoses. Recent advancements in large language models (LLMs) offer the potential to bridge this gap by supporting decision-making for both patients and healthcare providers. To support the development of LLMs in healthcare, we have curated and extended the MIMIC-IV dataset.
We extracted, processed, and extended the MIMIC-IV dataset, which resulted in 9,150 real-world medical cases with various pathologies. From this larger set, 2,200 cases were further refined to meet the specific needs of our task. This extended dataset is designed to evaluate and improve LLMs' ability to assist with triage, specialist referral, and diagnosis, using critical patient information such as the history of present illness, vital signs, and other relevant data. This allows for the development of models that can help both clinicians and patients make better-informed decisions, ultimately improving healthcare delivery and patient outcomes.
Background
Clinical decision-making is a fundamentally complex process that requires clinicians to apply their knowledge and experience [1] while considering numerous factors and integrating vast amounts of data, including symptoms, vital signs, patient medical history, and various examinations, to make accurate and timely diagnoses. The ability to correctly interpret this information and make well-founded decisions is crucial for improving patient outcomes.
Recent advancements in large language models (LLMs) have demonstrated significant potential to transform various fields, including clinical decision support [2, 3]. While LLMs have performed well in structured environments, such as medical licensing exams and clinical vignettes [4, 5], many studies still use simplified formats, such as sets of binary or multiple-choice options testing human competencies within particular domains [6–9]. In real-world clinical decision-making, however, clinicians are frequently faced with vague or unclear symptoms and incomplete information, without predefined options. Instead, they must rely on their clinical judgement and experience to navigate uncertainty and arrive at a diagnosis.
Current LLM research in healthcare focuses on diagnosing specific diseases or particular medical specialties, which has yielded valuable insights [10–16]. To complement these efforts, we created a dataset using MIMIC-IV hosp, MIMIC-IV-ED, and MIMIC-IV-note to test LLM workflows in various clinical decision support tasks such as triage, referral, and diagnosis, simulating real-world decision-making scenarios. The curated dataset also includes supplementary data such as pain level, chief complaint, ICD codes, and test results. It offers the flexibility to evaluate LLMs on broad clinical decision-making tasks or to focus on specific diseases, depending on the research objectives. This allows for a comprehensive evaluation of LLMs in assisting clinicians in real-world, complex medical environments.
Methods
We processed and created our curated dataset, MIMIC-IV-Ext clinical decision support for referral, triage and diagnosis, using the MIMIC-IV-ED dataset (version 2.2) [17, 18], MIMIC-IV-Note (version 2.2) [17, 19], and MIMIC-IV (version 3.1) [17, 20, 21], to support clinical decision-making for referral, triage, and diagnosis in an emergency department (ED) setting.
We started by merging the discharge (199,162 cases), triage (286,704 cases), ed_stays (420,587 cases), diagnostics (785,778 cases), and patients (364,627 subjects) tables on stay_id and subject_id. From the MIMIC-IV-ED dataset, we extracted key clinical and demographic information about each hospital stay. From the triage table, we took the vital signs, such as temperature, heart rate, and other essential vital metrics. We also extracted the acuity, which represents the triage level assigned to each patient based on the severity of their condition upon arrival, and the chief complaint, which is the reason for the patient's presentation to the ED.
From the ed_stays table, we obtained patient-related information, including gender, race, arrival transportation method, and disposition after emergency department care. From the diagnostics table, we extracted the ICD diagnosis codes, versions, titles, and the associated sequence number (seq_num), a pseudo-order of the relevance of the medical diagnoses captured through the ICD codes. We kept only one stay per patient and included only the cases where the sequence number (seq_num) equaled 1, extracting the most relevant medical diagnosis based on the ICD code. From the MIMIC-IV hospital module, we used the patients table to extend the patient demographics with age information.
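As a rough illustration of this step, the sketch below performs the core merge, the seq_num filter, and the acuity filter described next, using pandas. The file paths, table file names, and the use of anchor_age as the age field are assumptions based on the public MIMIC-IV table layouts; the published scripts [22] are the authoritative implementation.

```python
import pandas as pd

# MIMIC-IV-ED and hospital-module tables (paths and file names are placeholders).
ed_stays  = pd.read_csv("ed/edstays.csv.gz")     # gender, race, arrival_transport, disposition
triage    = pd.read_csv("ed/triage.csv.gz")      # vital signs, acuity, chief complaint
diagnosis = pd.read_csv("ed/diagnosis.csv.gz")   # icd_code, icd_version, icd_title, seq_num
patients  = pd.read_csv("hosp/patients.csv.gz")  # anchor_age used as the age field

# Merge the ED tables on subject_id and stay_id, keeping only the most
# relevant ICD diagnosis (seq_num == 1), then add age from the hospital module.
cases = (
    ed_stays
    .merge(triage, on=["subject_id", "stay_id"])
    .merge(diagnosis[diagnosis["seq_num"] == 1], on=["subject_id", "stay_id"])
    .merge(patients[["subject_id", "anchor_age"]], on="subject_id", how="left")
)

# Keep one stay per patient and drop entries with missing acuity.
cases = cases.drop_duplicates(subset="subject_id").dropna(subset=["acuity"])
```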
This merging resulted in 31,150 cases. We then dropped entries with missing acuity, reducing the dataset to 29,129 cases. From MIMIC-IV-Note, the discharge table provides free-text discharge notes, which contain details about the patient's clinical course during their hospital stay. From these free-text notes, we extracted the history of present illness (HPI), the performed tests, medications, and diagnoses for each patient. From the diagnoses text, two separate lists were created: one for the primary diagnosis and, when applicable, another for the secondary diagnoses.
We excluded patients whose notes contained no documented performed tests, resulting in 26,224 cases. Next, we removed records lacking an HPI, resulting in 25,650 cases. Patients with an HPI string length shorter than the 80th percentile of all HPI lengths (between 50 and 2,000 characters) were selected. The same approach was applied to the test descriptions, using the 90th percentile (up to 3,000 characters). Patients with longer strings were excluded to minimize the likelihood of including information beyond the symptoms themselves; this also better reflects the way patients typically describe their symptoms, which is usually brief rather than lengthy. This narrowed the dataset to 18,067 cases. The HPI was further processed to exclude any information beyond the description of the patient's symptoms. Additionally, cases where the HPI mentioned anything related to the emergency department (ED), emergency room, or similar were excluded, yielding 9,326 cases. Patients who died during the encounter were then excluded, leaving 9,306 cases. Finally, we removed any records whose primary or secondary diagnosis fields contained narrative beyond diagnostic codes and limited the total number of diagnoses per case to 15.
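A minimal sketch of the percentile-based length filters and the ED-mention exclusion, assuming the extracted note fields live in string columns named hpi and tests (hypothetical names; the actual pipeline is in [22]):

```python
# Keep cases whose HPI is shorter than the 80th percentile of HPI lengths
# and whose test description is shorter than the 90th percentile.
hpi_len  = cases["hpi"].str.len()
test_len = cases["tests"].str.len()

cases = cases[(hpi_len < hpi_len.quantile(0.80)) &
              (test_len < test_len.quantile(0.90))]

# Exclude cases whose HPI already reveals the care setting.
in_ed = cases["hpi"].str.contains("emergency department|emergency room",
                                  case=False, na=False)
cases = cases[~in_ed]
```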
The final submitted datasets consist of 9,149 unique cases of patients visiting the ED, with each case including vital signs, triage level, patient demographics, ICD information, HPI, tests, medications, and diagnoses, which are divided into primary diagnoses and, where applicable, secondary diagnoses. Additionally, the data include information about pain level, chief complaint, arrival transportation, and disposition. All of these merging, filtering, and extraction steps were carried out entirely via reproducible Python scripts published on GitHub [22], without employing any large language models for data extraction, curation, or transformation.
We extended the MIMIC-IV-Ext dataset by creating a specialty referral table. Using the primary diagnosis list, we applied a large language model (Claude 3.5 Sonnet) to automatically determine the appropriate medical specialty for referral based on each patient's diagnoses.
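The model was accessed in a secure AWS environment (see Ethics). The sketch below shows one way to query Claude 3.5 Sonnet through the boto3 Bedrock converse API; the prompt wording, region, and response handling are illustrative assumptions, not the published prompt.

```python
import boto3

# Bedrock runtime client inside the private AWS environment (region is a placeholder).
client = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # Claude 3.5 Sonnet

def referral_specialty(primary_diagnosis: str) -> str:
    """Ask the model for the medical specialty appropriate for a diagnosis."""
    prompt = (
        "Given the primary diagnosis below, name the single most appropriate "
        f"medical specialty for referral.\n\nDiagnosis: {primary_diagnosis}"
    )
    response = client.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 50, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"].strip()
```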
To validate our LLM-generated specialty referral table, we asked four clinicians to review a subset of the predictions by checking whether the predicted specialty was relevant to the corresponding diagnosis. This subset consisted of 419 unique patient cases, which resulted in 566 specialty assignments because some patients had more than one diagnosis requiring referral. The dataset was divided into two parts, with one part assigned to Clinicians 1 and 2 and the other part assigned to Clinicians 3 and 4. This arrangement allowed for independent verification within each pair, increasing the objectivity of the assessments. Across all reviewers, 81.5% of predictions made by the LLM were rated as correct, and when partially correct and reasonable but suboptimal ratings are included, overall acceptability reached 97.0%.
We assessed interrater agreement within each clinician pair. Using a binary definition of "acceptability," in which correct, partially correct, and reasonable but suboptimal were combined and contrasted with incorrect, agreement was 90.8% for Clinicians 1 and 2 and 98.0% for Clinicians 3 and 4 (mean 94.4%). We also examined a three-level categorization, with ratings defined as correct, acceptable (partially correct or reasonable but suboptimal), and incorrect, which resulted in agreement rates of 79.1% and 69.3% (mean 74.2%). This highlights the variation in human evaluations, likely influenced by differences in experience and individual judgment; the paired evaluation therefore provides a way to capture broader consensus.
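The agreement rates above are plain percent agreement within each clinician pair. A minimal sketch under that assumption, with the rating labels taken from the text:

```python
# Percent agreement between two raters over the same specialty assignments.
ACCEPTABLE = {"correct", "partially correct", "reasonable but suboptimal"}

def binary_agreement(ratings_a, ratings_b):
    """Agreement after collapsing ratings to acceptable vs. incorrect."""
    matches = sum((a in ACCEPTABLE) == (b in ACCEPTABLE)
                  for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

def three_level(rating):
    """Map a raw rating to correct / acceptable / incorrect."""
    if rating == "correct":
        return "correct"
    return "acceptable" if rating in ACCEPTABLE else "incorrect"

def three_level_agreement(ratings_a, ratings_b):
    matches = sum(three_level(a) == three_level(b)
                  for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)
```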
Data Description
The data are maintained as comma-separated value (CSV) files and are centered around the stay_id of the patients; stay_id appears in every file, which makes it possible to link the different files (see the join sketch after the file list below).
vital_signs.csv: This file contains the stay_id, subject_id, and hadm_id. Additionally, the vital signs, also referred to as initial vitals (temperature, heart rate, respiration rate, oxygen saturation, diastolic blood pressure, and systolic blood pressure), are combined into a single text string in which each vital sign is followed by its corresponding value.
patient_demographics.csv: The gender, race, and age of each patient are combined into a single text string.
initial_assessment_info.csv: This file contains the acuity, also referred to as the triage level, the pain level, the chief complaint, arrival transport, disposition, and the ICD information consisting of ICD code, ICD title, and ICD version.
clinical_data.csv: This file consists of the free-text discharge notes and the extracted HPI, tests, medications, and diagnoses. It also includes the lists of primary and secondary diagnoses, as well as the LLM-generated specialty referral.
A MIMIC-IV-Ext dataset was created to evaluate large language model workflows in clinical decision support for referral, triage, and diagnosis. Additionally, three files (triage_level.csv, specialty_referral.csv, diagnosis.csv) were used for the research question, each containing different information drawn from the files above.
All four files below contain the stay_id, the HPI from clinical_data.csv, the vital signs from vital_signs.csv, and the patient information from patient_demographics.csv. Each file includes additional information as follows:
triage_level.csv: This file contains the triage level/acuity from initial_assessment_info.csv.
specialty_referral.csv: This file contains the specialty referral created with an LLM; note that it only includes data for the first 2,200 cases.
specialty_referral_clinician_approved.csv: This file contains the LLM-generated specialty referrals approved by the clinicians. The intersected agreement between the paired evaluations resulted in 331 patient cases with correctly generated specialty referrals.
diagnosis.csv: This file contains the diagnoses text and the primary and secondary diagnoses lists.
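Because every file carries the stay_id, the task files can be joined back to the richer tables case by case. The sketch below assumes plain column names such as icd_title (the actual headers may differ) and also illustrates the disease-specific extraction mentioned in the Usage Notes:

```python
import pandas as pd

triage_level = pd.read_csv("triage_level.csv")
diagnosis    = pd.read_csv("diagnosis.csv")
assessment   = pd.read_csv("initial_assessment_info.csv")

# stay_id appears in every file, so tasks can be combined per case; shared
# columns such as the HPI get a suffix to avoid collisions.
merged = triage_level.merge(diagnosis, on="stay_id", suffixes=("", "_dx"))
merged = merged.merge(assessment[["stay_id", "icd_title"]], on="stay_id")

# Targeted research example: restrict to a single condition via the ICD title.
heart_failure = merged[
    merged["icd_title"].str.contains("heart failure", case=False, na=False)
]
```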
Usage Notes
The MIMIC-IV-Ext dataset was created to test and evaluate large language model workflows in clinical decision support for referral, triage, and diagnosis in the ED. The corresponding paper and framework can be found at [23] and [22].
The framework benchmarked multiple LLM workflows on their ability to predict key aspects of clinical care: the triage level, also referred to as acuity, in the form of the Emergency Severity Index (ESI) [24]; patient-to-medical-specialty referral; and diagnosis. It differentiated between two user types: general users, typically patients, who provide only personal information and symptoms (HPI), and ED clinicians, who can also access vital signs. The framework is designed to handle a wide range of medical conditions rather than focusing on a specific disease.
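As a concrete illustration of the two user types, the hypothetical helper below assembles a model input from the dataset fields; it is not the framework's actual prompt template (see [22] for that).

```python
from typing import Optional

def build_case_prompt(hpi: str, demographics: str,
                      vitals: Optional[str] = None) -> str:
    """Compose a case description for an LLM.

    General users (patients) supply only demographics and symptoms (HPI);
    ED clinicians additionally provide the initial vital signs.
    """
    parts = [f"Patient information: {demographics}",
             f"History of present illness: {hpi}"]
    if vitals is not None:  # clinician view only
        parts.append(f"Initial vital signs: {vitals}")
    parts.append("Predict the ESI triage level (1-5), the referral "
                 "specialty, and the most likely diagnosis.")
    return "\n".join(parts)
```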
This dataset can be further utilized to test LLMs in clinical decision support by including additional patient information, such as medications, tests, pain level, or ICD information. It can also be used to focus on specific diseases by extracting the relevant cases for targeted research.
When using this dataset, there are some limitations that need to be taken into consideration. Each case in the dataset has a sequence number of 1, indicating the most relevant medical diagnosis for that visit. This approach excludes cases with other sequence numbers, potentially omitting additional ICD information that could be clinically significant for understanding the patient’s complete medical profile.
When using the dataset for specific tasks, further post-processing is often required. For the clinical decision support task mentioned above, predicting specialty referral, triage level, and diagnosis, only the first 2,200 cases from the dataset were utilized, and their post-processing steps are detailed in [22]. Additionally, the LLM-created specialty referral was generated only for the first 2,200 cases; the specialty referral dataset therefore consists of only 2,200 cases rather than all 9,150 cases.
Ethics
Ethical approval for this dataset is covered by the existing approval granted for the original MIMIC-IV dataset. As it is derived from MIMIC-IV, no new patient data were collected.
All steps involving LLMs were performed in a secure, isolated AWS environment, using AWS PrivateLink to connect privately to AWS-hosted models.
Acknowledgements
We thank Akalin lab members for comments on the manuscript.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Sutriningsih A, Wahyuni CU, Haksama S. Factors affecting emergency nurses’ perceptions of the triage systems. J. Public Health Res. 2020;9:1808.
- Ma MD, Ye C, Yan Y, Wang X, Ping P, Chang TS, et al. CliBench: Multifaceted evaluation of Large Language Models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions [Internet]. arXiv [cs.CL]; 2024. Available from: http://arxiv.org/abs/2406.09923
- Testolin A. Can neural networks do arithmetic? A survey on the elementary numerical skills of state-of-the-art deep learning models [Internet]. arXiv [cs.AI]; 2023. Available from: http://arxiv.org/abs/2303.07735
- Abbas A, Rehman MS, Rehman SS. Comparing the performance of popular large language models on the National Board of Medical Examiners sample questions. Cureus 2024;16:e55991.
- Brin D, Sorin V, Konen E, Nadkarni G, Glicksberg BS, Klang E. How large language models perform on the United States medical licensing examination: A systematic review [Internet]. 2023. Available from: http://dx.doi.org/10.1101/2023.09.03.23294842
- Eriksen AV, Möller S, Ryg J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI [Internet]; 2023. Available from: http://dx.doi.org/10.1056/aip2300031
- Ueda D, Walston SL, Matsumoto T, Deguchi R, Tatekawa H, Miki Y. Evaluating GPT-4-based ChatGPT's clinical potential on the NEJM quiz [Internet]. medRxiv; 2023. 2023.05.04.23289493. Available from: http://medrxiv.org/content/early/2023/05/05/2023.05.04.23289493.abstract
- Han T, Adams LC, Bressem K, Busch F, Huck L, Nebelung S, et al. Comparative analysis of GPT-4Vision, GPT-4 and open source LLMs in clinical diagnostic accuracy: A benchmark against human expertise [Internet]. medRxiv; 2023. 2023.11.03.23297957. Available from: http://medrxiv.org/content/early/2023/11/08/2023.11.03.23297957.abstract
- Nori H, Lee YT, Zhang S, Carignan D, Edgar R, Fusi N, et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine [Internet]. arXiv [cs.CL]; 2023. Available from: http://arxiv.org/abs/2311.16452
- Mori Y, Izumiyama T, Kanabuchi R, Mori N, Aizawa T. Large language model may assist diagnosis of SAPHO syndrome by bone scintigraphy. Mod. Rheumatol. 2024;34:1043–6.
- Kwon T, Ong KTI, Kang D, Moon S, Lee JR, Hwang D, et al. Large language models are clinical reasoners: Reasoning-aware diagnosis framework with prompt-generated rationales [Internet]. arXiv [cs.CL]2023;Available from: http://arxiv.org/abs/2312.07399
- Daher M, Koa J, Boufadel P, Singh J, Fares MY, Abboud JA. Breaking barriers: can ChatGPT compete with a shoulder and elbow specialist in diagnosis and management? JSES Int. 2023;7:2534–41.
- Madadi Y, Delsoz M, Lao PA, Fong JW, Hollingsworth TJ, Kahook MY, et al. ChatGPT assisting diagnosis of neuro-ophthalmology diseases based on case reports. medRxiv 2023;2023.09.13.23295508.
- Delsoz M, Madadi Y, Munir WM, Tamm B, Mehravaran S, Soleimani M, et al. Performance of ChatGPT in diagnosis of corneal eye diseases. medRxiv 2023;2023.08.25.23294635.
- Sorin V, Kapelushnik N, Hecht I, Zloto O, Glicksberg BS, Bufman H, et al. GPT-4 multimodal analysis on ophthalmology clinical cases including text and images [Internet]. medRxiv; 2023. 2023.11.24.23298953. Available from: https://www.medrxiv.org/content/10.1101/2023.11.24.23298953v1
- Hager P, Jungmann F, Holland R, Bhagat K, Hubrecht I, Knauer M, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 2024;30:2613–22.
- Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 2000;101:E215–20.
- Johnson A, Bulgarelli L, Pollard T, Celi LA, Mark R, Horng S. MIMIC-IV-ED [Internet]. 2023;Available from: http://dx.doi.org/10.13026/5NTK-KM72
- Johnson A, Pollard T, Horng S, Celi LA, Mark R. MIMIC-IV-Note: Deidentified free-text clinical notes [Internet]. 2023;Available from: http://dx.doi.org/10.13026/1N74-NE17
- Johnson A, Bulgarelli L, Pollard T, Gow B, Moody B, Horng S, et al. MIMIC-IV [Internet]. 2024;Available from: http://dx.doi.org/10.13026/HXP0-HG59
- Johnson AEW, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 2023;10:1.
- MIMIC-IV-Ext clinical decision support for referral, triage and diagnosis dataset creation [Internet]. GitHub; [cited 2024 Oct 21]. Available from: https://github.com/BIMSBbioinfo/medLLMbenchmark
- Gaber F, Shaik M, Allega F, Bilecz AJ, Busch F, Goon K, Franke V, Akalin A. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. NPJ Digit Med. 2025;8:263. doi:10.1038/s41746-025-01684-1.
- Gilboy N, Tanabe P, Travers D, Rosenau AM. Emergency Severity Index (ESI): A Triage Tool for Emergency Department Care, Version 4. Agency for Healthcare Research and Quality; 2011.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.2):
https://doi.org/10.13026/stnm-qx35
DOI (latest version):
https://doi.org/10.13026/khtg-kh44
Project Website:
https://github.com/BIMSBbioinfo/medLLMbenchmark
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project