Database Credentialed Access

MIMIC-IV-Ext Clinical Decision Making: A MIMIC-IV Derived Dataset for Evaluation of Large Language Models on the Task of Clinical Decision Making for Abdominal Pathologies

Paul Hager, Friederike Jungmann, Daniel Rueckert

Published: May 17, 2024. Version: 1.0

When using this resource, please cite:
Hager, P., Jungmann, F., & Rueckert, D. (2024). MIMIC-IV-Ext Clinical Decision Making: A MIMIC-IV Derived Dataset for Evaluation of Large Language Models on the Task of Clinical Decision Making for Abdominal Pathologies (version 1.0). PhysioNet.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Clinical decision making is one of the most impactful parts of a physician's responsibilities and stands to benefit greatly from AI solutions such as large language models (LLMs). However, while many datasets exist to test the performance of AI models on constructed case vignettes, such as medical licensing exams, these tests fail to assess many skills that are necessary for deployment in a realistic clinical decision making environment. To understand how useful LLMs are in real-world settings, we must evaluate them in the wild, i.e. on real-world data under realistic conditions. To address this need, we have created a curated dataset based on the MIMIC-IV database, spanning 2400 real patient cases and four common abdominal pathologies: appendicitis, cholecystitis, diverticulitis, and pancreatitis. Each patient case contains the filtered and curated information necessary to arrive at the delivered diagnosis of the physician and can be used in an interactive manner to test the information gathering, synthesizing, and diagnostic capabilities of AI models.


LLMs have generated much excitement in the medical community due to their excellent performance on medical licensing exams [1-5]. However, while these tests are well suited to evaluate the general medical knowledge encoded within a model, they fail to capture the complexities and uncertainties of daily clinical decision making. They typically provide all necessary information to complete the task up-front and often use artificial vignettes to test specific knowledge. Clinical decision making, on the other hand, is defined as "a contextual, continuous, and evolving process, where data are gathered, interpreted, and evaluated in order to select an evidence-based choice of action" [6].

To address the weaknesses of current benchmarks, we have constructed a dataset using the MIMIC-IV (v. 2.2) [7-9] database, specifically the hospital module, that can be used to explicitly test the autonomous clinical decision making capabilities of LLMs on real-world data. Using the dataset, we can simulate a realistic clinical decision making scenario in which LLMs request and synthesize information from available diagnostic exams to come to a diagnosis. The model is presented with the patient's history of present illness and then iteratively requests additional diagnostic information until it is confident enough to make a diagnosis. The available diagnostic exams encompass the essential clinical modalities, including physical examinations, laboratory events, and imaging. Requests for diagnostic exams are parsed and interpreted to serve the appropriate data from the patient's electronic health record, which has been extracted, compartmentalized, and pre-processed in this dataset. The final diagnosis and treatment can then be compared to the actual diagnosis and procedures performed on the patient, and the diagnostic pathway taken by the LLM can be compared with those from established guidelines. On the whole, this closely simulates a clinician's responsibilities during clinical decision making, including information gathering, continuous evaluation of the evidence, differential diagnosis, assessment of severity, and treatment planning.


The data was extracted from the MIMIC-IV hospital module and processed using the code from [10]. The dataset covers four abdominal pathologies: appendicitis, cholecystitis, diverticulitis, and pancreatitis. To filter for these diagnoses, we used the ICD codes and the discharge diagnosis in the discharge summary. Only patient cases whose primary diagnosis, i.e. the first diagnosis listed, is one of the four pathologies were included. Patients diagnosed with more than one of the four pathologies were removed.
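The inclusion criteria above can be sketched as a simple check over an ordered list of diagnosis labels. This is an illustrative helper, not the actual dataset-creation code in [10], which works on the ICD tables and discharge summaries directly:

```python
TARGET_PATHOLOGIES = {"appendicitis", "cholecystitis", "diverticulitis", "pancreatitis"}

def is_eligible(ordered_diagnoses):
    """Return True only if the first-listed (primary) diagnosis is one of the
    four target pathologies and no second target pathology is present."""
    if not ordered_diagnoses:
        return False
    matches = [d for d in ordered_diagnoses if d in TARGET_PATHOLOGIES]
    return ordered_diagnoses[0] in TARGET_PATHOLOGIES and len(matches) == 1
```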

The discharge summary of each patient was then split into individual sections to extract the history of present illness (HPI) and physical examination. Samples in which the pathology is explicitly mentioned in the HPI were removed, as these are usually transfers where the diagnosis was already made before admission, making the data unsuitable for the clinical decision making task.
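The leakage check described above amounts to a case-insensitive search for the pathology names in the HPI text; a minimal sketch (the actual cleaning in [10] may be more involved):

```python
PATHOLOGIES = ("appendicitis", "cholecystitis", "diverticulitis", "pancreatitis")

def hpi_mentions_pathology(hpi_text):
    """Return True if any of the four pathologies is named in the HPI,
    in which case the sample would be excluded."""
    text = hpi_text.lower()
    return any(p in text for p in PATHOLOGIES)
```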

Only the first recorded instance of each laboratory and microbiology test was included, to simulate a therapy-naive situation. Tests up to one day before admission were also included, provided the test was not associated with any other hospital admission. All tests were additionally mapped to common synonyms and alternative names to allow for better matching during the clinical decision making task.
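Keeping only the earliest recorded event per admission and test can be sketched as follows. The field names (hadm_id, itemid, charttime) follow MIMIC conventions, but the exact filtering logic lives in the dataset-creation code [10]:

```python
def first_instance(events):
    """Keep only the earliest event (by charttime) per (hadm_id, itemid)
    pair, simulating a therapy-naive situation."""
    events = sorted(events, key=lambda e: e["charttime"])
    seen = {}
    for e in events:
        key = (e["hadm_id"], e["itemid"])
        if key not in seen:  # dicts preserve insertion order
            seen[key] = e
    return list(seen.values())
```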

All radiology reports up to 24 hours before admission were included, analogous to the laboratory tests. Using the provided exam name, we assigned each report an anatomical region and an imaging modality. Special instances were mapped for easier matching, e.g. CTU to CT and MRCP to MRI. Radiology reports were split into their individual sections and only the findings section was used in the dataset, so as not to include the diagnosis suggested by the radiologist in the conclusions or impressions sections, which would trivialize the task.
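The modality normalization can be sketched as a small lookup table. Only the CTU and MRCP mappings are documented above; the assumption that the modality is the first whitespace-separated token of the exam name is illustrative, and the full mapping in [10] likely covers more cases:

```python
# Documented special-case mappings (the real table may be larger).
SPECIAL_MODALITIES = {"CTU": "CT", "MRCP": "MRI"}

def normalize_modality(exam_name):
    """Derive the imaging modality from an exam name, assuming the modality
    is the leading token, e.g. 'CT ABD & PELVIS WITH CONTRAST' -> 'CT'."""
    token = exam_name.split()[0].upper()
    return SPECIAL_MODALITIES.get(token, token)
```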

All procedures and operations were extracted using ICD codes from the procedures table and parsed from the discharge summaries. These are used to determine whether the treatment recommended by the model is appropriate.

All patients missing any of our primary modalities (HPI, physical examination, laboratory results, imaging) were removed. Further cleaning removed any remaining references to the pathology by replacing them with the MIMIC standard for anonymized information "___". The final dataset includes 2400 unique patients with 957 appendicitis, 648 cholecystitis, 257 diverticulitis, and 538 pancreatitis patients. The dataset contains physical examinations for all patients (2400), 138,788 laboratory results from 480 unique laboratory tests and 4403 microbiology results from 74 unique tests. Furthermore, the dataset contains 5959 radiology reports, including 1836 abdominal CTs, 1728 chest x-rays, 1325 abdominal ultrasounds, 342 abdominal x-rays and 227 MRCPs. Finally, there were 395 unique procedures recorded over all patients, with a total of 2917 ICD procedures plus the 2400 free text procedures specified in the discharge summaries.

Data Description

The data is stored in csv files, primarily keyed on the hadm_id of the patients in the dataset. Fields are named identically to those in MIMIC where possible and applicable.
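Because every file is keyed on hadm_id, the discharge-summary-derived files can be loaded into hadm_id-indexed dictionaries with the standard library alone; a minimal sketch using the file and field names listed below:

```python
import csv

def load_by_hadm_id(path, field):
    """Read one of the dataset's csv files into a dict mapping each
    hadm_id to the requested text field."""
    with open(path, newline="") as f:
        return {row["hadm_id"]: row[field] for row in csv.DictReader(f)}

# e.g. hpi = load_by_hadm_id("history_of_present_illness.csv", "hpi")
```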

New files
history_of_present_illness.csv with the "hpi" field which was parsed from the discharge summary.

physical_examination.csv with the "pe" field which was parsed from the discharge summary.

discharge_diagnosis.csv with the "discharge_diagnosis" field which was parsed from the discharge summary.

discharge_procedures.csv with the "discharge_procedure" field which was parsed from the discharge summary.

lab_test_mapping.csv contains a mapping of the unique lab test labels to their itemids, fluids, and categories. Counts are provided for optional filtering. The corresponding_ids field maps itemids to medically identical tests so that requests for laboratory results can be matched and served based on semantic meaning.
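Resolving a requested test label to every medically identical itemid might look like the sketch below. The exact on-disk format of corresponding_ids is not specified above, so the comma-separated representation here is an assumption for illustration; inspect the real file before relying on it:

```python
def build_label_index(mapping_rows):
    """Map each lab test label (lower-cased) to the set of itemids treated
    as medically identical. Assumes corresponding_ids is a comma-separated
    string of itemids (hypothetical format)."""
    index = {}
    for row in mapping_rows:
        ids = {int(x) for x in str(row["corresponding_ids"]).split(",") if x.strip()}
        index[row["label"].lower()] = ids
    return index
```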

Files primarily constructed using existing tables
laboratory_tests.csv contains fields taken from the labevents.csv and d_labitems.csv files. valuestr is a text representation of the value of the lab result combined with its units (parsed from various fields). To isolate the value alone, split the string on the space character; the first element is the value. Currently, only the first (ordered by charttime) recorded instance of each test per hadm_id is stored.
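The valuestr convention described above can be applied directly; a small helper separating value and units:

```python
def parse_valuestr(valuestr):
    """Split a valuestr such as '7.4 mg/dL' into (value, unit): per the
    description above, the first space-separated element is the value.
    Unit-less results yield an empty unit string."""
    parts = valuestr.split(" ", 1)
    return parts[0], parts[1] if len(parts) > 1 else ""
```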

microbiology.csv contains fields from microbiologyevents.csv. Multiple bacteria belonging to one itemid are combined, and only the first instance of each test is included, as with laboratory_tests.csv.

radiology_reports.csv contains fields from radiology.csv and radiology_detail.csv. The "modality" and "region" fields were extracted from the exam name.

icd_diagnosis.csv and icd_procedures.csv contain fields from diagnoses_icd.csv, d_icd_diagnoses.csv, procedures_icd.csv, and d_icd_procedures.csv.

Usage Notes

This dataset is designed to be used in conjunction with the evaluation framework that has been developed to test LLMs. The framework can be found at [11].

The framework takes the data from the dataset and tests the performance of LLMs on the autonomous clinical decision making task. It serves each individual piece of information only when requested, and awaits a final diagnosis and treatment plan from the model. The dataset can then be used to compare the diagnostic pathway and treatment undertaken by the attending physicians with those suggested by the models.

The dataset can also be used to test the second reader capabilities of LLMs. Code for this is also in [11]. By serving the entire patient case at once, we simulate an environment where the attending physician has already come to a diagnosis and the model is used to provide an unbiased second opinion.

The dataset can be used for other purposes such as testing AI assisted diagnostic workflows. For such cases the code must be adjusted to wait for human input and to possibly provide a continuous list of probable diagnoses to see if this helps physicians arrive at the correct diagnosis.

Limitations to keep in mind: only the first laboratory event of each lab test is currently included, so the development of the disease and the response to treatment are not captured. Furthermore, only patients who received some form of abdominal imaging, laboratory tests, and a physical examination are included, and patients transferred with an already established diagnosis were excluded.


The dataset is a derivative dataset of MIMIC-IV and thus no new patient data was collected. The ethics approval of the dataset follows from that of the parent MIMIC dataset.

Conflicts of Interest

No conflicts of interest.


  1. Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., ... & Tseng, V. (2023). Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLoS digital health, 2(2), e0000198.
  2. Gilson, A., Safranek, C. W., Huang, T., Socrates, V., Chi, L., Taylor, R. A., & Chartash, D. (2023). How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Medical Education, 9(1), e45312.
  3. Shortliffe, E. H., & Sepúlveda, M. J. (2018). Clinical decision support in the era of artificial intelligence. Jama, 320(21), 2199-2200.
  4. Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., ... & Natarajan, V. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172-180.
  5. Nori, H., Lee, Y. T., Zhang, S., Carignan, D., Edgar, R., Fusi, N., ... & Horvitz, E. (2023). Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452.
  6. Tiffen, J., Corbridge, S. J., & Slimmer, L. (2014). Enhancing clinical decision making: development of a contiguous definition and conceptual framework. Journal of professional nursing : official journal of the American Association of Colleges of Nursing, 30(5), 399–405.
  7. Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV (version 2.2). PhysioNet.
  8. Johnson, A.E.W., Bulgarelli, L., Shen, L. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 10, 1 (2023).
  9. Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
  10. MIMIC Clinical Decision Making Dataset Creation Code. Available from: [08.05.2024]
  11. MIMIC Clinical Decision Making Evaluation Framework Code. Available from: [08.05.2024]

Parent Projects
MIMIC-IV-Ext Clinical Decision Making: A MIMIC-IV Derived Dataset for Evaluation of Large Language Models on the Task of Clinical Decision Making for Abdominal Pathologies was derived from MIMIC-IV (version 2.2). Please cite it when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
