Database Credentialed Access
MIMIC-IV-Ext Cardiac Disease
Published: May 6, 2025. Version: 1.0.0
When using this resource, please cite:
Cao, J., & Zhao, S. (2025). MIMIC-IV-Ext Cardiac Disease (version 1.0.0). PhysioNet. https://doi.org/10.13026/khgm-hc33.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
With the rapid development of generative large language models (LLMs) in natural language processing, their potential in medical applications has become increasingly evident. However, most existing studies rely on exam-style questions or artificially designed cases and lack validation on real patient data. To address this gap, this study uses the MIMIC-IV database to construct a subset, MIMIC-IV-Ext Cardiac Disease, which includes 4,761 patients diagnosed with cardiac diseases. The dataset covers all relevant clinical examinations from admission to discharge, as well as the final diagnoses. Combined with the multi-turn interaction framework we built, these data can be used to test whether LLMs can guide patients through in-hospital examinations. Because the data have been extracted and reorganized from MIMIC-IV, the subset can also support other studies.
Background
A recent survey of 519 articles on the use of LLMs in healthcare found that only 5% used real patient data, while the majority relied on exam questions or expert-designed cases[1]. This finding indicates how little genuine clinical application has been investigated, and it underscores the challenges of acquiring and processing real patient data, such as limited availability and high complexity.
In response to these issues, this study constructs a subset of the MIMIC-IV dataset[2], referred to as MIMIC-IV-Ext Cardiac Disease. There are many types of cardiac disease, each requiring specific examination methods. Cardiac diseases are also often caused by long-term unhealthy lifestyle habits and are frequently accompanied by multiple complications. The resulting complexity of patient information makes these cases a demanding test of whether LLMs can develop a comprehensive understanding of a patient's condition. The dataset includes patient records whose primary diagnosis is one of 20 cardiac diseases. Each record covers the examination information needed throughout the diagnostic process: chief complaint, HPI (History of Present Illness), physical examination findings, laboratory test results, imaging results, and final diagnoses.
Methods
In this study, we developed a subset of the MIMIC-IV dataset, called MIMIC-IV-Ext Cardiac Disease, containing 4,761 patient records with a primary diagnosis of one of 20 cardiac diseases. Each record includes data from admission to discharge, including medical exams, imaging tests, laboratory chemical tests, and microbiological cultures. We used heart disease ICD codes (ICD-10 I20-I25 and I30-I50; ICD-9 410-414 and 420-428) to filter the dataset and extract the corresponding diagnostic results.
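The code filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, and it assumes that codes are stored without a decimal point and that an `icd_version` value (9 or 10) accompanies each code, as in the MIMIC-IV `diagnoses_icd` table.

```python
import re

# Code ranges quoted in the text; the matching logic is an assumption.
ICD10_RANGES = [("I20", "I25"), ("I30", "I50")]
ICD9_RANGES = [(410, 414), (420, 428)]


def is_cardiac_code(icd_code: str, icd_version: int) -> bool:
    """Return True if the code's three-character category is in a listed range."""
    category = icd_code.strip().upper()[:3]
    if icd_version == 10:
        # Only categories of the form I + two digits can fall in the ranges.
        if not re.fullmatch(r"I\d\d", category):
            return False
        return any(low <= category <= high for low, high in ICD10_RANGES)
    if icd_version == 9:
        # ICD-9 V and E codes are not numeric and never match.
        if len(category) < 3 or not category.isdigit():
            return False
        return any(low <= int(category) <= high for low, high in ICD9_RANGES)
    return False
```

Comparing only the first three characters means that a detailed code such as `I5023` is matched through its category `I50`.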
The HPI, physical examination findings, and imaging results were extracted from discharge reports; imaging results were located with a manually created synonym dictionary of imaging test names. Extracting imaging results was complex because imaging test names have many synonyms and report formats vary. To address this, we randomly selected 200 discharge reports and manually labeled all imaging test names to create the synonym dictionary. To ensure accuracy, we first analyzed how laboratory and microbiological results were expressed in the reports and removed those sections before extracting imaging data. Finally, we used the synonym dictionary to extract each patient's imaging test results.
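To make the idea of dictionary-based extraction concrete, here is a small sketch. The dictionary entries, the function name, and the assumed `<test name>: <findings>` line layout are all hypothetical; the actual dictionary was built from 200 labeled reports and the real report formats are more varied.

```python
# Hypothetical synonym dictionary: canonical test name -> observed spellings.
SYNONYMS = {
    "Echocardiogram": ["echo", "tte", "echocardiogram", "transthoracic echo"],
    "Chest X-ray": ["cxr", "chest x-ray", "chest xray"],
}

# Inverted lookup: lowercase synonym -> canonical name.
LOOKUP = {
    synonym.lower(): canonical
    for canonical, synonyms in SYNONYMS.items()
    for synonym in synonyms
}


def extract_imaging(report: str) -> dict:
    """Collect findings from lines of the form '<test name>: <findings>'.

    Lines whose heading is not a known imaging synonym (for example
    laboratory sections) are ignored.
    """
    collected = {}
    for line in report.splitlines():
        name, separator, findings = line.partition(":")
        canonical = LOOKUP.get(name.strip().lower())
        if separator and canonical:
            collected.setdefault(canonical, []).append(findings.strip())
    return {test: " ".join(parts) for test, parts in collected.items()}
```

Mapping every synonym to one canonical name is what allows results for "CXR" and "Chest X-ray" to be counted as the same examination.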
ECG reports were included from the MIMIC-IV-ECG dataset[3]. In addition, imaging results may contain fields such as PRE-CPB, POST-CPB, PRE BYPASS, and POST BYPASS, which indicate that the patient underwent extracorporeal circulation and may have had surgery. To avoid potential data leakage, we excluded records containing these fields during the processing of discharge reports.
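The exclusion rule above could be expressed as a simple pattern check. The sketch below is an assumption about how such a filter might look: the pattern and function names are hypothetical, and the match is deliberately case-sensitive so that ordinary phrases such as "status post bypass" are not treated as field names.

```python
import re

# Field names listed in the text: PRE-CPB, POST-CPB, PRE BYPASS, POST BYPASS.
# The pattern also accepts the hyphen/space variants of each (an assumption).
BYPASS_MARKERS = re.compile(r"\b(?:PRE|POST)[- ](?:CPB|BYPASS)\b")


def mentions_bypass(imaging_text: str) -> bool:
    """Return True if the text contains one of the bypass field markers."""
    return BYPASS_MARKERS.search(imaging_text) is not None


def exclude_bypass_records(records: list, text_field: str = "imaging") -> list:
    """Keep only records whose imaging text has no bypass marker.

    `text_field` is a hypothetical column name for the imaging text.
    """
    return [r for r in records if not mentions_bypass(r[text_field])]
```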
Laboratory results focused on in-hospital tests, which were categorized by bodily fluid and test type. We organized the substance names found in various body fluids and grouped substances in blood into basic test panels, such as complete blood count (CBC), liver function tests, kidney function tests, and cardiac biomarkers. For test results from other fluids like urine, ascitic fluid, and pleural effusion, we categorized them as urinalysis, abdominal paracentesis, and thoracentesis, respectively.
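The grouping logic can be illustrated with two lookup tables. The panel memberships, fluid labels, and function name below are illustrative assumptions rather than the mapping actually used to build the dataset.

```python
# Hypothetical examples of blood analytes and the panel each belongs to.
BLOOD_PANELS = {
    "hemoglobin": "Complete blood count",
    "white blood cells": "Complete blood count",
    "alanine aminotransferase (alt)": "Liver function tests",
    "creatinine": "Kidney function tests",
    "troponin t": "Cardiac biomarkers",
}

# Non-blood fluids map to a single examination group each, as in the text.
# The fluid spellings are assumptions about how the source table labels them.
FLUID_GROUPS = {
    "urine": "Urinalysis",
    "ascites": "Abdominal paracentesis",
    "pleural": "Thoracentesis",
}


def examination_group(fluid: str, label: str):
    """Return the examination group for a test, or None if it is unmapped."""
    fluid_key = fluid.strip().lower()
    if fluid_key == "blood":
        return BLOOD_PANELS.get(label.strip().lower())
    return FLUID_GROUPS.get(fluid_key)
```

Blood tests are keyed by analyte because one fluid feeds several panels, whereas each of the other fluids corresponds to a single procedure.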
The dataset was further cleaned by removing imaging tests with fewer than three cases. The ICD-9 codes were converted to ICD-10 codes using a mapping table[4]. As a result, the dataset contains 4,761 records covering 20 types of heart disease, as listed in the table below. Each record includes the patient’s chief complaint, HPI, physical examination results at admission, at least two imaging examination results, and multiple laboratory test and microbiological culture results.
Disease | ICD-10 Code | Number of cases |
Heart failure | I50 | 1420 |
Acute myocardial infarction | I21 | 1398 |
Chronic ischemic heart disease | I25 | 633 |
Atrial fibrillation and flutter | I48 | 434 |
Nonrheumatic aortic valve disorders | I35 | 229 |
Other diseases of pericardium | I31 | 146 |
Paroxysmal tachycardia | I47 | 128 |
Acute pericarditis | I30 | 81 |
Atrioventricular and left bundle-branch block | I44 | 67 |
Nonrheumatic mitral valve disorders | I34 | 49 |
Acute and subacute endocarditis | I33 | 36 |
Angina pectoris | I20 | 35 |
Other cardiac arrhythmias | I49 | 34 |
Cardiomyopathy | I42 | 31 |
Acute myocarditis | I40 | 16 |
Other acute ischemic heart diseases | I24 | 11 |
Subsequent ST elevation (STEMI) and non-ST elevation (NSTEMI) myocardial infarction | I22 | 5 |
Other conduction disorders | I45 | 5 |
Cardiac arrest | I46 | 2 |
Nonrheumatic tricuspid valve disorders | I36 | 1 |
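The cleaning step mentioned before the table, removing imaging tests with fewer than three cases, might look like the following sketch. The data layout (a mapping from `hadm_id` to a dictionary of imaging test names and results) and the function name are assumptions made for illustration.

```python
from collections import Counter


def drop_rare_tests(records: dict, min_cases: int = 3) -> dict:
    """Remove imaging tests that appear in fewer than `min_cases` records.

    `records` maps hadm_id -> {imaging test name: result text}.
    """
    # Each record contributes at most one count per test name.
    counts = Counter(test for tests in records.values() for test in tests)
    frequent = {test for test, n in counts.items() if n >= min_cases}
    return {
        hadm_id: {t: r for t, r in tests.items() if t in frequent}
        for hadm_id, tests in records.items()
    }
```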
Data Description
Our dataset consists of multiple files, with the core file being heart_diagnoses.csv. It is linked to other files via the hadm_id field. The entire dataset is organized by patient cases as the basic unit. Below are the basic descriptions of each file:
heart_diagnoses.csv
heart_diagnoses.csv is the core file of the entire dataset. Columns such as note_id and subject_id in this file are derived from discharge.csv in the MIMIC-IV-Note dataset. Other files in the dataset are linked to this file via the hadm_id field. The remaining columns hold the patient's history of present illness, chief complaint, and various examination results. The "reports" field contains the machine-generated reports for electrocardiogram (ECG) tests, with different reports separated by a "|".
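Since the "reports" field packs several ECG reports into one string, a reader will usually want to split it. A minimal sketch, with a hypothetical function name and the assumption that an empty or missing value means no reports:

```python
from typing import List, Optional


def split_ecg_reports(reports: Optional[str]) -> List[str]:
    """Split the '|'-separated reports field into individual reports."""
    if not reports:
        return []
    # Strip surrounding whitespace and drop empty fragments.
    return [part.strip() for part in reports.split("|") if part.strip()]
```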
heart_diagnoses_all.csv and heart_diagnoses_all_true.csv
These two files are both extracted and modified from the diagnoses_icd.csv file in the MIMIC-IV dataset. The difference between them lies in the icd_code and long_title fields. In heart_diagnoses_all.csv, the ICD codes only include the first three characters, and the corresponding disease names represent the broader categories of the diseases. In heart_diagnoses_all_true.csv, the ICD codes are not modified.
heart_labevents_examination_group.csv and heart_microbiologyevents.csv
These two files are both extracted from the labevents.csv file in the MIMIC-IV dataset. The only difference is that the heart_labevents_examination_group.csv file includes an additional "examination_group" field, which represents the basic examination group to which the test belongs.
heart_labevents_first_lab.csv and heart_microbiologyevents_first_micro.csv
These two files are excerpts from the previous two files. Since patients may undergo the same examination multiple times during their hospital stay, these files were created to retain only the results from the first occurrence of each examination for each patient.
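The "first occurrence" reduction can be sketched as below. The column names `itemid` and `charttime` follow the MIMIC-IV `labevents` table and are assumed to be present here; the function name is hypothetical, and timestamps are assumed to be ISO-formatted strings so that string comparison matches chronological order.

```python
def first_occurrences(rows, key_fields=("hadm_id", "itemid"), time_field="charttime"):
    """Keep the earliest row for each (admission, test) combination.

    `rows` is an iterable of dicts, for example from csv.DictReader.
    """
    earliest = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        # Replace the stored row only if this one is strictly earlier.
        if key not in earliest or row[time_field] < earliest[key][time_field]:
            earliest[key] = row
    return list(earliest.values())
```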
heart_procedures.csv
This is an extract from the procedures.csv file in the MIMIC-IV dataset.
HPI.json and RAG_data.json
These two files are used in conjunction with our open-source project, serving as the database for the RAG (Retrieval-Augmented Generation) system in the project.
Descriptions of the fields in each table are provided in the dataset's README.md file.
Usage Notes
In this dataset, we extracted and organized results from different types of examinations specifically for cardiac diseases. It can be used with our multi-turn interaction framework to evaluate LLMs in real clinical settings, or independently for tasks such as classification or prediction. Each record is identified by hadm_id, making it fully compatible with other MIMIC-IV datasets. The dataset includes all test results available in MIMIC-IV for each patient, but MIMIC-IV itself may not record every test performed during hospitalization.
This dataset can be used directly or in combination with the code we have open-sourced at [5].
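Because every file shares the hadm_id field, records from the core file can be combined with rows from any other file. The following sketch uses only the standard library; the function names and the key under which linked rows are stored are hypothetical.

```python
from collections import defaultdict


def group_by_hadm_id(rows):
    """Group an iterable of row dicts by their hadm_id value."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["hadm_id"]].append(row)
    return grouped


def attach(core_rows, linked_rows, name):
    """Return core rows, each extended with its linked rows under `name`.

    Core rows without any linked rows receive an empty list.
    """
    grouped = group_by_hadm_id(linked_rows)
    return [{**core, name: grouped.get(core["hadm_id"], [])} for core in core_rows]
```

With the real files, the row dicts could come from `csv.DictReader` applied to heart_diagnoses.csv and, for example, heart_labevents_first_lab.csv.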
Ethics
The dataset is a derivative dataset of MIMIC-IV and thus no new patient data was collected. The ethics approval of the dataset follows from that of the parent MIMIC dataset.
Conflicts of Interest
None to declare.
References
- Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA [Internet]. 2024 Oct 15 [cited 2024 Nov 22]; Available from: https://doi.org/10.1001/jama.2024.21700
- Johnson A, Bulgarelli L, Pollard T, Horng S, Celi LA, Mark R. MIMIC-IV [Internet]. PhysioNet; [cited 2025 Feb 7]. Available from: https://physionet.org/content/mimiciv/1.0/
- Gow B, Pollard T, Nathanson LA, Johnson A, Moody B, Fernandes C, et al. MIMIC-IV-ECG: Diagnostic Electrocardiogram Matched Subset [Internet]. PhysioNet; [cited 2025 Feb 7]. Available from: https://physionet.org/content/mimic-iv-ecg/1.0/
- NBER [Internet]. [cited 2025 Jan 8]. ICD-9-CM to and from ICD-10-CM and ICD-10-PCS Crosswalk or General Equivalence Mappings. Available from: https://www.nber.org/research/data/icd-9-cm-and-icd-10-cm-and-icd-10-pcs-crosswalk-or-general-equivalence-mappings
- AAAjiawei. AAAjiawei/Cardiac-Disease-Diagnosis-Using-Large-Language-Models-in-Real-World-Medical-Scenarios [Internet]. 2025 [cited 2025 Apr 30]. Available from: https://github.com/AAAjiawei/Cardiac-Disease-Diagnosis-Using-Large-Language-Models-in-Real-World-Medical-Scenarios
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/khgm-hc33
DOI (latest version):
https://doi.org/10.13026/6dnk-7r25
Project Website:
http://github.com/AAAjiawei/Cardiac-Disease-Diagnosis-Using-Large-Language-Models-in-Real-World-Medical-Scenarios.
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project