Database Restricted Access
MIMIC-IV-Ext-Apixaban-Trial-Criteria-Questions
Elizabeth Woo , Michael Craig Burkhart , Emily Alsentzer , Brett Beaulieu-Jones
Published: April 30, 2025. Version: 1.0.0
When using this resource, please cite:
Woo, E., Burkhart, M. C., Alsentzer, E., & Beaulieu-Jones, B. (2025). MIMIC-IV-Ext-Apixaban-Trial-Criteria-Questions (version 1.0.0). PhysioNet. https://doi.org/10.13026/4p6q-vb04.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Large language models (LLMs) show promise for extracting information from clinical notes. Deploying these models at scale can be challenging due to high computational costs, regulatory constraints, and privacy concerns. To address these challenges, synthetic data distillation can be used to fine-tune smaller, open-source LLMs that achieve performance similar to the teacher model. These smaller models can be run on less expensive local hardware or at a vastly reduced cost in cloud deployments. In our recent study, we used Llama-3.1-70B-Instruct to generate synthetic training examples in the form of question-answer pairs along with supporting information. We then used these question-answer pairs to fine-tune smaller versions of Llama to improve their ability to extract clinical information from notes.
To evaluate the resulting models, we created 23 questions resembling eligibility criteria from the apixaban clinical trial and evaluated the models on a random sample of 100 patient notes from MIMIC-IV. The MIMIC-IV notes were taken from after 2012 to ensure no overlap with the MIMIC-III notes, which were used to generate the data used to fine-tune the models. We release the 2300 total question-answer pairs as a dataset here.
Background
In our recent article [1], we created 23 boolean and numeric questions resembling eligibility criteria from the 2011 ARISTOTLE trial [2] comparing apixaban to warfarin. Using these questions, we manually annotated notes for 100 patients from MIMIC-IV, taken after 2012.
Our primary motivation for sharing this dataset on PhysioNet is to give other credentialed users access to manually created question-answer pairs for clinical notes. We used these question-answer pairs in our manuscript to evaluate the effectiveness of fine-tuning LLMs for answering questions on clinical notes. In addition to reproducing our results, members of the research community could use these examples as ground truth data for fine-tuning their own models, or as a benchmark dataset for validating LLM performance in the clinical domain.
Methods
After restricting to notes taken after 2012, the 100 patients were selected randomly from MIMIC-IV. A human reviewer validated each of the 2300 question-answer pairs and corrected them if necessary.
There were 23 questions (15 boolean, 8 numeric) answered for each of the 100 patients, giving 2300 question-answer pairs in total. We created the set of questions to model clinical trial inclusion criteria, and asked the same questions for each patient.
The 15 boolean questions were as follows:
- Does the note describe the patient as having atrial fibrillation (afib)? Answer "No" if the note describes the patient as having afib secondary to another reversible cause.
- Does the note describe the patient as ever being diagnosed with depression or major depressive disorder (MDD)? Answer "No" unless the note describes a diagnosis or history of depression.
- Does the note describe the patient as ever being diagnosed with schizophrenia or any schizoaffective disorders? Answer "No" unless the note describes a diagnosis or history of a schizoaffective disorder.
- Does the note describe the patient as ever being diagnosed with bipolar disorder? Answer "No" unless the note describes a diagnosis or history of bipolar disorder.
- Does the note describe the patient as ever having any hemorrhagic tendencies or blood dyscrasias? Answer "No" unless the note describes a diagnosis or history of hemorrhagic tendencies or blood dyscrasias.
- Does the note describe the patient as having a stroke during this admission or within the last month? (Answer "Yes" for any recent stroke if the date is unclear, answer "No" if no stroke is mentioned or a prior stroke occurred but it was not recent.)
- Does the note describe the patient as ever having peptic ulcer disease?
- Does the note describe the patient as having serious bleeding in the past 6 months? Answer "No" unless the note describes a serious recent bleeding issue.
- Does the note describe the patient as having a planned or past ablation procedure for afib? Answer "No" unless the note includes information about a past or planned ablation for afib.
- Does the note describe the patient as ever having valvular disease (stenosis) requiring surgery? Answer "No" if there is mention of stenosis without surgery.
- Does the note describe the patient as having heart failure?
- Does the note describe the patient as having diabetes mellitus (DM1, DM2, T2D, T1DM, T2DM)?
- Does the note describe the patient as having arterial hypertension (high bp e.g. >140, or HTN)? This includes pre-existing hypertension and treated hypertension.
- Does the note describe the patient as ever having a stroke or transient ischemic attack (TIA)? Answer "No" unless the note includes information about the patient having a prior stroke or TIA.
- Does the note describe the patient as being unable to make medical decisions upon discharge? Answer "No" unless there is evidence the patient cannot make their own medical decisions. Answer "Yes" if there is clear mention of dementia or the patient is deceased.
The 8 numeric questions were as follows:
- What is the lowest platelet count (PLT) mentioned in the note? Answer "NA" if no platelet count (PLT) is available in the note.
- What is the highest total bilirubin (TotBili, Bili) mentioned in the note? Answer "NA" if no bilirubin value is available in the note.
- What is the highest aspartate aminotransferase level (AST) mentioned in the note? Answer "NA" if no AST value is available in the note.
- What is the highest serum creatinine (Creat) mentioned in the note? Answer "NA" if no creatinine value is available in the note.
- What is the lowest hemoglobin (HGB) mentioned in the note? Answer "NA" if no HGB value is available in the note.
- What is the highest CHADS2 score mentioned? Answer "NA" if no CHADS2 score is in the note.
- What is the lowest left ventricular ejection (LVEF, ef, ejection fraction) fraction mentioned in the note? Answer "NA" if no LVEF is in the note, Answer 55 if the lowest value is 55%% or greater.
- What is the highest blood glucose lab mentioned? Answer "NA" if no blood glucose score is in the note.
Data Description
The csv file 'annotated_apixaban_combined.csv' contains a header and 2300 rows with the following columns:
column name | description |
---|---|
text | text of the MIMIC note |
note_id | from MIMIC |
hadm_id | from MIMIC |
criterion | question label |
question_type | numeric or boolean |
question | one of the 23 questions listed above |
answer | as determined by manual review |
not_specified | boolean indicating if the question cannot be answered from the contents of the note |
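A minimal sketch of reading the file and checking it against the column list above, using only the Python standard library. The helper name `load_rows`, the field-size limit, and the two example rows are assumptions made for illustration; the example rows are invented placeholders, not MIMIC data.

```python
import csv
import io

# Column names as listed in the table above.
EXPECTED_COLUMNS = [
    "text", "note_id", "hadm_id", "criterion",
    "question_type", "question", "answer", "not_specified",
]

def load_rows(handle):
    """Read annotation rows from an open CSV handle and check the header."""
    # Clinical notes can be long, so raise the per-field limit
    # (the exact value is an arbitrary choice).
    csv.field_size_limit(10**8)
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)

# Invented placeholder rows standing in for the real file.
sample = io.StringIO(
    "text,note_id,hadm_id,criterion,question_type,question,answer,not_specified\n"
    '"Placeholder note, line 1\nline 2",n1,h1,afib,boolean,Q1?,Yes,False\n'
    '"Placeholder note, line 1\nline 2",n1,h1,plt,numeric,Q2?,NA,True\n'
)
rows = load_rows(sample)
n_notes = len({r["note_id"] for r in rows})
n_questions = len({r["question"] for r in rows})
print(len(rows), n_notes, n_questions)
```

With the real file, the handle would come from something like `open("annotated_apixaban_combined.csv", newline="", encoding="utf-8")` (the encoding is an assumption), and the three printed numbers should be 2300, 100, and 23 if the file matches the description.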
Summary statistics for the 15 boolean questions were as follows:
# | Question | Yes, count (%) | No, count (%) |
---|---|---|---|
1 | Does the note describe the patient as having atrial fibrillation (afib)? Answer "No" if the note describes the patient as having afib secondary to another reversible cause. | 71 (71%) | 29 (29%) |
2 | Does the note describe the patient as ever being diagnosed with depression or major depressive disorder (MDD)? Answer "No" unless the note describes a diagnosis or history of depression. | 23 (23%) | 77 (77%) |
3 | Does the note describe the patient as ever being diagnosed with schizophrenia or any schizoaffective disorders? Answer "No" unless the note describes a diagnosis or history of a schizoaffective disorder. | 2 (2%) | 98 (98%) |
4 | Does the note describe the patient as ever being diagnosed with bipolar disorder? Answer "No" unless the note describes a diagnosis or history of bipolar disorder. | 5 (5%) | 95 (95%) |
5 | Does the note describe the patient as ever having any hemorrhagic tendencies or blood dyscrasias? Answer "No" unless the note describes a diagnosis or history of hemorrhagic tendencies or blood dyscrasias. | 18 (18%) | 82 (82%) |
6 | Does the note describe the patient as having a stroke during this admission or within the last month? (Answer "Yes" for any recent stroke if the date is unclear, answer "No" if no stroke is mentioned or a prior stroke occurred but it was not recent) | 16 (16%) | 84 (84%) |
7 | Does the note describe the patient as ever having peptic ulcer disease? | 6 (6%) | 94 (94%) |
8 | Does the note describe the patient as having serious bleeding in the past 6 months? Answer "No" unless the note describes a serious recent bleeding issue. | 20 (20%) | 80 (80%) |
9 | Does the note describe the patient as having a planned or past ablation procedure for afib? Answer "No" unless the note includes information about a past or planned ablation for afib. | 5 (5%) | 95 (95%) |
10 | Does the note describe the patient as ever having valvular disease (stenosis) requiring surgery? Answer "No" if there is mention of stenosis without surgery. | 10 (10%) | 90 (90%) |
11 | Does the note describe the patient as having heart failure? | 53 (53%) | 47 (47%) |
12 | Does the note describe the patient as having diabetes mellitus (DM1, DM2, T2D, T1DM, T2DM)? | 44 (44%) | 56 (56%) |
13 | Does the note describe the patient as having arterial hypertension (high bp e.g. >140, or HTN)? This includes pre-existing hypertension and treated hypertension. | 82 (82%) | 18 (18%) |
14 | Does the note describe the patient as ever having a stroke or transient ischemic attack (TIA)? Answer "No" unless the note includes information about the patient having a prior stroke or TIA. | 19 (19%) | 81 (81%) |
15 | Does the note describe the patient as being unable to make medical decisions upon discharge? Answer "No" unless there is evidence the patient cannot make their own medical decisions. Answer "Yes" if there is clear mention of dementia or the patient is deceased. | 13 (13%) | 87 (87%) |
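Counts like these could be recomputed from the CSV rows with a short tabulation. The sketch below is illustrative only: it assumes the `answer` column stores the strings "Yes" and "No" and that `question_type` stores "boolean" (neither encoding is confirmed here), and the helper names and example rows are invented.

```python
from collections import Counter, defaultdict

def tabulate_boolean(rows):
    """Count answers per criterion for rows whose question_type is 'boolean'."""
    counts = defaultdict(Counter)
    for row in rows:
        if row["question_type"] == "boolean":
            counts[row["criterion"]][row["answer"]] += 1
    return counts

def format_count(n, total):
    """Render a count in the 'n (p%)' style used in the table."""
    return f"{n} ({100 * n / total:.0f}%)"

# Invented placeholder rows, not MIMIC data.
rows = [
    {"criterion": "afib", "question_type": "boolean", "answer": "Yes"},
    {"criterion": "afib", "question_type": "boolean", "answer": "Yes"},
    {"criterion": "afib", "question_type": "boolean", "answer": "No"},
    {"criterion": "afib", "question_type": "boolean", "answer": "Yes"},
    {"criterion": "plt", "question_type": "numeric", "answer": "150"},
]
counts = tabulate_boolean(rows)
for criterion, c in counts.items():
    total = sum(c.values())
    print(criterion, format_count(c["Yes"], total), format_count(c["No"], total))
```

On the real data, `total` should equal 100 for every boolean criterion, since each question was answered once per patient; checking this is a cheap way to catch tabulation mistakes.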
Summary statistics for the 8 numeric questions were as follows:
# | Question | Mean value | Median value | Standard deviation | Range | NAs |
---|---|---|---|---|---|---|
1 | What is the lowest platelet count (PLT) mentioned in the note? Answer "NA" if no platelet count (PLT) is available in the note. | 148.53 | 147.50 | 90.8 | 15-364 | 60 (60%) |
2 | What is the highest total bilirubin (TotBili, Bili) mentioned in the note? Answer "NA" if no bilirubin value is available in the note. | 0.903 | 0.600 | 1.11 | 0.2-6.8 | 33 (33%) |
3 | What is the highest aspartate aminotransferase level (AST) mentioned in the note? Answer "NA" if no AST value is available in the note. | 194.4 | 36.0 | 1049.597 | 8-8627 | 33 (33%) |
4 | What is the highest serum creatinine (Creat) mentioned in the note? Answer "NA" if no creatinine value is available in the note. | 1.586 | 1.200 | 1.199 | 0.5-7.8 | 3 (3%) |
5 | What is the lowest hemoglobin (HGB) mentioned in the note? Answer "NA" if no HGB value is available in the note. | 10.21 | 10.15 | 2.054 | 6.0-15.9 | 2 (2%) |
6 | What is the highest CHADS2 score mentioned? Answer "NA" if no CHADS2 score is in the note. | 3.95 | 3.50 | 1.39 | 1-6 | 80 (80%) |
7 | What is the lowest left ventricular ejection (LVEF, ef, ejection fraction) fraction mentioned in the note? Answer "NA" if no LVEF is in the note, Answer 55 if the lowest value is 55%% or greater. | 47.89 | 50.00 | 14.4 | 20-75 | 53 (53%) |
8 | What is the highest blood glucose lab mentioned? Answer "NA" if no blood glucose score is in the note. | 142.1 | 126.0 | 52.2 | 78-412 | 3 (3%) |
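The numeric summaries can be sketched in the same way. The following assumes that missing values appear as the string "NA" in the `answer` column and that the standard deviation is the sample standard deviation; both are assumptions, and `summarize_numeric` and the example answers are invented for illustration.

```python
import statistics

def summarize_numeric(answers, na_token="NA"):
    """Summarize one numeric question, counting na_token entries separately."""
    values = [float(a) for a in answers if a != na_token]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Sample standard deviation; use statistics.pstdev for the population form.
        "sd": statistics.stdev(values),
        "range": (min(values), max(values)),
        "n_na": len(answers) - len(values),
    }

# Invented placeholder answers, not MIMIC data.
answers = ["150", "NA", "200", "100", "NA"]
summary = summarize_numeric(answers)
print(summary)
```

Note that `statistics.stdev` needs at least two non-missing values, so a question with a very high NA rate may need a guard before it is summarized.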
Usage Notes
This dataset accompanies our manuscript [1], which includes more detailed methods and associated results. In brief, this dataset can be used to help evaluate an LLM's ability to answer clinical questions from notes, a problem that has recently attracted active interest from the research community [4].
Code to evaluate LLMs using this data can be found on GitHub [3].
Known limitations
We used this dataset to evaluate the performance of LLMs in answering questions about clinical notes. During evaluation, we supplied the note and the question to the LLM and compared the LLM's response with the correct answer. For this purpose, the question-answer pairs were adequate. Because each question-answer pair required manual review, we included data for only 100 patients. This is small relative to the full MIMIC-IV dataset and may not be large enough to represent the MIMIC-IV cohort for some applications.
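The comparison step described above can be sketched as follows. This is not the evaluation code from the repository [3]; it is a simplified illustration in which the parsing rules, the tolerance, the function names, and the example responses are all assumptions.

```python
import re

def parse_boolean(response):
    """Return 'Yes' or 'No' if the response starts with that word, else None."""
    words = re.findall(r"[a-z]+", response.lower())
    if words and words[0] in ("yes", "no"):
        return words[0].capitalize()
    return None

def parse_numeric(response):
    """Return 'NA', the first number found in the response, or None."""
    if response.strip().upper() == "NA":
        return "NA"
    # Simplification: takes the first number, which can misfire on
    # responses that mention other numbers before the answer.
    match = re.search(r"-?\d+(?:\.\d+)?", response)
    return float(match.group()) if match else None

def is_correct(question_type, response, reference, tol=1e-6):
    """Compare a model response with the reference answer for one question."""
    if question_type == "boolean":
        return parse_boolean(response) == reference
    predicted = parse_numeric(response)
    if reference == "NA" or predicted in ("NA", None):
        return predicted == reference
    return abs(predicted - float(reference)) <= tol

def accuracy(examples):
    """Fraction of (question_type, response, reference) triples answered correctly."""
    hits = [is_correct(t, resp, ref) for t, resp, ref in examples]
    return sum(hits) / len(hits)

# Invented responses and references, not MIMIC data or real model output.
examples = [
    ("boolean", "Yes, the note mentions afib.", "Yes"),
    ("boolean", "No.", "Yes"),
    ("numeric", "147", "147"),
    ("numeric", "NA", "NA"),
]
print(accuracy(examples))  # 0.75
```

Exact matching on numeric answers is a strict choice; a looser tolerance or unit-aware comparison may be more appropriate depending on how the model formats lab values.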
Release Notes
This version (1.0.0) corresponds to the first release.
Ethics
This dataset is derived from MIMIC and does not link to any external data sources or include any analyses which would enable the re-identification of participants in MIMIC. It therefore falls under the same consent and ethics approvals as the original MIMIC dataset.
All model training and analysis were performed on the Randi high-performance computing cluster at the University of Chicago's Center for Research Informatics. Randi is HIPAA-compliant and has been audited and approved for the handling of patient data.
Acknowledgements
This work was funded in part by the National Institutes of Health, specifically the National Institute of Neurological Disorders and Stroke grant number R00NS114850 to BKB. This project would not have been possible without the support of the Center for Research Informatics at the University of Chicago and particularly the High-Performance Computing team. The authors are grateful for the resources and support this team provided throughout the duration of the project. The Center for Research Informatics is funded by the Biological Sciences Division at the University of Chicago with additional funding provided by the Institute for Translational Medicine, CTSA grant number 2U54TR002389-06 from the National Institutes of Health.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
1. Woo EG, Burkhart MC, Alsentzer E, Beaulieu-Jones BK (2024). "Synthetic Data Distillation Enables the Extraction of Clinical Information at Scale". medRxiv 2024.09.27.24314517. https://doi.org/10.1101/2024.09.27.24314517
2. Granger CB, et al. (2011). "Apixaban versus warfarin in patients with atrial fibrillation". N. Engl. J. Med. 365: 981–992.
3. Beaulieu-Jones BK. "clinical-synthetic-data-distil". Available from: https://github.com/bbj-lab/clinical-synthetic-data-distil
4. Hager P, Jungmann F, Holland R, et al. (2024). "Evaluation and mitigation of the limitations of large language models in clinical decision-making". Nat. Med. 30: 2613–2622.
Access
Access Policy:
Only registered users who sign the specified data use agreement can access the files.
License (for files):
PhysioNet Restricted Health Data License 1.5.0
Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/4p6q-vb04
DOI (latest version):
https://doi.org/10.13026/k40z-bm16
Topics:
clinical q and a evaluation set
clinical trial eligibility