MIMIC-III-Ext-Synthetic-Clinical-Trial-Questions
Elizabeth Woo, Michael Craig Burkhart, Emily Alsentzer, Brett Beaulieu-Jones
Published: April 22, 2025. Version: 1.0.0
When using this resource, please cite:
Woo, E., Burkhart, M. C., Alsentzer, E., & Beaulieu-Jones, B. (2025). MIMIC-III-Ext-Synthetic-Clinical-Trial-Questions (version 1.0.0). PhysioNet. https://doi.org/10.13026/30k0-av04.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Large language models (LLMs) show promise for extracting information from clinical notes. Deploying these models at scale can be challenging due to high computational costs, regulatory constraints, and privacy concerns. To address these challenges, synthetic data distillation can be used to fine-tune smaller, open-source LLMs that achieve performance similar to the teacher model. These smaller models can be run on less expensive local hardware or at a vastly reduced cost in cloud deployments.
In our recent study [1], we used Llama-3.1-70B-Instruct to generate synthetic training examples in the form of question-answer pairs along with supporting information. We manually reviewed 1000 of these examples and release them here. These examples can then be used to fine-tune smaller versions of Llama to improve their ability to extract clinical information from notes.
Background
In our article [1], we generated a dataset of questions and answers related to medical records and used it to effectively fine-tune an instance of Llama-3.1-8B [2]. The training set was generated with a larger model, Llama-3.1-70B-Instruct [2], and marked the first step in a process called knowledge distillation [3], wherein larger models with higher resource requirements are used to create datasets for fine-tuning smaller models that require fewer resources to run.
Our primary motivation for sharing this dataset on PhysioNet is to provide other credentialed users access to manually validated question-answer pairs for clinical notes. We demonstrate in our manuscript how these Q&A pairs can be used to fine-tune models so that they better answer clinically relevant questions. Researchers could also use these question-answer pairs as an additional source of validation for their own models. Unlike other question-answer datasets for clinical notes, these questions were generated entirely by a Llama model, and so could provide a unique benchmark for future research.
Methods
For a given patient record in MIMIC-III, we used Llama-3.1-70B-Instruct to generate "different, patient note-specific questions similar to clinical trial eligibility criteria" of a given type. We prompted the model to supply its answers in JSON format, including the question, type, and answer. We also asked the model to supply the section of the note containing the answer, the verbatim source of its answer, a difficulty level for the question on a scale of 1-10, and an explanation of why its answer is correct, including how the source helped to answer the question. We included the following question types: "yes/no", "numeric", "na-yes/no", and "na-numeric", where the "na" types correspond to questions "that cannot be answered relying on the information in the note" but "seem like they would be applicable to this patient and are similar to clinical trial eligibility criteria." For "na" type questions, we stipulated the section to be "Not Found" and the source to be "Not in Note." Specific details for each of the four question types are provided in the table below, followed by an illustrative sketch of the requested output format:
type | occurrence in data | percentage requiring editing | prompt specifics |
---|---|---|---|
boolean | 29.1% | 15.6% | Give a list of 10 total, different, patient note-specific questions similar to clinical trial eligibility criteria. 5 should have 'No' as the correct answer and 5 should have 'Yes' as the answer. |
numeric | 23.6% | 5.8% | Give a list of ten different, realistic, patient note-specific questions similar to clinical trial eligibility criteria with a numeric answer (only generate questions if numeric answers are appropriate, otherwise end the response). All questions should be specific, many numeric values can be listed more than once so make sure to specify first, last, at admission, on discharge, highest, lowest, on a specific date, within ED / ICU etc.. |
na-boolean | 24.1% | 14.8% | Give a list of 5 Yes/No questions that are not answerable using the note. These should be questions which seem like they would be applicable to this patient and are similar to clinical trial eligibility criteria but cannot be answered based on the information in the note. These questions need to be things where the answer cannot be assumed simply because something is not mentioned (e.g., They should not be questions about whether the patient has been diagnosed with serious or chronic diseases because if they were it would be mentioned in the note, since it is not mentioned we can assume the answer is no rather than NA). Do not generate questions where Yes or No is known or can be inferred. |
na-numeric | 23.2% | 13.4% | Give a list of 5 questions asking for numeric answers but where the note does not contain the answer. These should be questions which seem like they would be applicable to this patient and are similar to clinical trial eligibility criteria but cannot be answered based on the information in the note. |
The column "occurrence in data" describes the prevalence of the question type in the 1000 examples, and the column "percentage requiring editing" gives the rate at which questions of that type were edited during the annotation process.
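To make the output format concrete, the block below sketches one generated record in roughly the structure we asked the model to return. The key names and clinical content are illustrative only; they are not verbatim from the prompt or the dataset. For the "na" types, the same structure applies with the section set to "Not Found" and the source set to "Not in Note."

```python
import json

# Illustrative sketch of a single generated record (invented content, assumed key
# names). The prompt asked for the question, its type, the answer, the note section
# containing the answer, the verbatim source text, a 1-10 difficulty, and an
# explanation of why the answer is correct.
example_output = """
{
  "question": "Does the patient have a history of atrial fibrillation?",
  "type": "boolean",
  "answer": "Yes",
  "section": "Past Medical History",
  "source": "PMH: atrial fibrillation, on warfarin",
  "difficulty": 3,
  "explanation": "The past medical history section explicitly lists atrial fibrillation."
}
"""

record = json.loads(example_output)
assert record["type"] in {"boolean", "numeric", "na-boolean", "na-numeric"}
print(record["question"], "->", record["answer"])
```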
The dataset we provide consists of 1000 questions and answers generated using the above process that were then manually reviewed using an open-source annotation tool [4]. The Q&A pairs were selected at random for review from a total set of 42,498 instances. Records for 954 unique subjects in MIMIC-III were included.
Data Description
The csv file 'annotated_synthetic_questions.csv' contains 1000 records from synthetic data generation with schema as follows:
name | description |
---|---|
subject_id | as in MIMIC-III |
hadm_id | as in MIMIC-III |
question | generated string |
answer_available | boolean integer indicating if the question has an answer |
answer | generated string or numerical value, empty if answer_available is 0 |
difficulty | assigned difficulty value on a scale of 1-10 |
text | text of the clinical record |
type | question type, one of {'yes', 'na-bool', 'numeric', 'na-numeric'} |
same_question | boolean integer indicating if the question was manually edited |
same_answer | boolean integer indicating if the answer was manually edited |
changed | boolean integer indicating if the question, answer, or explanation were manually edited |
We've also included a 'data_dictionary.xlsx' file that provides this schema along with an additional column to support type validation.
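As a quick orientation to the schema above, the following minimal sketch loads the file with pandas and checks the documented columns; it assumes the CSV has been downloaded to the working directory.

```python
import pandas as pd

# Load the annotated questions; the path assumes a local copy of the file.
df = pd.read_csv("annotated_synthetic_questions.csv")

# Columns documented in the schema table above.
expected = {
    "subject_id", "hadm_id", "question", "answer_available", "answer",
    "difficulty", "text", "type", "same_question", "same_answer", "changed",
}
assert expected.issubset(df.columns)

# Distribution of question types and the share of questions without an answer in the note.
print(df["type"].value_counts())
print((df["answer_available"] == 0).mean())
```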
Usage Notes
Developing resource-efficient LLMs to extract relevant information from clinical notes is a rapidly advancing discipline with many open questions. For example, there may be better ways to make the distillation process more data-efficient. In this work, we showed how fine-tuning on only a fraction of the synthetic dataset (e.g., the 8B-H-25k configuration) still appreciably enhances the base 8B-Instruct model. Different criteria for selecting a subset of the fine-tuning data may better maintain performance while decreasing data requirements (please see our manuscript for further details).
Code and tools
Code to reproduce the results shared in our manuscript can be found on GitHub [5]. This framework can readily be adapted to fine-tuning models using the question-answer pairs included in this dataset. For the manual review process, a separate open-source software tool was developed, also available on GitHub [4]. This tool facilitates record review from within a web browser: users with minimal technical experience can check patient records against the generated question-answer pairs and refine answers if needed.
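For readers who prefer to build their own pipeline rather than adapt the released framework, the sketch below shows one plausible way to turn rows of the CSV into chat-style fine-tuning examples. The system/user template and output file name are assumptions for illustration, not the exact format used in our code.

```python
import json
import pandas as pd

df = pd.read_csv("annotated_synthetic_questions.csv")

def to_chat_example(row):
    # Hypothetical instruction template; not the verbatim prompt from the released training code.
    answer = str(row["answer"]) if row["answer_available"] == 1 else "Not found in note"
    return {
        "messages": [
            {"role": "system",
             "content": "Answer the question using only the clinical note provided."},
            {"role": "user",
             "content": f"Note:\n{row['text']}\n\nQuestion: {row['question']}"},
            {"role": "assistant", "content": answer},
        ]
    }

# Write one JSON object per line, a common input format for fine-tuning toolkits.
with open("finetune_examples.jsonl", "w") as f:
    for _, row in df.iterrows():
        f.write(json.dumps(to_chat_example(row)) + "\n")
```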
Potential Limitations
Manual review of this data generated by Llama-3.1-70B-Instruct helped to elucidate some failure modes for these models. Errors encountered during the review process were manually corrected; however, some overarching patterns remain in the dataset that may be of interest to potential users.

The model sometimes lacked creativity when generating questions with unspecified answers. To generate questions that could not be answered from the contents of a note, the model commonly inquired about BMI (height and weight measurements are recorded separately from these notes and so are often not contained in the text) and the results of a 6-minute walk test (6MWT). In the full test set containing 42,498 instances, we found 997 questions related to BMI (98.7% of which resolved to "n/a") and 666 questions related to a 6-minute walk test (all of which resolved to "n/a"). The model also asked about measurements taken before the patient sought medical attention, which are typically unavailable in these notes.

Additionally, the models sometimes struggled with repetitive generation. In the test set, we found 1,676 (3.94%) questions containing "creatinine". Admittedly, our prompt for numeric-type questions included the example "What was the patient's highest creatinine measurement recorded in the note?" However, a majority (1,151) of these questions were of na-numeric, boolean, or na-boolean type, and none of those prompts mention creatinine.

We found that carefully worded prompts could help to avoid some of these issues. By rewording questions, we could deter the model from drawing inferences and obtain less ambiguous question-answer pairs. For questions that asked whether a patient had a history of X, where X was not mentioned in the note, the model would sometimes conclude that the patient did not have a history of X because X was not mentioned, and other times conclude that the question could not be definitively answered from the contents of the note. This ambiguity could be resolved by modifying the question to ask whether a patient's history of X could be found in the note. This is especially important because it lets us use a combination of clinical expertise and post-processing to knowingly make assumptions, where appropriate, about whether X would have been documented if the patient had it, rather than having the model make that assumption for us without our knowledge. In a similar vein, questions about the highest recorded value of Y could be reworded to ask about the highest value of Y found in the note. Instead of asking whether a patient's measurement for Z fell within a normal range, we could ask the model to return the patient's measurement for Z and then evaluate whether it fell within the standard reference range as a separate step. This allowed us to avoid having the model reason about ranges of values, a known area of difficulty.
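The frequency checks described above are easy to reproduce on the released file; the sketch below counts questions mentioning a few of the over-represented terms. Note that the counts quoted above were computed on the full 42,498-question set, so numbers on this 1000-question subset will differ.

```python
import pandas as pd

df = pd.read_csv("annotated_synthetic_questions.csv")

# Terms the manual review flagged as over-represented (chosen here for illustration).
for term in ["BMI", "6-minute", "creatinine"]:
    hits = df["question"].str.contains(term, case=False, na=False)
    print(f"{term}: {int(hits.sum())} questions ({100 * hits.mean():.1f}%)")
```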
Release Notes
This version (1.0.0) corresponds to the first release.
Ethics
This dataset is derived from MIMIC and does not link to any external data sources or include any analyses which would enable the re-identification of participants in MIMIC. It therefore falls under the same consent and ethics approvals as the original MIMIC dataset.
All model training and analysis was performed on the Randi high performance computing cluster at the University of Chicago's Center for Research Informatics. Randi is HIPAA-compliant and has been audited and approved for the handling of patient data.
Acknowledgements
This work was funded in part by the National Institutes of Health, specifically the National Institute of Neurological Disorders and Stroke grant number R00NS114850 to BKB. This project would not have been possible without the support of the Center for Research Informatics at the University of Chicago and particularly the High-Performance Computing team. The authors are grateful for the resources and support this team provided throughout the duration of the project. The Center for Research Informatics is funded by the Biological Sciences Division at the University of Chicago with additional funding provided by the Institute for Translational Medicine, CTSA grant number 2U54TR002389-06 from the National Institutes of Health.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
- Woo EG, Burkhart MC, Alsentzer E, Beaulieu-Jones BK (2024). "Synthetic Data Distillation Enables the Extraction of Clinical Information at Scale". medRxiv 2024.09.27.24314517; doi: https://doi.org/10.1101/2024.09.27.24314517
- Dubey A, et al. (2024). "The Llama 3 Herd of Models". arXiv [cs.AI]. Available from: https://doi.org/10.48550/arXiv.2407.21783
- Xu X, et al. (2024). "A Survey on Knowledge Distillation of Large Language Models". arXiv [cs.CL]. Available from: http://arxiv.org/abs/2402.13116
- Beaulieu-Jones BK (2024). "Annotation UI". Available from: https://github.com/bbj-lab/annotation-ui
- Beaulieu-Jones BK. "clinical-synthetic-data-distil." Available from: https://github.com/bbj-lab/clinical-synthetic-data-distil
Access
Access Policy:
Only registered users who sign the specified data use agreement can access the files.
License (for files):
PhysioNet Restricted Health Data License 1.5.0
Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/30k0-av04
DOI (latest version):
https://doi.org/10.13026/rqxd-sz93
Topics:
large language models
synthetic data distillation
clinical trial eligibility