Database Credentialed Access

EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries

Sunjun Kweon Jiyoun Kim Heeyoung Kwak Dongchul Cha Hangyul Yoon Kwang Hyun Kim Jeewon Yang Seunghyun Won Edward Choi

Published: June 26, 2024. Version: 1.0.1

When using this resource, please cite:
Kweon, S., Kim, J., Kwak, H., Cha, D., Yoon, H., Kim, K. H., Yang, J., Won, S., & Choi, E. (2024). EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries (version 1.0.1). PhysioNet.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Discharge summaries in Electronic Health Records (EHRs) are crucial for clinical decision-making, but their length and complexity make information extraction challenging, especially when summaries accumulate across multiple patient admissions. Large Language Models (LLMs) show promise in addressing this challenge by efficiently analyzing vast and complex data. Existing benchmarks, however, fall short of properly evaluating LLMs' capabilities in this context, as they typically focus on single-note information or limited topics and fail to reflect the real-world inquiries posed by clinicians. To bridge this gap, we introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 QA pairs, each linked to a distinct patient's discharge summaries. Every QA pair is initially generated using GPT-4 and then manually reviewed and refined by three clinicians to ensure clinical relevance. EHRNoteQA includes questions that require information from multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.


Previous works have explored patient-specific QA using EHR discharge summaries. [1] introduced emrQA, the first public clinical note QA dataset, generated from expert-annotated question templates and i2b2 annotations [2,3,4,5]. [6] proposed a dataset built on MIMIC-III [7] discharge summaries by first extracting answer evidence and then generating corresponding questions with a neural question generation model. Additionally, [8] and [9] proposed discharge summary-based QA datasets on why-questions and drug-reason relations, respectively. However, these existing datasets fall short of reflecting the complexity and diversity of real-world questions posed by physicians. First, they only address questions based on a single note, neglecting the frequent clinical need to reference multiple discharge summaries for patients with multiple admissions. Second, the question topics are constrained. For example, emrQA is confined to topics within the i2b2 annotations, such as smoking, medication, obesity, and heart disease. Although [6] aimed to increase diversity, [10] noted that 96% of its questions still follow emrQA's templates, indicating a continued limitation. [8] focused solely on why-questions, limiting the scope of topics, while [9] centered on drug reasoning based on n2c2 annotations [11].


We construct EHRNoteQA using discharge summaries from the MIMIC-IV EHR database [12], which contains deidentified patient records from Beth Israel Deaconess Medical Center between 2008 and 2019. The MIMIC-IV database includes 331,794 discharge summaries for 145,915 unique patients, averaging 2.3 notes per patient. Given that a single patient's accumulated discharge summaries average around 8,000 tokens, current LLMs face challenges processing these lengthy texts. To address this, we preprocess the notes to reduce their length by 10% and categorize patients by the total length of their summaries, dividing them into Level 1 (up to 3,000 tokens) and Level 2 (3,000 to 7,000 tokens) groups. We then randomly sample 1,000 patients (550 from Level 1 and 450 from Level 2) for the next step. Using GPT-4 [13] (under the Azure HIPAA-compliant platform [14]), we generate the initial draft of EHRNoteQA by creating clinically meaningful questions, answers, and incorrect answer options based on each patient's discharge summaries. This construction allows LLMs to be evaluated with both open-ended and multiple-choice methods. The generated questions and answers are then reviewed and modified by three clinicians, ensuring that questions are clinically appropriate, answers are accurate, and incorrect options are plausible. Of the initial 1,000 questions, 38 were removed, 206 were revised for clarity, 338 answers were modified, and 966 incorrect answer options were adjusted.
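The length-based level assignment described above can be sketched as follows. This is only an illustration: the tokenizer actually used is not specified here, so a whitespace split stands in as an approximate token count, and the function name is our own.

```python
def assign_level(notes):
    """Bucket a patient by the combined length of their discharge summaries.

    `notes` is the list of one patient's discharge summaries. A whitespace
    split approximates the token count (an assumption; the text does not
    name the tokenizer used).
    """
    total_tokens = sum(len(note.split()) for note in notes)
    if total_tokens <= 3000:
        return "Level1"   # up to 3,000 tokens
    if total_tokens <= 7000:
        return "Level2"   # 3,000 to 7,000 tokens
    return None           # longer patients fall outside both levels
```

Patients returning None would be excluded from sampling under this sketch, consistent with the two-level design.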

Data Description

The EHRNoteQA dataset comprises 962 questions, each paired with the discharge summaries of a distinct patient. As shown in the table below, Level 1 data includes a total of 529 patients: 264 admitted once (one discharge summary) and 265 admitted twice (two discharge summaries). Level 2 data includes a total of 433 patients: 145 admitted once, 144 admitted twice, and 144 admitted three times.

Category | # of Discharge Summaries per Patient | # of Questions | Total # of Discharge Summaries
Level 1 | 1 | 264 | 264
Level 1 | 2 | 265 | 530
Level 2 | 1 | 145 | 145
Level 2 | 2 | 144 | 288
Level 2 | 3 | 144 | 432
Total |   | 962 | 1,659
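The table above can be cross-checked arithmetically: each row contributes one question per patient, and the number of notes it contributes is the number of questions times the notes per patient. For example:

```python
# (level, discharge summaries per patient, number of questions) per table row
rows = [
    ("Level1", 1, 264),
    ("Level1", 2, 265),
    ("Level2", 1, 145),
    ("Level2", 2, 144),
    ("Level2", 3, 144),
]

total_questions = sum(q for _, _, q in rows)   # 264+265+145+144+144 = 962
total_notes = sum(n * q for _, n, q in rows)   # 264+530+145+288+432 = 1,659
```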

We analyze the types of information (topics) addressed by the questions in EHRNoteQA, establishing 10 categories (i.e., treatment, assessment, problem, etiology, sign/symptom, vitals, test results, history, instruction, plan). Each of the 962 questions was categorized manually by the authors; examples and proportions are presented in the table below. A single question can address multiple topics, so the proportions sum to more than 100%.


Question Category | Example | Proportion
Treatment | What was the treatment provided for the patient’s left breast cellulitis? | 64%
Assessment | Was the Mitral valve repair carried out successfully? | 19%
Problem | What was the main problem of the patient? | 19%
Etiology | Why did the patient’s creatinine level rise significantly upon admission? | 20%
Sign/Symptom | What was the presenting symptom of the patient’s myocardial infarction? | 12%
Vitals | What was the range of the patient’s blood pressure during second stay? | 3%
Test Results | What were the abnormalities observed in the patient’s CT scans? | 14%
History | Has the patient experienced any surgical interventions prior to the acute appendicitis? | 12%
Instruction | How was the patient instructed on weight-bearing after his knee replacement? | 3%
Plan | What is the future course of action planned for patient’s left subclavian stenosis? | 5%

The EHRNoteQA dataset file, EHRNoteQA.jsonl, contains 962 records, each representing a unique patient. Each record is a JSON line with the following fields:

  • category : Indicates whether the data is Level 1 or Level 2 (Either Level1, Level2)
  • num_notes : The number of discharge summaries for the patient in MIMIC-IV (Either 1, 2, 3)
  • patient_id : A key that directly links to the discharge summary in MIMIC-IV (the subject_id in MIMIC-IV discharge summary)
  • clinician : The anonymized identifier of the clinician who reviewed and modified the data (Either a, b, c)
  • question : The EHRNoteQA question
  • choice_A, choice_B, choice_C, choice_D, choice_E : The five answer choices
  • answer : The correct answer choice (Either A, B, C, D, E)
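As a minimal usage sketch, the file can be loaded and each record rendered as a five-option multiple-choice prompt. The field names follow the list above; the helper function names are our own, not part of the released code.

```python
import json

def load_ehrnoteqa(path):
    """Read EHRNoteQA.jsonl into a list of per-patient records."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def format_multiple_choice(record):
    """Render one record as a five-option multiple-choice prompt."""
    options = "\n".join(
        f"{letter}. {record[f'choice_{letter}']}" for letter in "ABCDE"
    )
    return f"{record['question']}\n{options}"
```

A model's answer can then be scored by comparing its chosen letter against the `answer` field.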

Usage Notes

Future Direction

The current version of EHRNoteQA is tailored to the context lengths of currently available LLMs, with Level 1 data fitting a 4k-token context and Level 2 data an 8k-token context. As models capable of handling longer contexts are released, we plan to extend EHRNoteQA to include patients with more admissions and longer discharge summaries. Additionally, while EHRNoteQA focuses on discharge summaries due to their frequent use in clinical practice, we recognize the importance of expanding the dataset to other types of notes essential in healthcare settings, such as radiology notes and physician notes.

GitHub Repository for this Project

In our GitHub repository [15], we provide code to preprocess MIMIC-IV discharge summaries (removing excessive white space) and merge them with EHRNoteQA into a single data file.
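The exact preprocessing rules live in the repository [15]; as a rough illustration of the kind of whitespace cleanup involved (our own sketch, not the repository's code):

```python
import re

def clean_note(text):
    """Collapse excessive whitespace in a discharge summary (illustrative)."""
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> one space
    text = re.sub(r"\n{3,}", "\n\n", text)  # runs of blank lines -> one blank line
    return text.strip()
```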


For more detailed information about EHRNoteQA, please refer to our paper [16].

Release Notes

1.0.0 - Initial Release

1.0.1 - Minor Updates and Corrections

     - Added missing question mark
     - Corrected verb singular/plural forms
     - Fixed typo (changed "pateint" to "patient")


Ethics

The authors have no ethics statement to declare.


Acknowledgements

This work was supported by the KAIST-NAVER Hyper-Creative AI Center and by the National Research Foundation of Korea (NRF) grant (NRF-2020H1D3A2A03100945) funded by the Korea government (MSIT).

Conflicts of Interest

The authors have no conflicts of interest to declare.


  1. Pampari, A., Raghavan, P., Liang, J., & Peng, J. (2018). emrqa: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732.
  2. Uzuner, Ö. (2009). Recognizing obesity and comorbidities in sparse data. Journal of the American Medical Informatics Association, 16(4), 561-570.
  3. Uzuner, Ö., Goldstein, I., Luo, Y., & Kohane, I. (2008). Identifying patient smoking status from medical discharge records. Journal of the American Medical Informatics Association, 15(1), 14-24.
  4. Uzuner, Ö., Solti, I., & Cadag, E. (2010). Extracting medication information from clinical text. Journal of the American Medical Informatics Association, 17(5), 514-518.
  5. Uzuner, Ö., South, B. R., Shen, S., & DuVall, S. L. (2011). 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5), 552-556.
  6. Yue, X., Zhang, X. F., Yao, Z., Lin, S., & Sun, H. (2021, December). Cliniqg4qa: Generating diverse questions for domain adaptation of clinical question answering. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 580-587). IEEE.
  7. Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. W. H., Feng, M., Ghassemi, M., ... & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific data, 3(1), 1-9.
  8. Fan, J. (2019, June). Annotating and characterizing clinical sentences with explicit why-QA cues. In Proceedings of the 2nd Clinical Natural Language Processing Workshop (pp. 101-106).
  9. Moon, S., He, H., Jia, H., Liu, H., & Fan, J. W. (2023). Extractive clinical question-answering with multianswer and multifocus questions: Data set development and evaluation study. JMIR AI, 2(1), e41818.
  10. Lehman, E., Lialin, V., Legaspi, K. Y., Sy, A. J. R., Pile, P. T. S., Alberto, N. R. I., ... & Szolovits, P. (2022). Learning to ask like a physician. arXiv preprint arXiv:2206.02696.
  11. Henry, S., Buchan, K., Filannino, M., Stubbs, A., & Uzuner, O. (2020). 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association, 27(1), 3-12.
  12. Johnson, A., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet.
  13. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., ... & McGrew, B. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  14. Microsoft Azure: [Accessed 3/25/2024]
  15. EHRNoteQA GitHub repository. Available from:
  16. Kweon, S., Kim, J., Kwak, H., Cha, D., Yoon, H., Kim, K., ... & Choi, E. (2024). EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings. arXiv preprint arXiv:2402.16040.

Parent Projects
EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries was derived from one or more parent projects. Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
