Database Credentialed Access

EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries

Sunjun Kweon Jiyoun Kim Heeyoung Kwak Dongchul Cha Hangyul Yoon Kwang Hyun Kim Jeewon Yang Seunghyun Won Edward Choi

Published: June 26, 2024. Version: 1.0.1

When using this resource, please cite:
Kweon, S., Kim, J., Kwak, H., Cha, D., Yoon, H., Kim, K. H., Yang, J., Won, S., & Choi, E. (2024). EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries (version 1.0.1). PhysioNet.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Discharge summaries in Electronic Health Records (EHRs) are crucial for clinical decision-making, but their length and complexity make information extraction challenging, especially when summaries accumulate across multiple patient admissions. Large Language Models (LLMs) show promise in addressing this challenge by efficiently analyzing vast and complex data. Existing benchmarks, however, fall short of properly evaluating LLMs' capabilities in this context, as they typically focus on single-note information or limited topics and fail to reflect the real-world inquiries posed by clinicians. To bridge this gap, we introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 QA pairs, each linked to a distinct patient's discharge summaries. Every QA pair is initially generated using GPT-4 and then manually reviewed and refined by three clinicians to ensure clinical relevance. EHRNoteQA includes questions that require information from multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.


Previous works have explored patient-specific QA using EHR discharge summaries. [1] introduced emrQA, the first public clinical note QA dataset, generated from expert-annotated question templates and i2b2 annotations [2,3,4,5]. [6] proposed a dataset built on MIMIC-III [7] discharge summaries by first extracting answer evidence and then generating corresponding questions with a neural question generation model. Additionally, [8] and [9] proposed discharge summary-based QA datasets on why-questions and drug-reason relations, respectively. However, these existing datasets fall short of reflecting the complexity and diversity of real-world questions posed by physicians. First, they only address questions based on a single note, neglecting the frequent clinical need to reference multiple discharge summaries for patients with multiple admissions. Second, the question topics are constrained. For example, emrQA is confined to topics within the i2b2 annotations, such as smoking, medication, obesity, and heart disease. Although [6] aimed to increase diversity, [10] noted that 96% of its questions still follow emrQA's templates, indicating a continued limitation. [8] focused solely on why-questions, limiting the scope of topics, while [9] centered on drug reasoning based on n2c2 annotations [11].


We construct EHRNoteQA using discharge summaries from the MIMIC-IV EHR database [12], which contains deidentified patient records from Beth Israel Deaconess Medical Center between 2008 and 2019. The MIMIC-IV database includes 331,794 discharge summaries for 145,915 unique patients, averaging 2.3 notes per patient. Given that a single patient's accumulated discharge summaries average around 8,000 tokens, current LLMs face challenges processing these lengthy texts. To address this, we preprocess the notes to reduce their length by 10% and categorize patients by the total length of their summaries, dividing them into Level 1 (up to 3,000 tokens) and Level 2 (3,000 to 7,000 tokens) groups. We then randomly sample 1,000 patients (550 from Level 1 and 450 from Level 2) for the next step. Using GPT-4 [13] (under the Azure HIPAA-compliant platform [14]), we generate the initial draft of EHRNoteQA by creating clinically meaningful questions, answers, and incorrect answer options based on each patient's discharge summaries. This construction allows LLMs to be evaluated with both open-ended and multiple-choice methods. The generated questions and answers are then reviewed and modified by three clinicians, ensuring that questions are clinically appropriate, answers are accurate, and incorrect options are plausible. Of the initial 1,000 questions, 38 were removed, 206 were revised for clarity, 338 answers were modified, and 966 incorrect answer options were adjusted.
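The length-based level assignment described above can be sketched as follows. This is only an illustration: the tokenizer actually used is not specified here, so a whitespace split stands in as an approximate token count, and the function name is our own.

```python
def assign_level(notes):
    """Bucket a patient by the combined length of their discharge summaries.

    `notes` is the list of one patient's discharge summaries. A whitespace
    split approximates the token count (an assumption; the text does not
    name the tokenizer used).
    """
    total_tokens = sum(len(note.split()) for note in notes)
    if total_tokens <= 3000:
        return "Level1"   # up to 3,000 tokens
    if total_tokens <= 7000:
        return "Level2"   # 3,000 to 7,000 tokens
    return None           # longer patients fall outside both levels
```

Patients returning None would be excluded from sampling under this sketch, consistent with the two-level design.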

Data Description

The EHRNoteQA dataset comprises 962 questions, each paired with the discharge summaries of a distinct patient. As shown in the table below, Level 1 data includes a total of 529 patients: 264 admitted once (one discharge summary) and 265 admitted twice (two discharge summaries). Level 2 data includes a total of 433 patients: 145 admitted once, 144 admitted twice, and 144 admitted three times.

Category | # of Discharge Summaries per Patient | # of Questions | Total # of Discharge Summaries
Level 1 | 1 | 264 | 264
Level 1 | 2 | 265 | 530
Level 2 | 1 | 145 | 145
Level 2 | 2 | 144 | 288
Level 2 | 3 | 144 | 432
Total |   | 962 | 1,659
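The table above can be cross-checked arithmetically: each row contributes one question per patient, and the number of notes it contributes is the number of questions times the notes per patient. For example:

```python
# (level, discharge summaries per patient, number of questions) per table row
rows = [
    ("Level1", 1, 264),
    ("Level1", 2, 265),
    ("Level2", 1, 145),
    ("Level2", 2, 144),
    ("Level2", 3, 144),
]

total_questions = sum(q for _, _, q in rows)   # 264+265+145+144+144 = 962
total_notes = sum(n * q for _, n, q in rows)   # 264+530+145+288+432 = 1,659
```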

We analyze the types of information (topics) addressed by the questions in EHRNoteQA, establishing 10 categories (i.e., treatment, assessment, problem, etiology, sign/symptom, vitals, test results, history, instruction, plan). Each of the 962 questions was categorized manually by the authors; examples and proportions are presented in the table below. A single question can address multiple topics, so the proportions sum to more than 100%.


Question Category | Example | Proportion
Treatment | What was the treatment provided for the patient’s left breast cellulitis? | 64%
Assessment | Was the Mitral valve repair carried out successfully? | 19%
Problem | What was the main problem of the patient? | 19%
Etiology | Why did the patient’s creatinine level rise significantly upon admission? | 20%
Sign/Symptom | What was the presenting symptom of the patient’s myocardial infarction? | 12%
Vitals | What was the range of the patient’s blood pressure during second stay? | 3%
Test Results | What were the abnormalities observed in the patient’s CT scans? | 14%
History | Has the patient experienced any surgical interventions prior to the acute appendicitis? | 12%
Instruction | How was the patient instructed on weight-bearing after his knee replacement? | 3%
Plan | What is the future course of action planned for patient’s left subclavian stenosis? | 5%

The EHRNoteQA dataset file, EHRNoteQA.jsonl, contains 962 records, each representing a unique patient. Each record is a JSON line with the following fields:

  • category : Indicates whether the data is Level 1 or Level 2 (Either Level1, Level2)
  • num_notes : The number of discharge summaries for the patient in MIMIC-IV (Either 1, 2, 3)
  • patient_id : A key that directly links to the discharge summary in MIMIC-IV (the subject_id in MIMIC-IV discharge summary)
  • clinician : The anonymized identifier of the clinician who reviewed and modified the data (Either a, b, c)
  • question : The EHRNoteQA question
  • choice_A, choice_B, choice_C, choice_D, choice_E : The five answer choices
  • answer : The correct answer choice (Either A, B, C, D, E)
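As a minimal usage sketch, the file can be loaded and each record rendered as a five-option multiple-choice prompt. The field names follow the list above; the helper function names are our own, not part of the released code.

```python
import json

def load_ehrnoteqa(path):
    """Read EHRNoteQA.jsonl into a list of per-patient records."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def format_multiple_choice(record):
    """Render one record as a five-option multiple-choice prompt."""
    options = "\n".join(
        f"{letter}. {record[f'choice_{letter}']}" for letter in "ABCDE"
    )
    return f"{record['question']}\n{options}"
```

A model's answer can then be scored by comparing its chosen letter against the `answer` field.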

Usage Notes

Future Direction

The current version of EHRNoteQA is tailored to the context lengths of currently available LLMs, with Level 1 data fitting a 4k-token context and Level 2 data an 8k-token context. As models capable of handling longer contexts are released, we plan to extend EHRNoteQA to include patients with more admissions and longer discharge summaries. Additionally, while EHRNoteQA focuses on discharge summaries due to their frequent use in clinical practice, we recognize the importance of expanding the dataset to other types of notes essential in healthcare settings, such as radiology notes and physician notes.

GitHub Repository for this Project

In our GitHub repository [15], we provide code to preprocess MIMIC-IV discharge summaries (removing excessive white space) and merge them with EHRNoteQA into a single data file.
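The exact preprocessing rules live in the repository [15]; as a rough illustration of the kind of whitespace cleanup involved (our own sketch, not the repository's code):

```python
import re

def clean_note(text):
    """Collapse excessive whitespace in a discharge summary (illustrative)."""
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> one space
    text = re.sub(r"\n{3,}", "\n\n", text)  # runs of blank lines -> one blank line
    return text.strip()
```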


For more detailed information about EHRNoteQA, please refer to our paper [16].

Release Notes

1.0.0 - Initial Release

1.0.1 - Minor Updates and Corrections

     - Added missing question mark
     - Corrected verb singular/plural forms
     - Fixed typo (changed "pateint" to "patient")


Ethics

The authors have no ethics statement to declare.


Acknowledgements

This work was supported by the KAIST-NAVER Hyper-Creative AI Center and by the National Research Foundation of Korea (NRF) grant (NRF-2020H1D3A2A03100945) funded by the Korea government (MSIT).

Conflicts of Interest

The authors have no conflicts of interest to declare.


  1. Pampari, A., Raghavan, P., Liang, J., & Peng, J. (2018). emrqa: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732.
  2. Uzuner, Ö. (2009). Recognizing obesity and comorbidities in sparse data. Journal of the American Medical Informatics Association, 16(4), 561-570.
  3. Uzuner, Ö., Goldstein, I., Luo, Y., & Kohane, I. (2008). Identifying patient smoking status from medical discharge records. Journal of the American Medical Informatics Association, 15(1), 14-24.
  4. Uzuner, Ö., Solti, I., & Cadag, E. (2010). Extracting medication information from clinical text. Journal of the American Medical Informatics Association, 17(5), 514-518.
  5. Uzuner, Ö., South, B. R., Shen, S., & DuVall, S. L. (2011). 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5), 552-556.
  6. Yue, X., Zhang, X. F., Yao, Z., Lin, S., & Sun, H. (2021, December). Cliniqg4qa: Generating diverse questions for domain adaptation of clinical question answering. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 580-587). IEEE.
  7. Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. W. H., Feng, M., Ghassemi, M., ... & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific data, 3(1), 1-9.
  8. Fan, J. (2019, June). Annotating and characterizing clinical sentences with explicit why-QA cues. In Proceedings of the 2nd Clinical Natural Language Processing Workshop (pp. 101-106).
  9. Moon, S., He, H., Jia, H., Liu, H., & Fan, J. W. (2023). Extractive clinical question-answering with multianswer and multifocus questions: Data set development and evaluation study. JMIR AI, 2(1), e41818.
  10. Lehman, E., Lialin, V., Legaspi, K. Y., Sy, A. J. R., Pile, P. T. S., Alberto, N. R. I., ... & Szolovits, P. (2022). Learning to ask like a physician. arXiv preprint arXiv:2206.02696.
  11. Henry, S., Buchan, K., Filannino, M., Stubbs, A., & Uzuner, O. (2020). 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association, 27(1), 3-12.
  12. Johnson, A., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet.
  13. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., ... & McGrew, B. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  14. Microsoft Azure: [Accessed 3/25/2024]
  15. EHRNoteQA GitHub repository. Available from:
  16. Kweon, S., Kim, J., Kwak, H., Cha, D., Yoon, H., Kim, K., ... & Choi, E. (2024). EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings. arXiv preprint arXiv:2402.16040.

Parent Projects
EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries was derived from one or more parent projects. Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
