Challenge Credentialed Access

ArchEHR-QA: A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization

Sarvesh Soni, Dina Demner-Fushman

Published: Jan. 1, 2026. Version: 1.3


When using this resource, please cite:
Soni, S., & Demner-Fushman, D. (2026). ArchEHR-QA: A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization (version 1.3). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/n708-sn25

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

Patients' unique information needs about their hospitalizations can be addressed using clinical evidence from electronic health records (EHRs) and artificial intelligence (AI). However, robust datasets to assess the factuality and relevance of AI-generated responses are lacking and, to our knowledge, none capture patient information needs in the context of their EHRs. To address this gap, we introduce ArchEHR-QA, an expert-annotated dataset of 134 cases from intensive care unit and emergency department settings for evaluating the grounding capabilities of models that respond to patient-initiated queries. Each case consists of a patient-initiated question posted in the public domain, the corresponding clinician-interpreted question, an excerpt of the associated EHR annotated at the sentence level for relevance to the question, and a clinician-generated free-text answer grounded in the EHR sentences. We collect true patient health information needs expressed in real-world health forum messages and then align these messages to publicly accessible real EHRs. To our knowledge, this is the first public dataset that pairs patient questions with relevant clinical evidence from EHRs. We further provide an evaluation framework for assessing two critical aspects of a grounded EHR QA system: whether it identifies relevant information in the given clinical evidence and whether it uses this information when responding to user queries.


Background

Question answering (QA) is an organic way to interact with complex information systems such as electronic health records (EHRs) [1], where a QA system responds to user questions with exact answers. The major focus of existing EHR QA work has been on addressing clinician information needs [2], with datasets for system development and evaluation largely prioritizing these requirements. However, with increasing patient involvement in care [3, 4], there is a need for targeted EHR QA research that addresses the unique needs patients have regarding their health records [5]. To this end, datasets play an important role in developing and evaluating tailored artificial intelligence (AI) systems and, thus, must be representative of the needs of the target end users [6], i.e., patients.

Moreover, the volume of patient requests for medical information through patient portals is rising, contributing to desktop medicine and increasing clinician burden [7]. Most existing studies on automated responses to patient messages do not incorporate critical contextual information from EHRs [8, 9]. Among studies that use EHR content, none provide comprehensive evaluations of how effectively the generated responses leverage this clinical context [10, 11].

Grounding is crucial in AI applications in medicine, as it ensures that AI models are anchored to accurate, contextually relevant, real-world clinical data. This is particularly important when the intended audience lacks clinical expertise [12]. To effectively design and evaluate grounded QA systems, a representative dataset and evaluation framework are essential [13].


Participation

Overview

The ArchEHR-QA 2025 shared task was conducted as part of the 24th Biomedical Natural Language Processing (BioNLP) Workshop at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). The 2025 iteration of the shared task has concluded and is no longer accepting participants or submissions. This PhysioNet repository is maintained as a benchmark dataset for grounded question answering from electronic health records and as an archival record of the 2025 shared task. We provide the information below for completeness and to enable interested readers to explore the challenge framework and results.

Task Description

The ArchEHR-QA 2025 shared task invited participants to develop systems that automatically answer patients' questions given important clinical evidence from their electronic health records. Specifically, given a patient-posed natural language question, the corresponding clinician-interpreted question, and the patient's clinical note excerpt, the task was to generate a natural language answer with sentence-level citations to the specific clinical note sentences. A subset of 120 patient cases from the ArchEHR-QA dataset was used for the 2025 shared task: 20 cases were provided to participants for development and 100 were used for testing.

Participation Details

Submissions of system responses for the 2025 iteration of the shared task were made through the Codabench platform. Participants registered on the shared task’s Codabench competition page and submitted their system outputs according to the evaluation timeline. Participants were invited to submit papers describing their systems to the Proceedings of the 24th Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2025. All participants were also required to send a one-paragraph summary describing their best-performing system for inclusion in the shared task overview paper.

2025 Challenge Timeline

  • First call for participation: January 24, 2025
  • Release of the development dataset: February 26, 2025
  • Release of the test dataset: April 11, 2025
  • Submission of system responses: April 28, 2025
  • Submission of shared task papers (optional): May 9, 2025
  • Notification of acceptance: May 17, 2025
  • Camera-ready system papers due: May 27, 2025
  • BioNLP Workshop date: August 1, 2025

Resources

More details about the 2025 shared task, participation, and results are available on the shared task website and the Codabench competition page.


Data Description

The dataset consists of questions (inspired by real patient questions) and associated EHR data (derived from the MIMIC-III [14] and MIMIC-IV [15] databases) containing important clinical evidence to answer these questions. Each question-note pair is referred to as a "case". Clinical note excerpts are pre-annotated with sentence numbers, which are used to cite the clinical evidence sentences in clinician-authored answers. Each sentence is manually annotated with a "relevance" label ("essential", "supplementary", or "not-relevant") marking its importance in answering the given question.

The dataset contains a total of 134 patient cases: 34 in the development set and 100 in the test set.

Format

The dataset is provided as XML and JSON files.

Cases

The main data file, archehr-qa.xml, contains the cases in the following format:

<annotations>
    ...
    <case id="5">
        <clinical_specialty>cardiology</clinical_specialty>
        <patient_narrative>
            I am 48 years old. On February 20, I passed out, was taken to the hospital, and had two other episodes. I have chronic kidney disease with creatine around 1.5. I had anemia and hemoglobin was 10.3. I was in ICU 8 days and discharged in stable condition. My doctor performed a cardiac catherization. I had no increase in cardiac enzymes and an ECHO in the hospital showed 25% LVEF. Was this invasive, risky procedure necessary.
        </patient_narrative>
        <patient_question>
            <phrase id="0" start_char_index="254"> My doctor performed a cardiac catherization. </phrase>
            <phrase id="1" start_char_index="381"> Was this invasive, risky procedure necessary. </phrase>
        </patient_question>
        <clinician_question> Why was cardiac catheterization recommended to the patient? </clinician_question>
        <note_excerpt> History of Present Illness: ... Brief Hospital Course: ... </note_excerpt>
        <note_excerpt_sentences>
            <sentence id="1" paragraph_id="1" start_char_index="0"> History of Present Illness: </sentence>
            <sentence id="2" paragraph_id="2" start_char_index="0"> ... </sentence>
            ...
        </note_excerpt_sentences>
    </case>
    ...
</annotations>

Here, the XML elements represent:

  • <case>: each patient case with its "id".
  • <clinical_specialty>: clinical specialty (or specialties, separated by a pipe symbol |) of the case.
  • <patient_narrative>: full patient narrative question.
  • <patient_question>: key phrases in the narrative identified as focal points related to the patient’s question.
    • <phrase>: each annotated phrase with "id" and "start_char_index".
  • <clinician_question>: question posed by a clinician.
  • <note_excerpt>: clinical note excerpt serving as evidence.
  • <note_excerpt_sentences>: annotated sentences in the note excerpt.
    • <sentence>: each annotated sentence with "id", "paragraph_id", and "start_char_index" in the paragraph.
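
As an illustration, the cases can be read with standard XML tooling. The following is a minimal Python sketch (assuming archehr-qa.xml is in the working directory and follows the layout above); it is provided for convenience only and is not an official loader.

import xml.etree.ElementTree as ET

tree = ET.parse("archehr-qa.xml")
root = tree.getroot()  # <annotations>

for case in root.findall("case"):
    case_id = case.get("id")
    specialty = case.findtext("clinical_specialty")
    clinician_question = (case.findtext("clinician_question") or "").strip()
    # Pre-annotated note sentences, keyed by the sentence ID used for citations
    sentences = {
        s.get("id"): (s.text or "").strip()
        for s in case.find("note_excerpt_sentences").findall("sentence")
    }
    print(case_id, specialty, clinician_question, len(sentences))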

Keys and Mappings

The answer keys are present in archehr-qa_key.json, which is structured as follows:

[
    {
        "case_id": "5",
        "clinician_answer": "The patient was recommended a cardiac catheterization for worsening heart failure confirmed by left ventricle ejection fraction of 25% on his echocardiogram [5,18,10]. He had low output heart failure which caused increasing intra-abdominal pressure resulting in congestive hepatopathy, abdominal pain, and right upper quadrant abdominal tenderness [11]. The cardiac catheterization showed the patient needed milrinone for treatment [19]. Milrinone infusion improved the patient's heart pump function by significantly improving cardiac output and wedge pressure [20,13].",
        "answers": [
            { "sentence_id": "1", "relevance": "essential" },
            ...
            { "sentence_id": "6", "relevance": "supplementary" },
            ...
            { "sentence_id": "8", "relevance": "not-relevant" },
            ...
        ]
    },
    { ... }
]

Here, each dictionary in the JSON array contains:

  • "case_id": case ID from the XML data file.
  • "clinician_answer": answer to the question authored by a clinician with citations to specific note sentence IDs enclosed in square brackets ([]).
  • "answers": relevance labels for each sentence.
    • "sentence_id": sentence id from the pre-annotated sentences in the XML data file.
    • "relevance": annotated label indicating the importance of this sentence to answer the given question.

The ID mappings are available in archehr-qa_mapping.json, which is structured as follows:

[
    {
        "case_id": "1",
        "document_id": "179164_41762",
        "document_source": "mimic-iii"
    },
    ...
    {
        "case_id": "11",
        "document_id": "22805349",
        "document_source": "mimic-iv"
    },
    ...
]

Here, each dictionary in the JSON array has:

  • "case_id": case ID from the XML data file.
  • "document_id": refers to note ID in the corresponding database -- {HADM_ID}_{ROW_ID} for "mimic-iii" and {hadm_id} for "mimic-iv"
  • "document_source": version of the mimic database this note was sourced from

Statistics

Table 1. Descriptive statistics of the cases in the dataset. ICU: Intensive Care Unit, ED: Emergency Department. Mean note sentence counts are given in count (proportion) format.

Category                      | Value         | ICU (N=104)  | ED (N=30)    | All (N=134)
Patient Narrative Word Count  | Mean          | 90.2         | 94.8         | 91.3
                              | Median        | 72.0         | 90.5         | 76.5
                              | Std Dev       | 61.7         | 36.6         | 56.9
                              | Min           | 33           | 54           | 33
                              | Max           | 440          | 192          | 440
Clinician Question Word Count | Mean          | 10.6         | 10.2         | 10.5
                              | Median        | 10.0         | 9.0          | 10.0
                              | Std Dev       | 3.6          | 3.8          | 3.6
                              | Min           | 3            | 4            | 3
                              | Max           | 21           | 21           | 21
Answer Word Count             | Mean          | 72.6         | 72.3         | 72.6
                              | Median        | 73.0         | 73.5         | 73.0
                              | Std Dev       | 3.4          | 3.4          | 3.4
                              | Min           | 55           | 61           | 55
                              | Max           | 78           | 75           | 78
Note Excerpt Word Count       | Mean          | 410.2        | 280.7        | 381.2
                              | Median        | 383.5        | 223.0        | 351.5
                              | Std Dev       | 200.1        | 196.8        | 205.9
                              | Min           | 107          | 76           | 76
                              | Max           | 1028         | 868          | 1028
Note Sentences Count (Mean)   | All           | 27.6         | 19.3         | 25.7
                              | Essential     | 7.0 (25.5%)  | 5.2 (26.7%)  | 6.6 (25.7%)
                              | Supplementary | 5.9 (21.4%)  | 2.6 (13.4%)  | 5.2 (20.1%)
                              | Not Required  | 14.7 (53.1%) | 11.6 (59.8%) | 14.0 (54.3%)

Table 2. Clinical specialties of the cases in the dataset, stratified by the different clinical settings. ICU: Intensive Care Unit, ED: Emergency Department. Values are in count (proportion) format.

Clinical Specialty  | ICU (N=104) | ED (N=30) | All (N=134)
Cardiology          | 24 (23.1%)  | 2 (6.7%)  | 26 (19.4%)
Neurology           | 18 (17.3%)  | 7 (23.3%) | 25 (18.7%)
Pulmonology         | 15 (14.4%)  | 3 (10.0%) | 18 (13.4%)
Infectious Diseases | 11 (10.6%)  | 3 (10.0%) | 14 (10.4%)
Gastroenterology    | 7 (6.7%)    | 4 (13.3%) | 11 (8.2%)
Cardiovascular      | 8 (7.7%)    | 0 (0.0%)  | 8 (6.0%)
Hematology          | 5 (4.8%)    | 1 (3.3%)  | 6 (4.5%)
Oncology            | 5 (4.8%)    | 1 (3.3%)  | 6 (4.5%)
Hepatology          | 5 (4.8%)    | 0 (0.0%)  | 5 (3.7%)
Nephrology          | 5 (4.8%)    | 0 (0.0%)  | 5 (3.7%)
Traumatology        | 3 (2.9%)    | 2 (6.7%)  | 5 (3.7%)
Pain Management     | 1 (1.0%)    | 3 (10.0%) | 4 (3.0%)
Psychiatry          | 2 (1.9%)    | 2 (6.7%)  | 4 (3.0%)
Rehabilitation      | 1 (1.0%)    | 3 (10.0%) | 4 (3.0%)
Urology             | 3 (2.9%)    | 1 (3.3%)  | 4 (3.0%)
Endocrinology       | 3 (2.9%)    | 0 (0.0%)  | 3 (2.2%)
Immunology          | 2 (1.9%)    | 0 (0.0%)  | 2 (1.5%)
Internal Medicine   | 1 (1.0%)    | 1 (3.3%)  | 2 (1.5%)
Obstetrics          | 1 (1.0%)    | 1 (3.3%)  | 2 (1.5%)
Toxicology          | 2 (1.9%)    | 0 (0.0%)  | 2 (1.5%)
Genetics            | 1 (1.0%)    | 0 (0.0%)  | 1 (0.7%)
Gynecology          | 0 (0.0%)    | 1 (3.3%)  | 1 (0.7%)
Neuropsychology     | 1 (1.0%)    | 0 (0.0%)  | 1 (0.7%)
Neurosurgery        | 1 (1.0%)    | 0 (0.0%)  | 1 (0.7%)
Rheumatology        | 0 (0.0%)    | 1 (3.3%)  | 1 (0.7%)

Evaluation

System-generated responses are evaluated along two dimensions: Factuality (use of clinical evidence for grounding) and Relevance (similarity to the ground-truth answer).

Factuality is measured using an F1 Score between the sentences cited as evidence in the system-generated answer (which are treated as predicted essential note sentences) and the ground truth relevance labels for note sentences. We define two versions of Factuality: “Essential-only” and “Essential + Supplementary”. In the “Essential-only” definition, only sentences labeled as essential in the ground truth count as positives. In the “Essential + Supplementary” version, ground truth sentences labeled as either essential or supplementary are counted as positives (penalizing the system for failing to cite either, but not for including supplementary ones). We report Factuality both at the macro level (averaging per-case F1 scores) and the micro level (aggregating true positives, false positives, and false negatives across all cases). We designate the essential-only micro F1 Score as Overall Factuality and use it to calculate the Overall Score, as it captures aggregate performance across all instances focusing on the most important note sentences.
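
For illustration, the sketch below computes macro and micro Essential-only Factuality F1 from per-case sets of cited and essential sentence IDs. It approximates the description above and is not the official scoring script.

def _f1(pred, gold):
    # Standard F1 over two sets of sentence IDs for a single case.
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def factuality(predicted, gold_essential):
    # predicted / gold_essential: dicts mapping case_id -> set of sentence IDs.
    cases = list(gold_essential)
    # Macro: average the per-case F1 scores.
    macro = sum(_f1(predicted.get(c, set()), gold_essential[c]) for c in cases) / len(cases)
    # Micro: pool true positives, false positives, and false negatives across cases.
    tp = sum(len(predicted.get(c, set()) & gold_essential[c]) for c in cases)
    fp = sum(len(predicted.get(c, set()) - gold_essential[c]) for c in cases)
    fn = sum(len(gold_essential[c] - predicted.get(c, set())) for c in cases)
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    micro = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"macro_f1": macro, "micro_f1": micro}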

Relevance is evaluated by comparing the generated answer text to a clinician-authored reference answer using text- and semantics-based metrics: BLEU [16], ROUGE [17], SARI [18], BERTScore [19], AlignScore [20], and MEDCON [21]. Each metric is normalized, and Overall Relevance is computed as the mean of these normalized scores. We compute another version of relevance scores by treating the set of essential ground truth sentences (together with the original question) as the reference. This alternative provides a feasible and scalable approximation for evaluating answer quality in the absence of human-written answers.
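
Under the simplifying assumption that each underlying metric has already been computed and normalized to the [0, 1] range, Overall Relevance reduces to a simple mean, as in the sketch below; the exact normalization used by the challenge may differ.

def overall_relevance(metric_scores):
    # metric_scores: dict of already-normalized scores, e.g.
    # {"bleu": 0.12, "rouge": 0.34, "sari": 0.41,
    #  "bertscore": 0.88, "alignscore": 0.67, "medcon": 0.52}
    return sum(metric_scores.values()) / len(metric_scores)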


Release Notes

Version 1.3: Release a total of 134 patient cases. Add a clinical_specialty field to each case indicating its clinical specialty (or specialties), and a clinician_answer field containing clinician-authored answers to the questions with sentence-level citations to clinical evidence.

Version 1.2: Release the test dataset with 100 patient cases. Adjust sentence and paragraph indices in the development dataset to start from 1 instead of 0.

Version 1.1: Update the development dataset by removing extraneous questions.

Version 1.0: Release the development dataset with 20 patient cases.


Ethics

All members of the organizing team completed the required training and are credentialed users of the MIMIC-III and MIMIC-IV databases.


Acknowledgements

This work was supported by the Division of Intramural Research (DIR) of the National Library of Medicine (NLM), National Institutes of Health, and utilized the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Ely, J. W., Osheroff, J. A., Chambliss, M. L., Ebell, M. H. & Rosenbaum, M. E. Answering Physicians’ Clinical Questions: Obstacles and Potential Solutions. J. Am. Med. Inform. Assoc. 12, 217–224 (2005).
  2. Bardhan, J., Roberts, K. & Wang, D. Z. Question Answering for Electronic Health Records: Scoping Review of Datasets and Models. J. Med. Internet Res. 26, e53636 (2024).
  3. Fisher, B., Bhavnani, V. & Winfield, M. How patients use access to their full health records: A qualitative study of patients in general practice. J. R. Soc. Med. 102, 538–544 (2009).
  4. Woods, S. S. et al. Patient experiences with full electronic access to health records and clinical notes through the my healthevet personal health record pilot: Qualitative study. J. Med. Internet Res. 15, 403 (2013).
  5. Pieper, B. et al. Discharge Information Needs of Patients After Surgery. J. Wound. Ostomy Continence Nurs. 33, 281 (2006).
  6. Arora, A. et al. The value of standards for health datasets in artificial intelligence-based applications. Nat. Med. 29, 2929–2938 (2023).
  7. Martinez, K. A., Schulte, R., Rothberg, M. B., Tang, M. C. & Pfoh, E. R. Patient Portal Message Volume and Time Spent on the EHR: an Observational Study of Primary Care Clinicians. J. Gen. Intern. Med. 39, 566–572 (2024).
  8. Liu, S. et al. Leveraging large language models for generating responses to patient messages—a subjective analysis. J. Am. Med. Inform. Assoc. ocae052 (2024)
  9. Biro, J. M. et al. Opportunities and risks of artificial intelligence in patient portal messaging in primary care. Npj Digit. Med. 8, 1–6 (2025).
  10. Small, W. R. et al. Large Language Model–Based Responses to Patients’ In-Basket Messages. JAMA Netw. Open 7, e2422399 (2024).
  11. Garcia, P. et al. Artificial Intelligence–Generated Draft Replies to Patient Inbox Messages. JAMA Netw. Open 7, e243201 (2024).
  12. Haug, C. J. & Drazen, J. M. Artificial Intelligence and Machine Learning in Clinical Medicine, 2023. N. Engl. J. Med. 388, 1201–1208 (2023).
  13. Shah, N. H., Entwistle, D. & Pfeffer, M. A. Creation and Adoption of Large Language Models in Medicine. JAMA (2023).
  14. Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).
  15. Johnson, A. E. W. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10, 1 (2023).
  16. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. in Proceedings of the 40th annual meeting on association for computational linguistics 311–318 (Association for Computational Linguistics, 2002).
  17. Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. in Text Summarization Branches Out 74–81 (Association for Computational Linguistics, Barcelona, Spain, 2004).
  18. Xu, W., Napoles, C., Pavlick, E., Chen, Q. & Callison-Burch, C. Optimizing Statistical Machine Translation for Text Simplification. Trans. Assoc. Comput. Linguist. 4, 401–415 (2016).
  19. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: Evaluating Text Generation with BERT. in International Conference on Learning Representations (2019).
  20. Zha, Y., Yang, Y., Li, R. & Hu, Z. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Rogers, A., Boyd-Graber, J. & Okazaki, N.) 11328–11348 (Association for Computational Linguistics, Toronto, Canada, 2023).
  21. Yim, W. et al. Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Sci. Data 10, 586 (2023).

Parent Projects
ArchEHR-QA: A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization was derived from the MIMIC-III and MIMIC-IV databases. Please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
