Challenge Credentialed Access
ArchEHR-QA: A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization
Sarvesh Soni , Dina Demner-Fushman
Published: Jan. 1, 2026. Version: 1.3
When using this resource, please cite:
Soni, S., & Demner-Fushman, D. (2026). ArchEHR-QA: A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization (version 1.3). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/n708-sn25
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
Patients' unique information needs about their hospitalization can be addressed using clinical evidence from electronic health records (EHRs) and artificial intelligence (AI). However, robust datasets to assess the factuality and relevance of AI-generated responses are lacking and, to our knowledge, none capture patient information needs in the context of their EHRs. To address this gap, we introduce ArchEHR-QA, an expert-annotated dataset of 134 cases from intensive care unit and emergency department settings for evaluating the grounding capabilities of models that respond to patient-initiated queries. The dataset consists of patient-initiated questions posted in the public domain, the corresponding clinician-interpreted questions, EHR excerpts annotated at the sentence level for relevance to the question, and clinician-generated free-text answers grounded in the EHR sentences. We collect real patient health information needs expressed in real-world health forum messages and then align these messages to publicly accessible real EHRs. To our knowledge, this is the first public dataset that couples patient questions with relevant clinical evidence from EHRs. We further provide an evaluation framework to assess two critical aspects of a grounded EHR QA system: whether it identifies relevant information in the given clinical evidence, and whether it uses this information when responding to user queries.
Background
Question answering (QA) is an organic way to interact with complex information systems such as electronic health records (EHRs) [1], where a QA system responds to user questions with exact answers. The major focus of existing EHR QA work has been on addressing clinician information needs [2], with datasets for system development and evaluation largely prioritizing these requirements. However, with the increasing involvement of patients in their own care [3, 4], there is a need for targeted EHR QA research that incorporates patients' unique needs from their health records [5]. To this end, datasets play an important role in developing and evaluating tailored artificial intelligence (AI) systems, and thus the datasets must be representative of the needs of the target end users [6], i.e., patients.
Moreover, the volume of patient requests for medical information through patient portals is rising, contributing to desktop medicine and increasing clinician burden [7]. Most existing studies on automated responses to patient messages do not incorporate critical contextual information from EHRs [8, 9]. Among studies that use EHR content, none provide comprehensive evaluations of how effectively the generated responses leverage this clinical context [10, 11].
Grounding is crucial in AI applications in medicine, as it ensures that AI models are anchored to accurate, contextually relevant, real-world clinical data. This is particularly important when the intended audience lacks clinical expertise [12]. To effectively design and evaluate grounded QA systems, a representative dataset and evaluation framework are essential [13].
Participation
Overview
The ArchEHR-QA 2025 shared task was conducted as part of the 24th Biomedical Natural Language Processing (BioNLP) Workshop at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). The 2025 iteration of the shared task has concluded and is no longer accepting participants or submissions. This PhysioNet repository is maintained as a benchmark dataset for grounded question answering from electronic health records and as an archival record of the 2025 shared task. We provide the information below for completeness and to enable interested readers to explore the challenge framework and results.
Task Description
The ArchEHR-QA 2025 shared task invited participants to develop systems that automatically answer patients' questions given important clinical evidence from their electronic health records. Specifically, given a patient-posed natural language question, the corresponding clinician-interpreted question, and the patient's clinical note excerpt, the task was to generate a natural language answer with sentence-level citations to the specific clinical note sentences. A subset of 120 patient cases from the ArchEHR-QA dataset was used for the 2025 shared task: 20 cases were provided to participants for development and 100 cases were used for testing.
Participation Details
Submissions of system responses for the 2025 iteration of the shared task were made through the Codabench platform. Participants registered on the shared task’s Codabench competition page and submitted their system outputs according to the evaluation timeline. Participants were invited to submit papers describing their systems to the Proceedings of the 24th Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2025. All participants were also required to send a one-paragraph summary describing their best-performing system for inclusion in the shared task overview paper.
2025 Challenge Timeline
- First call for participation: January 24, 2025
- Release of the development dataset: February 26, 2025
- Release of the test dataset: April 11, 2025
- Submission of system responses: April 28, 2025
- Submission of shared task papers (optional): May 9, 2025
- Notification of acceptance: May 17, 2025
- Camera-ready system papers due: May 27, 2025
- BioNLP Workshop date: August 1, 2025
Resources
More details about the 2025 shared task, participation, and results can be found at the following links:
- Shared task website (task description, schedule, and evaluation details): https://archehr-qa.github.io
- Codabench competition page (archived leaderboard with submission results): https://www.codabench.org/competitions/5302/
- Overview paper (ArchEHR-QA 2025 shared task results): https://aclanthology.org/2025.bionlp-1.34/
Data Description
The dataset consists of questions (inspired by real patient questions) and associated EHR data (derived from the MIMIC-III [14] and MIMIC-IV [15] databases) containing important clinical evidence to answer these questions. Each question-note pair is referred to as a "case". Clinical note excerpts are pre-annotated with sentence numbers, which are used to cite the clinical evidence sentences in the clinician-authored answers. Each sentence is manually annotated with a "relevance" label marking its importance for answering the given question as "essential", "supplementary", or "not-relevant".
The dataset contains a total of 134 patient cases: 34 in the development set and 100 in the test set.
Format
The dataset is provided as XML and JSON files.
Cases
The main data file, archehr-qa.xml, contains the cases in the following format:
<annotations>
...
<case id="5">
<clinical_specialty>cardiology</clinical_specialty>
<patient_narrative>
I am 48 years old. On February 20, I passed out, was taken to the hospital, and had two other episodes. I have chronic kidney disease with creatine around 1.5. I had anemia and hemoglobin was 10.3. I was in ICU 8 days and discharged in stable condition. My doctor performed a cardiac catherization. I had no increase in cardiac enzymes and an ECHO in the hospital showed 25% LVEF. Was this invasive, risky procedure necessary.
</patient_narrative>
<patient_question>
<phrase id="0" start_char_index="254">
My doctor performed a cardiac catherization.
</phrase>
<phrase id="1" start_char_index="381">
Was this invasive, risky procedure necessary.
</phrase>
</patient_question>
<clinician_question>
Why was cardiac catheterization recommended to the patient?
</clinician_question>
<note_excerpt>
History of Present Illness:
...
Brief Hospital Course:
...
</note_excerpt>
<note_excerpt_sentences>
<sentence id="1" paragraph_id="1" start_char_index="0">
History of Present Illness:
</sentence>
<sentence id="2" paragraph_id="2" start_char_index="0">
...
</sentence>
...
</note_excerpt_sentences>
</case>
...
</annotations>
Here, the XML elements represent:
- <case>: each patient case, with its "id".
- <clinical_specialty>: clinical specialty of the case (separated by a pipe symbol (|) if more than one).
- <patient_narrative>: full patient narrative question.
- <patient_question>: key phrases in the narrative identified as focal points related to the patient's question.
- <phrase>: each annotated phrase, with "id" and "start_char_index".
- <clinician_question>: question posed by a clinician.
- <note_excerpt>: clinical note excerpt serving as evidence.
- <note_excerpt_sentences>: annotated sentences in the note excerpt.
- <sentence>: each annotated sentence, with "id", "paragraph_id", and "start_char_index" in the paragraph.
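The cases can be read with any standard XML parser. The snippet below is a minimal sketch using Python's xml.etree.ElementTree, assuming archehr-qa.xml sits in the working directory; the variable names are illustrative and not part of any official loader.

import xml.etree.ElementTree as ET

# Parse the main data file (element names follow the format described above).
tree = ET.parse("archehr-qa.xml")
root = tree.getroot()  # <annotations>

for case in root.findall("case"):
    case_id = case.get("id")
    specialty = case.findtext("clinical_specialty", default="").strip()
    clinician_question = case.findtext("clinician_question", default="").strip()
    # Sentence ids are the ones cited as clinical evidence in the answers.
    sentences = {
        sent.get("id"): (sent.text or "").strip()
        for sent in case.find("note_excerpt_sentences").findall("sentence")
    }
    print(case_id, specialty, clinician_question, len(sentences))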
Keys and Mappings
The answer keys are present in archehr-qa_key.json, which is structured as follows:
[
{
"case_id": "5",
"clinician_answer": "The patient was recommended a cardiac catheterization for worsening heart failure confirmed by left ventricle ejection fraction of 25% on his echocardiogram [5,18,10]. He had low output heart failure which caused increasing intra-abdominal pressure resulting in congestive hepatopathy, abdominal pain, and right upper quadrant abdominal tenderness [11]. The cardiac catheterization showed the patient needed milrinone for treatment [19]. Milrinone infusion improved the patient's heart pump function by significantly improving cardiac output and wedge pressure [20,13].",
"answers": [
{
"sentence_id": "1",
"relevance": "essential"
},
...
{
"sentence_id": "6",
"relevance": "supplementary"
},
...
{
"sentence_id": "8",
"relevance": "not-relevant"
},
...
]
},
{
...
}
]
Here, each dictionary in the JSON array contains:
"case_id": case ID from the XML data file."clinician_answer": answer to the question authored by a clinician with citations to specific note sentence IDs enclosed in square brackets ([])."answers": relevance labels for each sentence."sentence_id": sentence id from the pre-annotated sentences in the XML data file."relevance": annotated label indicating the importance of this sentence to answer the given question.
The ID mappings are available in archehr-qa_mapping.json, which is structured as follows:
[
{
"case_id": "1",
"document_id": "179164_41762",
"document_source": "mimic-iii"
},
...
{
"case_id": "11",
"document_id": "22805349",
"document_source": "mimic-iv"
},
...
]
Here, each dictionary in the JSON array has:
"case_id": case ID from the XML data file."document_id": refers to note ID in the corresponding database --{HADM_ID}_{ROW_ID}for"mimic-iii"and{hadm_id}for"mimic-iv""document_source": version of the mimic database this note was sourced from
Statistics
Table 1. Descriptive statistics of the cases in the dataset. ICU: Intensive Care Unit, ED: Emergency Department. Mean values are in count (proportion) format.
| Category | Value | ICU (N=104) | ED (N=30) | All (N=134) |
| Patient Narrative Word Count | Mean | 90.2 | 94.8 | 91.3 |
| | Median | 72.0 | 90.5 | 76.5 |
| | Std Dev | 61.7 | 36.6 | 56.9 |
| | Min | 33 | 54 | 33 |
| | Max | 440 | 192 | 440 |
| Clinician Question Word Count | Mean | 10.6 | 10.2 | 10.5 |
| | Median | 10.0 | 9.0 | 10.0 |
| | Std Dev | 3.6 | 3.8 | 3.6 |
| | Min | 3 | 4 | 3 |
| | Max | 21 | 21 | 21 |
| Answer Word Count | Mean | 72.6 | 72.3 | 72.6 |
| | Median | 73.0 | 73.5 | 73.0 |
| | Std Dev | 3.4 | 3.4 | 3.4 |
| | Min | 55 | 61 | 55 |
| | Max | 78 | 75 | 78 |
| Note Excerpt Word Count | Mean | 410.2 | 280.7 | 381.2 |
| | Median | 383.5 | 223.0 | 351.5 |
| | Std Dev | 200.1 | 196.8 | 205.9 |
| | Min | 107 | 76 | 76 |
| | Max | 1028 | 868 | 1028 |
| Mean Note Sentences Count | All | 27.6 | 19.3 | 25.7 |
| | Essential | 7.0 (25.5%) | 5.2 (26.7%) | 6.6 (25.7%) |
| | Supplementary | 5.9 (21.4%) | 2.6 (13.4%) | 5.2 (20.1%) |
| | Not Required | 14.7 (53.1%) | 11.6 (59.8%) | 14.0 (54.3%) |
Table 2. Clinical specialties of the cases in the dataset, stratified by the different clinical settings. ICU: Intensive Care Unit, ED: Emergency Department. Values are in count (proportion) format.
| Clinical Specialty | ICU (N=104) | ED (N=30) | All (N=134) |
| Cardiology | 24 (23.1%) | 2 (6.7%) | 26 (19.4%) |
| Neurology | 18 (17.3%) | 7 (23.3%) | 25 (18.7%) |
| Pulmonology | 15 (14.4%) | 3 (10.0%) | 18 (13.4%) |
| Infectious Diseases | 11 (10.6%) | 3 (10.0%) | 14 (10.4%) |
| Gastroenterology | 7 (6.7%) | 4 (13.3%) | 11 (8.2%) |
| Cardiovascular | 8 (7.7%) | 0 (0.0%) | 8 (6.0%) |
| Hematology | 5 (4.8%) | 1 (3.3%) | 6 (4.5%) |
| Oncology | 5 (4.8%) | 1 (3.3%) | 6 (4.5%) |
| Hepatology | 5 (4.8%) | 0 (0.0%) | 5 (3.7%) |
| Nephrology | 5 (4.8%) | 0 (0.0%) | 5 (3.7%) |
| Traumatology | 3 (2.9%) | 2 (6.7%) | 5 (3.7%) |
| Pain Management | 1 (1.0%) | 3 (10.0%) | 4 (3.0%) |
| Psychiatry | 2 (1.9%) | 2 (6.7%) | 4 (3.0%) |
| Rehabilitation | 1 (1.0%) | 3 (10.0%) | 4 (3.0%) |
| Urology | 3 (2.9%) | 1 (3.3%) | 4 (3.0%) |
| Endocrinology | 3 (2.9%) | 0 (0.0%) | 3 (2.2%) |
| Immunology | 2 (1.9%) | 0 (0.0%) | 2 (1.5%) |
| Internal Medicine | 1 (1.0%) | 1 (3.3%) | 2 (1.5%) |
| Obstetrics | 1 (1.0%) | 1 (3.3%) | 2 (1.5%) |
| Toxicology | 2 (1.9%) | 0 (0.0%) | 2 (1.5%) |
| Genetics | 1 (1.0%) | 0 (0.0%) | 1 (0.7%) |
| Gynecology | 0 (0.0%) | 1 (3.3%) | 1 (0.7%) |
| Neuropsychology | 1 (1.0%) | 0 (0.0%) | 1 (0.7%) |
| Neurosurgery | 1 (1.0%) | 0 (0.0%) | 1 (0.7%) |
| Rheumatology | 0 (0.0%) | 1 (3.3%) | 1 (0.7%) |
Evaluation
System-generated responses are evaluated along two dimensions: Factuality (use of clinical evidence for grounding) and Relevance (similarity to the ground-truth answer).
Factuality is measured using an F1 score between the sentences cited as evidence in the system-generated answer (which are treated as predicted essential note sentences) and the ground-truth relevance labels for the note sentences. We define two versions of Factuality: "Essential-only" and "Essential + Supplementary". In the "Essential-only" definition, only sentences labeled as essential in the ground truth count as positives. In the "Essential + Supplementary" version, ground-truth sentences labeled as either essential or supplementary count as positives (penalizing the system for failing to cite either, but not for citing supplementary ones). We report Factuality at both the macro level (averaging per-case F1 scores) and the micro level (aggregating true positives, false positives, and false negatives across all cases). We designate the essential-only micro F1 score as Overall Factuality and use it to calculate the Overall Score, as it captures aggregate performance across all instances while focusing on the most important note sentences.
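A schematic version of this scoring is sketched below; the function names are hypothetical and this is not the official evaluation script. Cited sentence IDs are treated as predicted positives and compared against the ground-truth labels, with the positive label set switching between the two Factuality definitions.

def case_counts(cited, gold, positive_labels=frozenset({"essential"})):
    # cited: set of sentence ids cited in the generated answer
    # gold: dict mapping sentence id -> relevance label
    positives = {sid for sid, label in gold.items() if label in positive_labels}
    tp = len(cited & positives)
    fp = len(cited - positives)
    fn = len(positives - cited)
    return tp, fp, fn

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def factuality(cases, positive_labels=frozenset({"essential"})):
    # cases: list of (cited_ids, gold_labels) pairs, one per patient case
    counts = [case_counts(cited, gold, positive_labels) for cited, gold in cases]
    macro_f1 = sum(f1(*c) for c in counts) / len(counts)
    tp, fp, fn = (sum(col) for col in zip(*counts))
    micro_f1 = f1(tp, fp, fn)  # essential-only micro F1 is Overall Factuality
    return {"macro_f1": macro_f1, "micro_f1": micro_f1}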
Relevance is evaluated by comparing the generated answer text to a clinician-authored reference answer using text- and semantics-based metrics: BLEU [16], ROUGE [17], SARI [18], BERTScore [19], AlignScore [20], and MEDCON [21]. Each metric is normalized, and Overall Relevance is computed as the mean of these normalized scores. We compute another version of relevance scores by treating the set of essential ground truth sentences (together with the original question) as the reference. This alternative provides a feasible and scalable approximation for evaluating answer quality in the absence of human-written answers.
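The aggregation of the relevance metrics can be sketched as follows; the per-metric scoring functions are left as placeholders for the respective library implementations, and the exact normalization used by the official evaluation may differ.

def overall_relevance(generated, reference, metrics):
    # metrics: dict mapping a metric name to (score_fn, max_value), where
    # score_fn(generated, reference) returns a raw score and max_value rescales
    # it to the [0, 1] range (e.g., 100.0 for BLEU, 1.0 for BERTScore).
    normalized = [score_fn(generated, reference) / max_value
                  for score_fn, max_value in metrics.values()]
    return sum(normalized) / len(normalized)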
Release Notes
Version 1.3: Release a total of 134 patient cases. Add clinical_specialty field to each case indicating the clinical specialty(ies) of the case, and clinician_answer field containing clinician-authored answers to the questions with sentence-level citations to clinical evidence.
Version 1.2: Release the test dataset with 100 patient cases. Adjust sentence and paragraph indices in the development dataset to start from 1 instead of 0.
Version 1.1: Update the development dataset by removing extraneous questions.
Version 1.0: Release the development dataset with 20 patient cases.
Ethics
All members of the organizing team completed the required training to access and are credentialed users of MIMIC-III and MIMIC-IV databases.
Acknowledgements
This work was supported by the Division of Intramural Research (DIR) of the National Library of Medicine (NLM), National Institutes of Health, and utilized the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
- Ely, J. W., Osheroff, J. A., Chambliss, M. L., Ebell, M. H. & Rosenbaum, M. E. Answering Physicians’ Clinical Questions: Obstacles and Potential Solutions. J. Am. Med. Inform. Assoc. 12, 217–224 (2005).
- Bardhan, J., Roberts, K. & Wang, D. Z. Question Answering for Electronic Health Records: Scoping Review of Datasets and Models. J. Med. Internet Res. 26, e53636 (2024).
- Fisher, B., Bhavnani, V. & Winfield, M. How patients use access to their full health records: A qualitative study of patients in general practice. J. R. Soc. Med. 102, 538–544 (2009).
- Woods, S. S. et al. Patient experiences with full electronic access to health records and clinical notes through the my healthevet personal health record pilot: Qualitative study. J. Med. Internet Res. 15, 403 (2013).
- Pieper, B. et al. Discharge Information Needs of Patients After Surgery. J. Wound. Ostomy Continence Nurs. 33, 281 (2006).
- Arora, A. et al. The value of standards for health datasets in artificial intelligence-based applications. Nat. Med. 29, 2929–2938 (2023).
- Martinez, K. A., Schulte, R., Rothberg, M. B., Tang, M. C. & Pfoh, E. R. Patient Portal Message Volume and Time Spent on the EHR: an Observational Study of Primary Care Clinicians. J. Gen. Intern. Med. 39, 566–572 (2024).
- Liu, S. et al. Leveraging large language models for generating responses to patient messages—a subjective analysis. J. Am. Med. Inform. Assoc. ocae052 (2024).
- Biro, J. M. et al. Opportunities and risks of artificial intelligence in patient portal messaging in primary care. Npj Digit. Med. 8, 1–6 (2025).
- Small, W. R. et al. Large Language Model–Based Responses to Patients’ In-Basket Messages. JAMA Netw. Open 7, e2422399 (2024).
- Garcia, P. et al. Artificial Intelligence–Generated Draft Replies to Patient Inbox Messages. JAMA Netw. Open 7, e243201 (2024).
- Haug, C. J. & Drazen, J. M. Artificial Intelligence and Machine Learning in Clinical Medicine, 2023. N. Engl. J. Med. 388, 1201–1208 (2023).
- Shah, N. H., Entwistle, D. & Pfeffer, M. A. Creation and Adoption of Large Language Models in Medicine. JAMA (2023).
- Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).
- Johnson, A. E. W. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10, 1 (2023).
- Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. in Proceedings of the 40th annual meeting on association for computational linguistics 311–318 (Association for Computational Linguistics, 2002).
- Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. in Text Summarization Branches Out 74–81 (Association for Computational Linguistics, Barcelona, Spain, 2004).
- Xu, W., Napoles, C., Pavlick, E., Chen, Q. & Callison-Burch, C. Optimizing Statistical Machine Translation for Text Simplification. Trans. Assoc. Comput. Linguist. 4, 401–415 (2016).
- Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: Evaluating Text Generation with BERT. in International Conference on Learning Representations (2019).
- Zha, Y., Yang, Y., Li, R. & Hu, Z. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Rogers, A., Boyd-Graber, J. & Okazaki, N.) 11328–11348 (Association for Computational Linguistics, Toronto, Canada, 2023).
- Yim, W. et al. Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Sci. Data 10, 586 (2023).
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.3):
https://doi.org/10.13026/n708-sn25
DOI (latest version):
https://doi.org/10.13026/zzax-sy62
Topics:
question answering
electronic health record
patient portals
clinicians
Project Website:
https://archehr-qa.github.io
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project