Database Credentialed Access

MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters

Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb Smith, Jens Kleesiek, Julian Friedrich

Published: May 5, 2025. Version: 1.0.0


When using this resource, please cite:
Dada, A., Koras, O. A., Bauer, M., Butler, A., Smith, K., Kleesiek, J., & Friedrich, J. (2025). MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters (version 1.0.0). PhysioNet. https://doi.org/10.13026/f566-h049.

Additionally, please cite the original publication:

Dada, A., Koras, O., Bauer, M., Butler, A., Smith, K., Kleesiek, J., & Friedrich, J. (2025). MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters. arXiv preprint arXiv:2502.03298.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

While increasing patients' access to medical documents improves medical care, this benefit is limited by varying health literacy levels and complex medical terminology. Large language models (LLMs) offer solutions by simplifying medical information. However, evaluating LLMs for safe and patient-friendly text generation is difficult due to the lack of standardized evaluation resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset created from MIMIC-IV discharge summaries through an automated pipeline combining LLM-based question-answer generation with manual quality checks. We use this dataset to evaluate various LLMs on patient-oriented question-answering. Our findings reveal that general-purpose LLMs frequently surpass biomedical-adapted models and that automated metrics correlate with human judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the development of LLMs to enhance patient understanding and ultimately improve care outcomes.


Background

While several clinical QA datasets exist [1–6], none, to the best of our knowledge, are explicitly designed for patient-oriented use.

Prior research has explored medical text simplification but has not focused on helping patients understand clinical documents in a QA format. Aali et al. [7] developed a public dataset that converts MIMIC hospital course summaries into concise discharge letters. Campillos-Llanos et al. [8] created a Spanish dataset for simplifying clinical trial texts, demonstrating the importance of multilingual resources. Trienes et al. [9] focused on making pathology reports more understandable for patients, though their dataset remains private and does not address everyday clinical questions. Similarly, while the MeQSum dataset [10] transforms consumer health questions into brief medical queries, it lacks a strong clinical focus.

Our work addresses these limitations by introducing a public, patient-centered QA dataset based on clinical MIMIC-IV discharge summaries, creating a benchmark to evaluate LLMs.


Methods

In the MIMIC-IV dataset [11], some discharge summaries conclude with a discharge letter that summarizes key information and follow-up instructions in patient-friendly language. We used these discharge letters as the foundation for generating QA pairs in the following manner:

First, we identified discharge summaries containing discharge letters by searching for the string "You were admitted to the hospital", which marks the start of a discharge letter.
We split each discharge letter into sentences using Meta's Llama-3-70B-Instruct [12], which proved more accurate than traditional sentence splitters such as NLTK, especially for irregular formatting and the placeholders introduced by anonymization. To ensure accuracy, we prompted the LLM to preserve the original sentence structure and wording, and we verified this by confirming that each processed sentence occurs verbatim in its source discharge letter (exact string matching).
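
The sketch below illustrates these two steps: filtering notes for the marker string and accepting an LLM-produced sentence split only if every sentence occurs verbatim in the source letter. File paths and column names are assumptions for illustration, not part of the released pipeline.

    import pandas as pd

    MARKER = "You were admitted to the hospital"

    def find_summaries_with_letters(discharge_csv: str) -> pd.DataFrame:
        """Keep only discharge summaries whose text contains a discharge letter."""
        notes = pd.read_csv(discharge_csv)  # assumed path to the MIMIC-IV-Note discharge table
        return notes[notes["text"].str.contains(MARKER, regex=False, na=False)]

    def verify_sentence_splits(letter_text: str, sentences: list[str]) -> bool:
        """Accept an LLM-produced split only if every sentence occurs verbatim in the letter."""
        return all(sentence in letter_text for sentence in sentences)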

Afterwards, we fed these sentences into Llama-3-70B-Instruct to generate matching questions from a patient's perspective. The LLM was allowed to reformulate the answer to match the question, but was instructed to reference the source sentence. We then checked these references to confirm that no information from the source document was altered. Since the answers are directly derived from the discharge letters written by medical professionals, this method maintains both medical accuracy and patient-friendly language.
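
A hedged sketch of the question-generation call is shown below. The prompt wording and the locally hosted, OpenAI-compatible endpoint are illustrative assumptions; the pipeline only requires that Llama-3-70B-Instruct produce a patient-style question for each sentence while referencing the source sentence.

    import requests

    PROMPT_TEMPLATE = (
        "You are given one sentence from a patient's discharge letter.\n"
        "Write a question the patient might ask that this sentence answers.\n"
        "You may rephrase the sentence as the answer, but quote the source sentence.\n\n"
        "Sentence: {sentence}"
    )

    def generate_question(sentence: str) -> str:
        # Assumed local deployment exposing an OpenAI-compatible chat endpoint.
        response = requests.post(
            "http://localhost:8000/v1/chat/completions",
            json={
                "model": "meta-llama/Meta-Llama-3-70B-Instruct",
                "messages": [{"role": "user", "content": PROMPT_TEMPLATE.format(sentence=sentence)}],
                "temperature": 0.0,
            },
            timeout=120,
        )
        return response.json()["choices"][0]["message"]["content"]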

All resulting QA candidates were then manually reviewed by a single physician who selected high-quality examples based on the following criteria:

  • Factual correctness: Question-answer pairs had to be logically connected. Answers that did not match their questions (e.g., "What medication should I avoid taking due to a possible allergy?" - "You were prescribed ibuprofen") were excluded.

  • Completeness: Answers had to be complete. Partial answers (e.g., "What medications were started for me?" - "You were started on Vancomycin 1gm IV every 24 hours" when additional antibiotics were prescribed) were discarded.

  • Safety: Answers needed to be safe. Potentially harmful instructions (e.g., "Take Coumadin 3 mg daily" without mentioning INR monitoring) were excluded.

  • Consistency: Questions had to be answerable from both the discharge letter and discharge summary. Questions whose answers relied solely on information from the discharge letter were excluded.

  • Complexity: Question-answer pairs had to be sufficiently complex. Obvious answers or overly specific questions that gave the answer away (e.g., "Did I receive Ciprofloxacin?" - "You received Ciprofloxacin.") were excluded.

As a final step, we removed the discharge letters from their summaries and combined the remaining summaries with their matching QA pairs. This resulted in three components, which together form MeDiSumQA (a minimal assembly sketch follows the list):

  • A question that serves as input for LLMs.

  • An abbreviated discharge summary without the discharge letter that LLMs use to answer the input question.

  • A ground truth answer for comparison with generated responses.
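
The following minimal sketch shows how such a record could be assembled, assuming the discharge letter is removed by truncating the summary at the marker string used earlier; the exact removal logic of the released pipeline may differ.

    def build_record(note_id: str, full_summary: str, question: str, answer: str) -> dict:
        marker = "You were admitted to the hospital"
        # Assumption: everything from the marker onward belongs to the discharge letter.
        abbreviated_summary = full_summary.split(marker, 1)[0].rstrip()
        return {
            "note_id": note_id,
            "Question": question,
            "Answer": answer,
            "discharge_summary": abbreviated_summary,
        }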


Data Description

Initially, we generated 500 QA pairs, which were reduced to 416 pairs after manual curation.

This dataset consists of structured question-answer pairs derived from discharge summaries in the MIMIC-IV-Note database. Each record is represented as a JSON object with the following fields (a loading sketch follows the list):

  • note_id (string): A unique identifier referencing a discharge summary note in the MIMIC-IV-Note dataset.

  • Question (string): A natural language question regarding the patient's hospital stay.

  • Answer (string): A natural language response extracted from the corresponding discharge summary, providing clinically relevant information.

  • discharge_summary (string, optional): The full text of the discharge summary from MIMIC-IV-Note without the discharge letter, serving as the source document for the question-answer pair.
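
A minimal sketch for reading the records is shown below; it assumes the release ships one JSON object per line (JSON Lines) and uses a hypothetical file name, so adjust both to the actual distribution format.

    import json

    def load_medisumqa(path: str) -> list[dict]:
        """Read one JSON record per line into a list of dictionaries."""
        records = []
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    records.append(json.loads(line))
        return records

    records = load_medisumqa("medisumqa.jsonl")  # hypothetical file name
    example = records[0]
    print(example["note_id"], example["Question"], example["Answer"])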

Usage Notes

MeDiSumQA provides a valuable resource for evaluating large language models (LLMs) in generating patient-friendly medical information. This dataset is particularly useful for researchers, developers, and healthcare professionals seeking to improve the accessibility of clinical documents for patients. Below, we outline its potential applications, previous use cases, limitations, and complementary resources that may enhance its utility.

Prior Use Cases:

MeDiSumQA has been used to benchmark LLMs [13–15], particularly in assessing their ability to extract relevant and comprehensible information from clinical discharge summaries. Our findings suggest that general-purpose LLMs often outperform biomedical-adapted models in this domain. Additionally, our study demonstrated strong correlations between automated metrics and human judgment, supporting the dataset's reliability for further research.

Reuse Potential:

  • LLM Evaluation: Researchers developing or fine-tuning LLMs can use MeDiSumQA as a benchmark dataset to assess model performance in patient-centered medical QA tasks (see the scoring sketch after this list).

  • Medical Text Simplification: The dataset can support research on simplifying clinical documentation, ensuring patients receive clear and actionable healthcare information.

  • Healthcare Chatbots: Developers of AI-driven healthcare assistants can utilize MeDiSumQA to train models that answer patient questions based on structured clinical data.

  • Explainability Research: The dataset facilitates studies on how LLMs extract and present information, contributing to advancements in AI transparency and trustworthiness in healthcare applications.
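
As one example of such an evaluation, the hedged sketch below scores model outputs against the ground-truth answers with ROUGE-L via the rouge-score package; the metric choice and the answer_question() callable are illustrative assumptions rather than the exact protocol used in the paper.

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

    def evaluate(records, answer_question):
        """answer_question(question, discharge_summary) returns a model-generated answer."""
        scores = []
        for record in records:
            prediction = answer_question(record["Question"], record["discharge_summary"])
            result = scorer.score(record["Answer"], prediction)  # (target, prediction)
            scores.append(result["rougeL"].fmeasure)
        return sum(scores) / len(scores)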

Limitations:

  • Dataset Scope: MeDiSumQA is derived exclusively from MIMIC-IV discharge summaries, limiting its applicability to other types of clinical notes or healthcare settings.

  • Data Size: The manually curated dataset contains 416 QA pairs, which, while high quality, is not sufficient for training large-scale models.

  • Domain-Specificity: While discharge letters are written in patient-friendly language, some QA pairs may still require further simplification for individuals with low health literacy.

  • Generalization to Non-English Texts: The dataset is in English, making it less directly applicable to multilingual settings without translation or adaptation efforts.

By making MeDiSumQA publicly available on PhysioNet, we aim to foster innovation in LLM-driven patient education and healthcare communication, ultimately enhancing patient understanding and engagement in their care.


Ethics

All data generation and evaluation were conducted in a secure environment with locally hosted models to ensure data privacy and protection. All access and use were based on credentialed, de-identified data, as permitted by the original data providers.


Acknowledgements

This project builds on the MIMIC-IV dataset [11]. We thank the MIMIC-IV authors for making this work possible.


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Pampari A, Raghavan P, Liang J, Peng J (2018). "emrQA: a large corpus for question answering on electronic medical records". In: Riloff E, Chiang D, Hockenmaier J, Tsujii J, editors. Proceedings of the 2018 conference on empirical methods in natural language processing. Brussels, Belgium: Association for Computational Linguistics; 2018. p. 2357–68. Available from: https://aclanthology.org/D18-1258/
  2. Lehman E, Lialin V, Legaspi KE, Sy AJ, Pile PT, Alberto NR, et al. (2022). "Learning to ask like a physician". In: Naumann T, Bethard S, Roberts K, Rumshisky A, editors. Proceedings of the 4th clinical natural language processing workshop. Seattle, WA: Association for Computational Linguistics; 2022. p. 74–86. Available from: https://aclanthology.org/2022.clinicalnlp-1.8/
  3. Soni S, Gudala M, Pajouhi A, Roberts K (2022). "RadQA: a question answering dataset to improve comprehension of radiology reports". In: Calzolari N, Béchet F, Blache P, Choukri K, Cieri C, Declerck T, et al., editors. Proceedings of the thirteenth language resources and evaluation conference. Marseille, France: European Language Resources Association; 2022. p. 6250–9. Available from: https://aclanthology.org/2022.lrec-1.672/
  4. Bardhan J, Colas A, Roberts K, Wang DZ (2022). "DrugEHRQA: a question answering dataset on structured and unstructured electronic health records for medicine related queries". In: Calzolari N, Béchet F, Blache P, Choukri K, Cieri C, Declerck T, et al., editors. Proceedings of the thirteenth language resources and evaluation conference. Marseille, France: European Language Resources Association; 2022. p. 1083–97. Available from: https://aclanthology.org/2022.lrec-1.117/
  5. Dada A, Ufer TL, Kim M, Hasin M, Spieker N, Forsting M, et al. (2024). "Information extraction from weakly structured radiological reports with natural language queries". Eur Radiol. 2024;34(1):330–7
  6. Kweon S, Kim J, Kwak H, Cha D, Yoon H, Kim KH, et al. (2024). "EHRNoteQA: An LLM benchmark for real-world clinical practice using discharge summaries". In: The Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. 2024. Available from: https://openreview.net/forum?id=XrKhwfPmyI
  7. Aali A, Van Veen D, Arefeen YI, Hom J, Bluethgen C, Reis EP, et al. (2024). "A dataset and benchmark for hospital course summarization with adapted large language models". J Am Med Inform Assoc. 2024;ocae312
  8. Campillos-Llanos L, Reinares ART, Puig SZ, Valverde-Mateos A, Capllonch-Carrión A (2022). "Building a comparable corpus and a benchmark for Spanish medical text simplification". Proces Leng Nat. 2022;69:189–96
  9. Trienes J, Schlötterer J, Schildhaus HU, Seifert C (2022). "Patient-friendly clinical notes: Towards a new text simplification dataset". In: Štajner S, Saggion H, Ferrés D, Shardlow M, Sheang KC, North K, et al., editors. Proceedings of the workshop on text simplification, accessibility, and readability (TSAR-2022). Abu Dhabi, United Arab Emirates (Virtual): Association for Computational Linguistics; 2022. p. 19–27. Available from: https://aclanthology.org/2022.tsar-1.3/
  10. Ben Abacha A, Demner-Fushman D (2019). "On the summarization of consumer health questions". In: Korhonen A, Traum D, Màrquez L, editors. Proceedings of the 57th annual meeting of the association for computational linguistics. Florence, Italy: Association for Computational Linguistics; 2019. p. 2228–34. Available from: https://aclanthology.org/P19-1215/
  11. Johnson AE, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, et al. (2023). "MIMIC-IV, a freely accessible electronic health record dataset". Sci Data. 2023;10(1):1
  12. Dubey A, Jauhri A, Pandey A, Kadian A, Al-Dahle A, Letman A, et al. (2024). "The Llama 3 herd of models". arXiv preprint arXiv:2407.21783. 2024
  13. Dada A, Bauer M, Contreras AB, Koraş OA, Seibold CM, Smith KE, et al. (2024). "Does biomedical training lead to better medical performance?". arXiv preprint arXiv:2404.04067. 2024
  14. Dorfner FJ, Dada A, Busch F, Makowski MR, Han T, Truhn D, et al. (2024). "Biomedical large languages models seem not to be superior to generalist models on unseen medical data". arXiv preprint arXiv:2408.13833. 2024
  15. Dada A, Koras OA, Bauer M, Butler A, Smith KE, Kleesiek J, et al. (2025). "MeDiSumQA: Patient-oriented question-answer generation from discharge letters". arXiv preprint arXiv:2502.03298. 2025

Parent Projects
MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters was derived from its parent projects; please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

Discovery

DOI (version 1.0.0):
https://doi.org/10.13026/f566-h049

DOI (latest version):
https://doi.org/10.13026/tvmt-8d97
