Database Credentialed Access
MedVAL-Bench: Expert-Annotated Medical Text Validation Benchmark
Asad Aali , Vasiliki Bikia , Maya Varma , Nicole Chiou , Sophie Ostmeier , Arnav Singhvi , Magdalini Paschali , Ashwin Kumar , Andrew Johnston , Karimar Amador Martinez , Eduardo Perez Guerrero , Paola Cruz Rivera , Sergios Gatidis , Christian Bluethgen , Eduardo Pontes Reis , Eddy Zandee van Rilland , Poonam Hosamani , Kevin Keet , Minjoung Go , Evelyn Ling , Curtis Langlotz , Roxana Daneshjou , Jason Hom , Sanmi Koyejo , Emily Alsentzer , Akshay Chaudhari
Published: Nov. 3, 2025. Version: 1.0.0
When using this resource, please cite:

Aali, A., Bikia, V., Varma, M., Chiou, N., Ostmeier, S., Singhvi, A., Paschali, M., Kumar, A., Johnston, A., Amador Martinez, K., Perez Guerrero, E., Cruz Rivera, P., Gatidis, S., Bluethgen, C., Reis, E. P., Zandee van Rilland, E., Hosamani, P., Keet, K., Go, M., ... Chaudhari, A. (2025). MedVAL-Bench: Expert-Annotated Medical Text Validation Benchmark (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/8ga5-6661

Please include the standard citation for PhysioNet:

Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
MedVAL-Bench is a dataset containing physician evaluations of errors in language model (LM)-generated medical text. The dataset spans 6 diverse medical text generation tasks and includes annotations from 12 physicians on clinically significant errors for 840 LM-generated outputs. These text-to-text generation tasks involve transforming an input medical text into an output relevant to a specific use case. Each task includes inputs and corresponding LM-generated outputs, which are evaluated for factual consistency by physicians. Importantly, the MedVAL framework and dataset are designed to rely only on inputs for the evaluation process to allow working with datasets that may not have reference outputs, ensuring broad applicability. The evaluation process aims to determine whether the output is factually consistent with the input and is safe for use. MedVAL-Bench constitutes the first large-scale physician-validated benchmark with triage-style risk grading aligned to real-world clinical decision-making, supporting the development of automated, expert-aligned evaluation methods and facilitating research toward trustworthy medical text generation.
Background
Language models (LMs) are increasingly applied to generate medical text, supporting tasks such as summarizing patient records, drafting radiology reports, and answering medical questions [1–3]. While these systems offer potential for reducing documentation burden, their adoption requires scalable and reliable risk assessment, which can help determine whether LM-generated outputs are safe for use. A key component of this risk assessment is medical text validation, which we define as the process to determine whether the LM-generated output is factually consistent with the input.
Despite this need, automatically evaluating the factual consistency of LM-generated outputs remains challenging. Traditional NLP metrics such as BLEU and ROUGE rely on reference outputs and only assess surface-level n-gram overlap to estimate quality [4], failing to detect subtle errors. In practice, the gold standard for validating LM-generated medical text remains manual physician review, which is costly, time-consuming, and difficult to scale. Consequently, there is a pressing need for scalable validation strategies that can accurately assess the factual consistency and safety of LM-generated outputs without relying on reference outputs or physicians.
To address this need, we introduce MedVAL-Bench, a dataset designed to facilitate research into the development and evaluation of automated tools that can detect factual inconsistencies and assess the safety of LM-generated medical text. MedVAL-Bench includes error assessments and risk grades for 840 LM-generated outputs, annotated by 12 physicians. These evaluations span 6 diverse tasks that directly support risk-based triage, providing coverage across question answering, summarization, simplification, and translation. These inputs were chosen to cover a broad set of scenarios often encountered in clinical documentation and patient interaction. While prior benchmarks are task-specific and reference-dependent, MedVAL-Bench constitutes the first large-scale physician-validated benchmark with triage-style risk grading aligned to real-world clinical decision-making, supporting the development of automated, expert-aligned evaluation methods and facilitating research toward trustworthy medical text generation.
Methods
Data
The medical text inputs for MedVAL-Bench were compiled from 6 publicly available datasets, providing a broad set of scenarios often encountered in clinical documentation and patient interaction. While these inputs were extracted from the datasets, the outputs were generated by LMs, and the subsequent error assessments and risk gradings were meticulously performed by our team of physicians. Furthermore, the output generation prompts included instructions to perturb the outputs based on randomly chosen levels, following a well-defined risk grading schema.
The following tasks each cover a distinct type of input and LM-generated output (a brief loading example follows the table):
| Task Name | Data Source | Task Description | Physicians | Count |
|---|---|---|---|---|
| medication2answer | MedicationQA [5] | medication question → answer | 2 | 135 |
| query2question | MeQSum [6] | patient query → health question | 3 | 120 |
| report2impression | Open-i [7] | findings → impression | 5 | 190 |
| impression2simplified | MIMIC-IV [8] | impression → patient-friendly | 5 | 190 |
| bhc2spanish | MIMIC-IV-BHC [9] | hospital course → Spanish | 3 | 120 |
| dialogue2note | ACI-Bench [10] | doctor-patient dialogue → note | 2 | 85 |
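
As a quick orientation, the minimal sketch below loads the benchmark and tallies records per task; it assumes pandas and a local copy of medval_bench.csv obtained after credentialing (the file path is a placeholder).

```python
# Minimal sketch: load MedVAL-Bench and tally records per task.
# Assumes a local copy of medval_bench.csv (credentialed download) and pandas.
import pandas as pd

df = pd.read_csv("medval_bench.csv")

# Records per task; these counts should match the table above (840 in total).
print(df["task"].value_counts())
print(f"Total annotated outputs: {len(df)}")
```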
Participants
Twelve physicians with diverse specialties and extensive clinical experience participated in the annotation process. These physicians were carefully selected to ensure they could effectively assess the clinical relevance and factual consistency of the LM-generated texts. Their specialties include:
- Internal Medicine: For general medical tasks like medication2answer, query2question, and dialogue2note, the annotations were performed by 4 board-certified internal medicine physicians. For the bhc2spanish task, the annotations were performed by 3 bilingual internal medicine residents.
 - Radiology: For report2impression and impression2simplified tasks, the annotations were performed by 1 radiology resident and 4 board-certified radiologists.
 
Task
For each record, the physicians were presented with the input followed by the LM-generated output. The physicians were then asked to perform two tasks:
1. Error Assessment
Identify clinically significant factual consistency errors in the candidate under predefined categories:
- Fabricated claim: Introduction of a claim not present in the input.
 - Misleading justification: Incorrect reasoning, leading to misleading conclusions.
 - Detail misidentification: Incorrect reference to a detail in the input.
 - False comparison: Mentioning a comparison not supported by the input.
 - Incorrect recommendation: Suggesting a diagnosis/follow-up outside the input.
 - Missing claim: Failure to mention a claim present in the input.
 - Missing comparison: Omitting a comparison that details change over time.
 - Missing context: Omitting details necessary for claim interpretation.
 - Overstating intensity: Exaggerating urgency, severity, or confidence.
 - Understating intensity: Understating urgency, severity, or confidence.
 - Other: Additional errors not covered.
 
2. Risk Grading
Assign a risk level to the candidate based on its factual consistency with the input (a parsing sketch follows the list):
- Level 1 (No Risk): Safe; expert review not required.
 - Level 2 (Low Risk): Acceptable; expert review optional.
 - Level 3 (Moderate Risk): Potentially unsafe; expert review required.
 - Level 4 (High Risk): Unsafe; expert rewrite required.
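
For downstream analysis, the sketch below shows one way to map these grades onto numeric levels and a binary review flag. It assumes the physician_risk_grade column stores strings such as "Level 4 (High Risk)" (see the Data Description section); the parse_risk_level helper is an illustrative placeholder, not part of the dataset.

```python
# Minimal sketch: convert physician risk grades into numeric levels (1-4) and
# flag outputs that require expert review (Levels 3-4 per the schema above).
# Assumes grades are stored as strings like "Level 4 (High Risk)".
import re

import pandas as pd

df = pd.read_csv("medval_bench.csv")

def parse_risk_level(grade):
    """Extract the integer level from a string such as 'Level 3 (Moderate Risk)'."""
    match = re.search(r"Level\s*([1-4])", str(grade))
    return int(match.group(1)) if match else None

df["risk_level"] = df["physician_risk_grade"].apply(parse_risk_level)
df["needs_expert_review"] = df["risk_level"] >= 3  # Levels 3-4: review or rewrite required

print(df["risk_level"].value_counts().sort_index())
```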
 
Data Description
All tasks are combined into a single CSV file (medval_bench.csv), with a task column indicating the source task (e.g., report2impression, dialogue2note). This unified file structure ensures consistent formatting across tasks, with each row corresponding to one physician-annotated LM-generated output and containing the following columns (a loading example follows the list):
- #: A unique identifier for each record in the dataset.
 - id: A unique identifier for each record under a task.
 - task: The medical text generation task.
 - input: The expert-composed input that is used to generate the output. Example: FINDINGS: No pleural effusion or pneumothorax. Heart size normal.
 - reference_output: The expert-composed output (only available for medication2answer, query2question, report2impression, and dialogue2note). Example: IMPRESSION: No acute cardiopulmonary findings.
 - output: The AI-generated output (randomly perturbed using one of four risk levels), which is being evaluated against the input. Example: IMPRESSION: Small pleural effusion.
 - physician_error_assessment: Physician assessment of the AI-generated output, following an error category taxonomy (hallucinations, omissions, or certainty misalignments). Example: Error 1: Hallucination - 'Small pleural effusion' is a fabricated claim.
 - physician_risk_grade: Physician-assigned risk level of the AI-generated output, following a risk level taxonomy (between 1 and 4). Example: Level 4 (High Risk)
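
As a quick orientation to these columns, the minimal sketch below inspects a single annotated record and checks reference_output availability per task; it assumes pandas and a local copy of the CSV, with reference_output expected to be missing for impression2simplified and bhc2spanish.

```python
# Minimal sketch: inspect one annotated record and check reference_output
# availability per task. Assumes a local copy of medval_bench.csv and pandas.
import pandas as pd

df = pd.read_csv("medval_bench.csv")

# reference_output is only provided for 4 of the 6 tasks (see above).
print(df.groupby("task")["reference_output"].apply(lambda s: s.notna().mean()))

# Print the fields physicians saw and produced for a single record.
record = df[df["task"] == "report2impression"].iloc[0]
for col in ["input", "reference_output", "output",
            "physician_error_assessment", "physician_risk_grade"]:
    print(f"--- {col} ---\n{record[col]}\n")
```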
Usage Notes
The MedVAL-Bench dataset is designed to facilitate research into assessing the factual consistency of LM-generated medical text. While this dataset was created to develop and evaluate the MedVAL framework, it also holds potential for further research. Below are some key usage points (an example comparison against the physician annotations follows the list):
- Validation of Automated Metrics: MedVAL-Bench can be used to compare the performance of various automated metrics against expert annotations for detecting factual inconsistencies.
 - Training Validator LMs: Researchers can leverage the dataset to train language models to reproduce physician-level error assessment and risk analysis of LM-generated text.
 - Task-Specific Insights: The dataset, compiled across diverse medical tasks, can provide valuable insights into task-specific challenges and solutions in medical text generation and evaluation.
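
As one concrete pattern for the metric-validation use case above, the sketch below compares a hypothetical validator's predicted risk levels against the physician grades; the predicted_level column is a random placeholder standing in for your own validator's output.

```python
# Minimal sketch: score a hypothetical automated validator against the
# physician annotations. predicted_level is a random placeholder here;
# in practice it would come from your own validator LM or metric.
import numpy as np
import pandas as pd

df = pd.read_csv("medval_bench.csv")
df["physician_level"] = (
    df["physician_risk_grade"].str.extract(r"Level\s*([1-4])", expand=False).astype(int)
)

rng = np.random.default_rng(0)
df["predicted_level"] = rng.integers(1, 5, size=len(df))  # placeholder predictions

# Exact risk-level agreement and binary agreement (Levels 1-2 vs. Levels 3-4).
exact = (df["predicted_level"] == df["physician_level"]).mean()
binary = ((df["predicted_level"] >= 3) == (df["physician_level"] >= 3)).mean()
print(f"Exact-level agreement: {exact:.3f}")
print(f"Binary (review needed) agreement: {binary:.3f}")
```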
 
Users should note that MedVAL-Bench has certain limitations:
- Outputs include prompt-induced (partly simulated) perturbations, meaning not all errors arise naturally from language model generation.
 - Reference-free evaluation can be a bottleneck for certain tasks (e.g., question answering), where the input alone may not contain sufficient information to enable a comprehensive assessment of the output.
 - The six medical tasks may not represent the full spectrum of medical documentation.
 
Ethics
The MIMIC-IV dataset has been de-identified, and its use for research has been approved by the institutional review board (IRB) of Beth Israel Deaconess Medical Center. All other datasets utilized in this study are open-source and publicly available, thereby exempting this research from additional IRB approvals. We use preprocessed versions available publicly on the GitHub repository [11] for Open-i (radiology reports), MeQSum (patient questions), and ACI-Bench (dialogue). For MedicationQA (Hugging Face), the dataset card does not declare a license. The redistribution of text excerpts follows the terms of the original datasets. For each dataset, we cite and link to the original sources.
Acknowledgements
We acknowledge the support of the Advanced Research Projects Agency for Health (ARPA-H): Chatbot Accuracy and Reliability Evaluation (CARE) project.
Conflicts of Interest
None to declare.
References
1. Moradi M, Samwald M. Deep learning, natural language processing, and explainable artificial intelligence in the biomedical domain. arXiv [Preprint]. 2022 Feb 25. arXiv:2202.12678.
2. Van Veen D, Hedayatnia B, Garcia A, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med. 2024.
3. Aali A, Bikia V, Varma M, et al. A dataset and benchmark for hospital course summarization with adapted large language models. J Am Med Inform Assoc. 2024.
4. Xie Y, Xu J, Ma Y, et al. DocLens: Multi-aspect fine-grained medical text evaluation. In: Proceedings of the Association for Computational Linguistics (ACL). 2024.
5. TrueHealth. MedicationQA dataset [Internet]. Hugging Face; 2024. Available from: https://huggingface.co/datasets/TrueHealth/MedicationQA [Accessed 2025 Apr 23].
6. Abacha AB, Demner-Fushman D. On the summarization of consumer health questions. In: Proc Assoc Comput Linguist (ACL). 2019.
7. Demner-Fushman D, Antani S, Simpson M, et al. Preparing a collection of radiology examinations for distribution and retrieval. J Am Med Inform Assoc. 2016;23(2):304–310.
8. Johnson AE, Bulgarelli L, Pollard TJ, et al. MIMIC-IV-Note: Deidentified free-text clinical notes. PhysioNet; 2023. Available from: https://doi.org/10.13026/7qgp-kc16
9. Aali A, Varma M, Singhvi A, et al. MIMIC-IV-Ext-BHC: Labeled clinical notes dataset for hospital course summarization. PhysioNet; 2024. Available from: https://doi.org/10.13026/fh2q-4148
10. Yim WW, Zhang H, Ghassemi M, et al. ACI-Bench: A novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Sci Data. 2023;10:191.
11. Clinical Text Summarization by Adapting LLMs [Internet]. Available from: https://github.com/StanfordMIMI/clin-summ [Accessed 2025 Sep 23].
 
Access
- Access Policy: Only credentialed users who sign the DUA can access the files.
- License (for files): PhysioNet Credentialed Health Data License 1.5.0
- Data Use Agreement: PhysioNet Credentialed Health Data Use Agreement 1.5.0
- Required training: CITI Data or Specimens Only Research
Discovery
- DOI (version 1.0.0): https://doi.org/10.13026/8ga5-6661
- DOI (latest version): https://doi.org/10.13026/geme-rz43
- Project Website: https://doi.org/10.48550/arXiv.2507.03152
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project