Database Open Access

ReXErr-v1: Clinically Meaningful Chest X-Ray Report Errors Derived from MIMIC-CXR

Vishwanatha Rao, Serena Zhang, Julian Acosta, Subathra Adithan, Pranav Rajpurkar

Published: March 19, 2025. Version: 1.0.0


When using this resource, please cite:
Rao, V., Zhang, S., Acosta, J., Adithan, S., & Rajpurkar, P. (2025). ReXErr-v1: Clinically Meaningful Chest X-Ray Report Errors Derived from MIMIC-CXR (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/9dns-vd94

Additionally, please cite the original publication:

Rao, V. M., Zhang, S., Acosta, J. N., Adithan, S., & Rajpurkar, P. (2024). ReXErr: Synthesizing Clinically Meaningful Errors in Diagnostic Radiology Reports. In Biocomputing 2025: Proceedings of the Pacific Symposium (pp. 70-81).

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

Interpreting medical images and writing radiology reports is a critical yet challenging task in healthcare. Despite their importance, both human-written and AI-generated reports are prone to errors, creating a need for robust and representative datasets that capture the diversity of errors arising across different modes of report generation. We therefore present Chest X-Ray Report Errors (ReXErr-v1), a new dataset based on MIMIC-CXR and constructed using large language models (LLMs), containing synthetic error reports for the majority of MIMIC-CXR. Developed with input from board-certified radiologists, ReXErr-v1 contains plausible errors that closely mimic those found in real-world scenarios. Furthermore, ReXErr-v1 uses a novel sampling methodology that selects three errors to inject per report from a set of errors frequently made by both humans and AI models. We include errors at both the report and sentence level, improving the versatility of ReXErr-v1. Our dataset can enhance future AI reporting tools by aiding the development and evaluation of report-generation and error-screening algorithms.


Background

Radiology reports are essential tools for medical decision-making, but creating them is challenging even for trained specialists [1-3]. Human radiologists can make errors due to fatigue, high workloads, or the inherently subjective nature of image interpretation, from misreading scans to omitting critical details [4,5]. While recent advances in deep learning show promise in automating report generation, AI systems face their own challenges [6,7]. These automated systems can make mistakes ranging from minor reference errors to dangerous clinical oversights, often stemming from inherent algorithmic biases, model constraints, and limitations in the training data. Errors range from references to nonexistent priors, which are relatively easy to detect, to false predictions or omissions, which are far more clinically problematic and often go unnoticed. The prevalence of errors in both radiologist-written and AI-generated reports leaves a clear need for more comprehensive tools that can screen for and correct them [8,9].

We release a dataset called Chest X-Ray Report Errors (ReXErr-v1), which leverages GPT-4o to systematically inject realistic errors into MIMIC-CXR radiology reports. Using GPT-4o and reports from MIMIC-CXR [10], we created two distinct components: report-level errors, where each report contains three carefully sampled errors from 12 possible categories, and sentence-level spliced errors, which pair original and error-containing sentences for detailed analysis. The error categories, developed in consultation with three board-certified radiologists, span three subcategories (content addition, context-dependent, and linguistic quality errors) and reflect both common human mistakes and AI-generated report errors. ReXErr-v1 aims to provide researchers with rich training data for developing robust error detection and correction algorithms, with each error carefully labeled to support both report-wide and sentence-level analysis tasks.


Methods

General Pipeline

To create ReXErr-v1, we worked with several clinicians and board-certified radiologists to brainstorm error categories that encompass common human and AI-model errors. We also collaborated with them to develop an extensive prompting methodology that uses GPT-4o to inject three sampled errors per report. After generating error reports from the MIMIC-CXR dataset, we spliced each report into individual sentences and performed post-hoc error labeling with Llama 3.1. See [11] for more details on the methodology used to develop ReXErr-v1, including the full-form prompts used for the LLMs and the sampling framework.

Error Categories and Prompting

The errors chosen fall under two broad types: AI-generated report errors and human errors. Within each of these types, we define three categories: content addition, context-dependent, and linguistic quality errors. We made this distinction because the types of errors within each subcategory depend on the medium of report generation (AI vs. human), and we wanted to maintain representation of errors across all possible categories. The error categories are as follows: "Add Medical Device", "False Prediction", "False Negation", "Add Repetitions", "Add Contradictions", "Change Name of Device", "Change Position of Device", "Change Severity", "Change Location", "Change Measurement", "Change to Homophone", and "Add Typo". Notably, we used two types of linguistic quality errors to cover both human and AI-model errors: repetitions and contradictions represent AI-model errors, while homophone and typo errors represent clinician errors. See [11] for further information on how these categories were chosen.
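
As an illustration, the taxonomy can be represented as a simple mapping from sampling category to error names. The grouping below is a sketch reconstructed from the description above and the context-dependent tags discussed in the Sampling Scheme subsection; it is not taken from the authors' released code.

```python
# Sketch of the ReXErr-v1 error taxonomy. The assignment of error names to the
# three sampling categories is inferred from the text, not the authors' exact
# configuration.
ERROR_CATEGORIES = {
    "content_addition": [
        "Add Medical Device",
        "False Prediction",
        "False Negation",
    ],
    "context_dependent": [
        "Change Name of Device",
        "Change Position of Device",
        "Change Severity",
        "Change Location",
        "Change Measurement",
    ],
    "linguistic_quality": [
        "Add Repetitions",      # AI-model error
        "Add Contradictions",   # AI-model error
        "Change to Homophone",  # clinician error
        "Add Typo",             # clinician error
    ],
}
```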

To ensure we generated plausible errors, we constructed our prompts for GPT-4o with consistent feedback from clinicians in a systematic manner. We first iterated upon a base prompt for each error category and evaluated the quality of errors generated accordingly. In order to further improve the error quality, we incorporated examples of both model-generated and radiologist-provided errors within the prompts in an iterative manner, where we evaluated the error outputs following the addition of each group of examples to the prompt.

Sampling Scheme

We devised a novel sampling methodology to maintain the diversity and relevance of the injected errors. For each report, we sample three errors, one from each of the categories defined above: content addition, context-dependent, and linguistic quality errors. Each of these categories contains errors frequently made by both AI models and humans, and an error within a particular category is chosen at random. For the context-dependent category, we use a regex-based labeling method to ensure that the particular change made is consistent with the context provided in the report. Specifically, we tag each report with device, measurement, location, or severity indicators, where a report can carry multiple tags. When sampling the context-dependent error, we sample only from the specific errors that correspond to the tags present for a report, again choosing at random among the available subset of errors.
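
A minimal sketch of this sampling scheme follows, assuming hypothetical regex patterns for the context-dependent tags; the authors' actual patterns and sampling code are available in the linked repository [12].

```python
import random
import re

# Hypothetical regex patterns for the context-dependent tags (device, severity,
# location, measurement). These are illustrative only.
CONTEXT_PATTERNS = {
    "Change Name of Device":     r"\b(tube|catheter|line|pacemaker|drain)\b",
    "Change Position of Device": r"\b(tip|terminates|courses|positioned)\b",
    "Change Severity":           r"\b(mild|moderate|severe|small|large|trace)\b",
    "Change Location":           r"\b(right|left|upper|lower|bilateral|basilar)\b",
    "Change Measurement":        r"\b\d+(\.\d+)?\s*(cm|mm)\b",
}

CONTENT_ADDITION = ["Add Medical Device", "False Prediction", "False Negation"]
LINGUISTIC_QUALITY = ["Add Repetitions", "Add Contradictions",
                      "Change to Homophone", "Add Typo"]


def sample_errors(report_text: str) -> list[str]:
    """Sample one error from each category for a single report."""
    sampled = [random.choice(CONTENT_ADDITION)]

    # Context-dependent errors are restricted to the tags present in the report.
    available = [err for err, pattern in CONTEXT_PATTERNS.items()
                 if re.search(pattern, report_text, flags=re.IGNORECASE)]
    if available:
        sampled.append(random.choice(available))

    sampled.append(random.choice(LINGUISTIC_QUALITY))
    return sampled
```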

MIMIC-CXR Preprocessing

We generated errors for the majority of MIMIC-CXR. Specifically, we filtered MIMIC-CXR for unique reports, listing multiple image IDs alongside each report when a report is associated with multiple images in the dataset. Furthermore, we include error reports only for reports that have at least one image with a view in ['AP', 'PA', 'Lateral'], as other image views are less relevant to most report generation and screening algorithms.
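
This filtering step could be reproduced roughly as in the sketch below, assuming the standard MIMIC-CXR metadata file and its ViewPosition column; the exact preprocessing used for ReXErr-v1 may differ.

```python
import pandas as pd

# Load the MIMIC-CXR image metadata (file name assumes MIMIC-CXR-JPG v2.0.0).
meta = pd.read_csv("mimic-cxr-2.0.0-metadata.csv.gz")

# Keep only images whose view is AP, PA, or lateral.
keep_views = {"AP", "PA", "LATERAL"}
meta = meta[meta["ViewPosition"].str.upper().isin(keep_views)]

# Collapse to one row per unique report (study), listing all associated image IDs,
# sorted by ascending subject ID as in the released CSVs.
reports = (meta.groupby(["subject_id", "study_id"])["dicom_id"]
               .apply(list)
               .reset_index()
               .sort_values(["subject_id", "study_id"]))
```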

Synthesis and Sentence-Splicing

We used GPT-4o to generate the original errors at the report level due to its performance and affordability. Each report was provided alongside a long-form prompt containing sections corresponding to the particular sampled errors. Following the generation of errors at the report level, we spliced each report into sentences and used Llama 3.1 to identify the type of error associated with each sentence and to screen for priors at the same time. Each sentence was labeled 0 if it is correct, 1 if it contains an error, and 2 if it is neutral (e.g., references a prior). Error sentences are presented alongside the original sentence they were based upon, where relevant.
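
For illustration, a naive version of the splicing step and the label convention might look like the sketch below; the released dataset uses Llama 3.1 rather than rule-based splitting and labeling, so this is only an approximation.

```python
import re
from enum import IntEnum


class SentenceLabel(IntEnum):
    """Sentence-level label convention used in ReXErr-v1."""
    CORRECT = 0   # unchanged sentence
    ERROR = 1     # sentence contains an injected error
    NEUTRAL = 2   # references a prior / no clinically relevant finding


def splice_report(report_text: str) -> list[str]:
    """Naively split a report into sentences for pairing with error sentences."""
    parts = re.split(r"(?<=[.?!])\s+", report_text.strip())
    return [s.strip() for s in parts if s.strip()]
```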

Validation and Quality Assurance

In addition to frequent clinical feedback during prompt development, we conducted a clinician review after generating the dataset to validate the plausibility of the injected errors. Specifically, a clinician reviewed 100 randomly sampled pairs of original and error-injected reports to determine the fraction of error reports that could plausibly be AI-generated or human-written reports. This preliminary review found that, of the 100 synthetic reports, only 17 contained errors that were implausible in the context of the original report. Outside of expert verification, we conducted a broad manual review of the sentence splicings to ensure that the majority of error sentences were correctly matched with their original sentence.


Data Description

ReXErr contains two folders and nine total files. We provide additional documentation and code for the dataset generation in a GitHub repository [12]:

  1. ReXErr-report-level
    1. ReXErr-report-level_train.csv
    2. ReXErr-report-level_val.csv
    3. ReXErr-report-level_test.csv
  2. ReXErr-sentence-level
    1. ReXErr-sentence-level_train.csv
    2. ReXErr-sentence-level_val.csv
    3. ReXErr-sentence-level_test.csv
  3. README.md
  4. clinician-review.csv
  5. data-dictionary.txt

Here is an overview of the contents of each file provided:

ReXErr-report-level

  • ReXErr-report-level_{train/val/test}.csv contains the original and error reports from a filtered version of the MIMIC-CXR dataset, corresponding to the train, val, or test set respectively. Each row contains a unique radiology report, which may correspond to multiple images present within MIMIC-CXR. Reports are listed in ascending order of subject ID. Each row of the CSV contains the following fields (a minimal loading sketch follows this list):
    • dicom_id: Dicom ID(s) for the associated report
    • study_id: Study ID taken from MIMIC-CXR
    • subject_id: Subject ID taken from MIMIC-CXR
    • original_report: Original report taken from MIMIC-CXR
    • error_report: Report with errors injected using GPT-4o
    • errors_sampled: Errors that were sampled to create the error report. Note that the error report may not contain all of the errors sampled, and for more accurate labeling, see the sentence-level labeling.
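
For example, the report-level files can be loaded directly with pandas; the relative path below assumes the folder layout listed under Files.

```python
import pandas as pd

# Load the report-level training split and inspect one original/error pair.
train = pd.read_csv("ReXErr-report-level/ReXErr-report-level_train.csv")
row = train.iloc[0]
print(row["errors_sampled"])
print(row["original_report"])
print(row["error_report"])
```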

ReXErr-sentence-level

  • ReXErr-sentence-level_{train/val/test}.csv contains the original and error sentences derived from the corresponding ReXErr-report-level file for the train, val, or test set respectively. Each row contains a sentence present within a radiology report, with spliced sentences presented in the same consecutive order in which they appear within the original reports. Groups of sentences corresponding to a particular report are listed in ascending order of subject ID. Each row of the CSV contains the following fields (a filtering sketch follows this list):
    • dicom_id: Dicom ID(s) for the associated report
    • study_id: Study ID taken from MIMIC-CXR
    • subject_id: Subject ID taken from MIMIC-CXR
    • original_sentence: Original sentence from the given MIMIC-CXR report
    • error_sentence: Sentence from the error-injected report. Note that the sentence itself may not necessarily contain an error, but it originates from the error-injected report.
    • error_present: Indicator for whether an error is present in the sentence, where 0 corresponds to unchanged sentence, 1 corresponds to error sentence, and 2 corresponds to neutral sentence (references a prior or does not contain any clinically relevant indications/findings)
    • error_type: If an error is present within the error_sentence, the specific type of error it is
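
As an example, the sentence-level files can be filtered to the sentences that actually contain injected errors; the path below again assumes the folder layout listed under Files.

```python
import pandas as pd

# Load the sentence-level training split and keep only error-containing sentences.
sents = pd.read_csv("ReXErr-sentence-level/ReXErr-sentence-level_train.csv")
errors = sents[sents["error_present"] == 1]

# Distribution of injected error types across the split.
print(errors["error_type"].value_counts())
```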

README.md: readme file for the dataset.

clinician-review.csv: contains the results of the manual clinician review conducted on 100 randomly sampled original and error-injected reports from ReXErr. Each row of the CSV contains the following fields (a short sketch for computing the plausibility rate follows this list):

  • original_report: Original report taken from MIMIC-CXR
  • error_report: Report with errors injected using GPT-4o
  • errors_sampled: Errors that were sampled to create the error report
  • acceptable: Whether the synthetic error report was determined as plausible by the clinician, indicated by a Yes or No
  • comments: Relevant comments when the report is not plausible
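
The plausibility rate reported in the Methods section can be recomputed from this file; the sketch below assumes the acceptable column holds the literal strings "Yes" and "No" as described above.

```python
import pandas as pd

# Recompute the fraction of error reports the clinician judged plausible.
review = pd.read_csv("clinician-review.csv")
plausible = (review["acceptable"].str.strip().str.lower() == "yes").mean()
print(f"Plausible error reports: {plausible:.0%}")  # expected around 83% (17/100 implausible)
```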

data-dictionary.txt: contains a dictionary describing each of the unique column variables present across the CSV files.


Usage Notes

ReXErr can be used to study error patterns in radiology reports, train error detection systems, and benchmark automated correction algorithms. More discussion of possible use cases is provided in "ReXErr: Synthesizing Clinically Meaningful Errors in Diagnostic Radiology Reports" (PSB 2025) [11], where we introduce the original methodology used to create ReXErr.

While ReXErr's error categories were developed with radiologist input and cover common mistake patterns, researchers should note that this is a synthetic dataset with artificially injected errors. The dataset can provide valuable insights into error detection and correction strategies, but should not be used as the sole benchmark for system performance. We encourage researchers to supplement ReXErr with real-world evaluations and to consider the specific limitations and biases of synthetically generated errors when developing clinical tools. Furthermore, it is important to note that there are sentence misalignments and incompletely generated error reports present in both the sentence and report levels of the dataset. While we attempted to automatically filter and minimize such mistakes, we were unable to manually review every sentence pairing and error report generated. Generally, the report-level dataset will be more accurate given the potential for mistakes introduced within the sentence-splicing and labeling pipeline.

Other factors may limit the ability of the errors captured within ReXErr to fully encompass all possible errors made by humans and AI models. For example, some of the injected errors are unrealistic: the post-hoc clinician review found that 17% of sampled reports contained errors that were implausible in context. The sampling approach may also contribute to these limitations, as we do not account for nested compound errors (errors belonging to more than one category) or cases where sentences contain more than one error. Furthermore, ReXErr adds errors traditionally made by AI models to human-written text, so these errors may be less representative of AI-model behavior than errors injected directly into AI-generated text. Together, these limitations are important to consider when using ReXErr.


Ethics

The MIMIC-III and MIMIC-CXR datasets have been de-identified, and their use for research has been approved by the institutional review boards of both the Massachusetts Institute of Technology (protocol No. 0403000206) and Beth Israel Deaconess Medical Center (protocol No. 2001-P-001699/14).

We verify that in the creation of ReXErr, LLM inference and data generation were conducted in a secure environment to ensure data safety and privacy. While ReXErr provides valuable training data for developing and analyzing automated medical report systems, researchers should use these synthetically generated error reports with caution. The error-containing reports should never be used for clinical decision-making or treated as ground truth data. Although the injected errors are based on real-world patterns identified by radiologists, they may not fully capture the complexity of mistakes in clinical practice. We encourage researchers to be transparent about the synthetic nature of these errors and to thoroughly validate any systems developed with this dataset using real clinical data before deployment.


Acknowledgements

We would like to thank Dr. John Farner and Dr. Rohit Reddy for their valuable clinical input into the error categories and prompts chosen. Vishwanatha M. Rao and Serena Zhang contributed equally to this work.


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Côté MJ, Smith MA. Forecasting the demand for radiology services. Health Systems. 2018 May 4;7(2):79-88.
  2. Reiner BI, Knight N, Siegel EL. Radiology reporting, past, present, and future: the radiologist’s perspective. Journal of the American College of Radiology. 2007 May 1;4(5):313-9.
  3. Al Yassin A, Sadaghiani MS, Mohan S, Bryan RN, Nasrallah I. It is About "Time": Academic Neuroradiologist Time Distribution for Interpreting Brain MRIs. Academic Radiology. 2018 Dec 1;25(12):1521-5.
  4. Bruno MA, Walker EA, Abujudeh HH. Understanding and confronting our mistakes: the epidemiology of error in radiology and strategies for error reduction. Radiographics. 2015 Oct;35(6):1668-76.
  5. Brady AP. Error and discrepancy in radiology: inevitable or avoidable? Insights into Imaging. 2017 Feb;8:171-82.
  6. Zhou HY, Adithan S, Acosta JN, Topol EJ, Rajpurkar P. A generalist learner for multifaceted medical image interpretation. arXiv preprint arXiv:2405.07988. 2024 May 13.
  7. Tu T, Azizi S, Driess D, Schaekermann M, Amin M, Chang PC, Carroll A, Lau C, Tanno R, Ktena I, Palepu A. Towards generalist biomedical AI. NEJM AI. 2024 Feb 22;1(3):AIoa2300138.
  8. Messina P, Pino P, Parra D, Soto A, Besa C, Uribe S, Andía M, Tejos C, Prieto C, Capurro D. A survey on deep learning and explainability for automatic report generation from medical images. ACM Computing Surveys (CSUR). 2022 Sep 14;54(10s):1-40.
  9. Sloan P, Clatworthy P, Simpson E, Mirmehdi M. Automated radiology report generation: A review of recent advances. IEEE Reviews in Biomedical Engineering. 2024 Jun 3.
  10. Johnson AE, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng CY, Mark RG, Horng S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data. 2019 Dec 12;6(1):317.
  11. Rao VM, Zhang S, Acosta JN, Adithan S, Rajpurkar P. ReXErr: Synthesizing Clinically Meaningful Errors in Diagnostic Radiology Reports. In Biocomputing 2025: Proceedings of the Pacific Symposium 2024 (pp. 70-81).
  12. GitHub Link: https://github.com/rajpurkarlab/ReXErr-V1 [Accessed on 3/6/2025]

Parent Projects
ReXErr-v1: Clinically Meaningful Chest X-Ray Report Errors Derived from MIMIC-CXR was derived from the MIMIC-CXR database [10]. Please cite it when using this project.
Access

Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.

License (for files):
Open Data Commons Attribution License v1.0


Files

Total uncompressed size: 508.9 MB.

  • ReXErr-report-level (folder)
  • ReXErr-sentence-level (folder)
  • LICENSE.txt (19.9 KB, 2025-03-13)
  • README.md (5.4 KB, 2025-03-09)
  • SHA256SUMS.txt (1019 B, 2025-03-19)
  • clinician-review.csv (93.6 KB, 2025-03-09)
  • data-dictionary.txt (1.3 KB, 2025-03-09)