Database Credentialed Access

CXR-PRO: MIMIC-CXR with Prior References Omitted

Vignav Ramesh, Nathan Chi, Pranav Rajpurkar

Published: Nov. 23, 2022. Version: 1.0.0

When using this resource, please cite:
Ramesh, V., Chi, N., & Rajpurkar, P. (2022). CXR-PRO: MIMIC-CXR with Prior References Omitted (version 1.0.0). PhysioNet.

Additionally, please cite the original publication:

Ramesh, V., Chi, N. A., & Rajpurkar, P. (2022). Improving radiology report generation systems by removing hallucinated references to non-existent priors. arXiv.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


CXR-PRO is an adaptation of the MIMIC-CXR dataset that omits references to prior radiology reports. Consisting of 374,139 free-text radiology reports and associated chest radiographs, CXR-PRO addresses the issue of hallucinated references to priors produced by radiology report generation models. By removing nearly all references to priors in MIMIC-CXR, CXR-PRO enables report generation models trained on it to produce substantially more factually consistent and accurate reports. More generally, this dataset aims to support a wide body of research in medical image analysis and natural language processing. MIMIC-CXR is a de-identified dataset, so no protected health information (PHI) is included.


Writing radiology reports is a tedious and labor-intensive process, requiring trained specialists to conduct in-depth analyses of chest radiographs and create detailed reports of their findings. This process is also inherently restricted by a variety of human limitations, including the experience of the radiologist and availability of medical support staff. Therefore, automatically generating free-text radiology reports from chest radiographs has immense clinical value.

Current deep learning models trained to generate radiology reports from chest X-rays (CXR-RePaiR [1], M2 Trans [2], R2Gen [3], etc.) have achieved relative success in producing complete, consistent, and clinically accurate reports. Nevertheless, these models each have a key limitation: since they are trained on datasets of real-world reports (e.g., MIMIC-CXR [4]) which refer to prior reports, their outputs often contain hallucinated references to non-existent priors [5].

To this end, we propose CXR-PRO, an adapted version of the MIMIC-CXR dataset with prior references omitted. As a proof of concept, we find that the CXR-RePaiR radiology report generation model, when retrained on CXR-PRO, outperforms SOTA baselines on clinical metrics across the board; as such, we expect CXR-PRO to be broadly valuable in enabling current radiology report generation systems to be more directly integrated into clinical pipelines. We introduce CXR-PRO and elaborate on this proof of concept in our paper available at [6].


The creation of MIMIC-CXR required handling three distinct data modalities: electronic health record data, images (chest radiographs), and natural language (free-text reports). Chest radiographs were sourced from the hospital picture archiving and communication system (PACS) in Digital Imaging and Communications in Medicine (DICOM) format, a common format for medical images which facilitates interoperability of many distinct medical devices. Radiology reports for the images were identified and extracted from the hospital EHR system. The reports were de-identified using a rule-based approach based upon prior work combined with a newly developed neural network approach. PHI has been replaced with three consecutive underscores ("___"). Study reports are stored in individual text files named using the anonymous study identifier. Note that one or more image files may be associated with a study, but only one radiology report is written. See the MIMIC-CXR PhysioNet page for additional details.

Removing References to Priors

Given that CXR-PRO is an adaptation of MIMIC-CXR, we did not collect any clinical data, but rather developed methods to automatically remove references to priors in MIMIC-CXR’s radiology reports.

Our primary model, GILBERT, is a fine-tuned BioBERT model that removes prior references at the token level. Specifically, GILBERT casts the prior reference removal process as a named entity recognition (NER) task; the model classifies each token in an input report as either REMOVE (denoting that the token constitutes a reference to a prior and should be removed from the output report) or KEEP (indicating that the token does not constitute a reference to a prior and should be included in the final report). For example, the sentence “hilar prominence suggestive of pulmonary hypertension, unchanged” results in the output “KEEP KEEP KEEP KEEP KEEP KEEP REMOVE”.
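The post-processing step that follows GILBERT's per-token predictions can be illustrated with a short sketch. The function below is a hypothetical helper (not part of the released code) that reconstructs a cleaned report from tokens and KEEP/REMOVE labels; punctuation handling and subword detokenization are omitted for brevity.

```python
def apply_token_labels(tokens, labels):
    """Rebuild a report from tokens and GILBERT-style KEEP/REMOVE labels,
    dropping every token classified as a reference to a prior."""
    kept = [tok for tok, lab in zip(tokens, labels) if lab == "KEEP"]
    return " ".join(kept)

# The example sentence from the text, with "unchanged" labeled REMOVE.
tokens = ["hilar", "prominence", "suggestive", "of",
          "pulmonary", "hypertension", "unchanged"]
labels = ["KEEP"] * 6 + ["REMOVE"]
print(apply_token_labels(tokens, labels))
# hilar prominence suggestive of pulmonary hypertension
```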

To generate the CXR-PRO training set, we pass the entirety of the MIMIC-CXR radiology report corpus through GILBERT. In particular, we (1) split the MIMIC-CXR dataset in accordance with the train-test split provided in the MIMIC-CXR-JPG documentation; (2) extract the impressions section from each radiology report; and (3) run GILBERT on the impressions sections in the training set to remove all references to priors. 

To provide an expert-created evaluation set that does not rely on machine learning models, we recruited a team of medical professionals (one board-certified radiologist and two fourth-year medical students) to create a ground truth compilation of reports without references to priors. In particular, we make use of a randomly selected subset of MIMIC-CXR containing 2,188 images and associated reports. We directed the medical annotators to either remove or rewrite references to priors in the test set reports. For instance, “no interval change from prior CT” is a phrase that can be removed completely, while “heart size is stable” must be changed to a description of the heart’s current state (e.g., “heart size is abnormal”) rather than simply removed.

Data Description

CXR-PRO contains the following files:

├── cxr.h5 
├── mimic_train_impressions.csv 
└── mimic_test_impressions.csv 

The contents of each file are outlined below:

cxr.h5: The MIMIC-CXR chest radiographs used for CXR-PRO, saved in Hierarchical Data Format (HDF).

mimic_train_impressions.csv (contains 371,951 reports): A compilation of the impressions section of each radiology report in the CXR-PRO training set, with references to priors removed. Additional fields include dicom_id, study_id, and subject_id, which link each impressions section to its associated chest radiograph, study, and patient.

mimic_test_impressions.csv (contains 2,188 reports): The expert-edited test set, as described in the Methods section.
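A minimal sketch of working with the CSV files, using only the standard library. The code below builds the CSV contents in memory as a stand-in for mimic_train_impressions.csv; note that the name of the impression-text column ("report") is an assumption, so check the header row of the actual file before relying on it.

```python
import csv
import io

# Synthetic stand-in for mimic_train_impressions.csv. The dicom_id,
# study_id, and subject_id columns are documented above; the "report"
# column name for the impression text is an assumption.
sample = io.StringIO(
    "dicom_id,study_id,subject_id,report\n"
    "abc123,50000001,10000001,No acute cardiopulmonary process.\n"
)

rows = list(csv.DictReader(sample))

# Index impressions by dicom_id to pair each report with its radiograph.
by_dicom = {row["dicom_id"]: row["report"] for row in rows}
print(by_dicom["abc123"])
# No acute cardiopulmonary process.
```

For the real files, replace the in-memory buffer with `open("mimic_train_impressions.csv", newline="")`; the radiographs in cxr.h5 can then be looked up by the same dicom_id keys.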

Usage Notes

The CXR-PRO database was introduced in our paper “Improving Radiology Report Generation Systems By Removing Hallucinated References to Non-existent Priors” [6], where it was used as training data for the report generation model CXR-ReDonE. The dataset has high potential for reuse: nearly any existing report generation model could be retrained on CXR-PRO to yield more factually consistent and complete reports. Users should be aware, however, that CXR-PRO may contain a limited number of ungrammatical sentences generated by GILBERT, most notably verbless phrases such as “The cardiomediastinal and hilar contours.” As for complementary resources, the code used to create CXR-PRO by removing references to priors in radiology reports, as well as to retrain CXR-RePaiR on the CXR-PRO dataset, is publicly available on GitHub at rajpurkarlab/CXR-ReDonE.


The benefits of our work include providing model-agnostic methods and data to broadly improve the accuracy of radiology report generation models. To our knowledge, our project poses no significant risks.

Our research was conducted with IRB approval (IRB22-0364, “A clinically based evaluation system for chest radiograph AI-generated reports”).


We would like to thank Dr. Kibo Yoon, Patricia S. Pile, and Pia G. Alfonso for their central role in developing the CXR-PRO test set with prior references manually removed or replaced with clinically accurate statements.

Conflicts of Interest

We have no conflicts of interest.


  1. Endo, M., Krishnan, R., Krishna, V., Ng, A. Y., & Rajpurkar, P. (2021). Retrieval-Based Chest X-Ray Report Generation Using a Pre-trained Contrastive Language-Image Model. In S. Roy, S. Pfohl, E. Rocheteau, G. A. Tadesse, L. Oala, F. Falck, Y. Zhou, L. Shen, G. Zamzmi, P. Mugambi, A. Zirikly, M. B. A. McDermott, & E. Alsentzer (Eds.), Proceedings of Machine Learning for Health (Vol. 158, pp. 209–219). PMLR.
  2. Miura, Y., Zhang, Y., Tsai, E., Langlotz, C., & Jurafsky, D. (2021). Improving factual completeness and consistency of image-to-text radiology report generation. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 5288–5304.
  3. Chen, Z., Song, Y., Chang, T. H., & Wan, X. (2020). Generating radiology reports via memory-driven transformer. arXiv preprint arXiv:2010.16056.
  4. Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng, S. (2019). MIMIC-CXR Database (version 2.0.0). PhysioNet.
  5. Johnson, A. E., Pollard, T. J., Berkowitz, S. J., Greenbaum, N. R., Lungren, M. P., Deng, C. Y., ... & Horng, S. (2019). MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1), 1-8.
  6. Ramesh, V., Chi, N. A., & Rajpurkar, P. (2022). Improving Radiology Report Generation Systems by Removing Hallucinated References to Non-existent Priors. arXiv preprint arXiv:2210.06340.

Parent Projects
CXR-PRO: MIMIC-CXR with Prior References Omitted was derived from MIMIC-CXR. Please cite it when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
