Database Credentialed Access

ReFiSco: Report Fix and Score Dataset for Radiology Report Generation

Katherine Tian Sina J Hartung Andrew A Li Jaehwan Jeong Fardad Behzadi Juan Calle-Toro Subathra Adithan Michael Pohlen David Osayande Pranav Rajpurkar

Published: Aug. 23, 2023. Version: 0.0

When using this resource, please cite:
Tian, K., Hartung, S. J., Li, A. A., Jeong, J., Behzadi, F., Calle-Toro, J., Adithan, S., Pohlen, M., Osayande, D., & Rajpurkar, P. (2023). ReFiSco: Report Fix and Score Dataset for Radiology Report Generation (version 0.0). PhysioNet.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Abstract

Automated generation of clinically accurate radiology reports can improve patient care. To improve automatic report generation, it is helpful to understand which types of errors are common in generated reports. We therefore introduce the Report Fix and Score Dataset for Radiology Reports (ReFiSco-v0), which was collected through an institutional review board-approved study. In this study, we recruited radiologists to provide expert evaluations of a subset of 60 studies from MIMIC-CXR. For each radiology image, we compiled three reports: one generated by the model X-REM, one generated by the model CXR-RePaiR trained on the same MIMIC-CXR training set, and one taken from a human benchmark (MIMIC-CXR). To each radiologist, we presented one image and one report for each of the 60 studies, with each report chosen randomly and independently from one of the three sources. Radiologists were blinded to the source and asked to assess the error severity of their assigned reports.


Background

Automated generation of clinically accurate chest radiology reports can improve patient care. Current state-of-the-art approaches to automated report generation leverage deep learning methods trained on paired radiology image and report data to generate text reports from images [1-6]. Report generation models are often evaluated with automated quantitative metrics, such as the BLEU score (a natural language metric) or the F1 score (a clinical accuracy metric). However, how well these automated metrics align with human radiologists' evaluations of reports remains an open research question [7]. Alignment of generated radiology reports with human expert evaluation is critical for the reliable and safe deployment of automated report generation systems.

In this study, we release a dataset called the Report Fix and Score Dataset for Radiology Reports (ReFiSco-v0) that aims to help align machine-generated radiology reports with human expert evaluations. Our dataset is a collection of generated and human-written reports for a subset of chest X-ray images from MIMIC-CXR [8] radiology studies. To obtain generated reports, we consider a recent state-of-the-art report generation model called X-REM [1] and a baseline model called CXR-RePaiR [4]. Both are retrieval-based models built on contrastive image-text pre-trained representations learned from radiology image and report impression pairs. The impression is a designated section of the text report that summarizes the most important clinical findings from the radiology study. For the human benchmark, we take reports from MIMIC-CXR [8]. While X-REM achieves state-of-the-art quantitative evaluation metrics, we still wish to understand how well those metrics align with human evaluation. The goal of this dataset is to aid other researchers in (1) studying how human radiologist evaluation of these models compares to automated metrics, (2) understanding the gap between current state-of-the-art generation models and human-written reports, and (3) developing similar studies or datasets on a larger scale. The availability of this dataset can foster research efforts in machine learning for radiology, although researchers must also consider and address potential biases when developing machine learning models [9].


Methods

In this section, we describe how the ReFiSco-v0 dataset was created.

Raw Data Description:

For this study, we randomly selected a subset of 60 studies from the MIMIC-CXR dataset with (i) a single frontal view and (ii) no references to prior examinations. For each study, we compile 3 reports: one generated from the model X-REM, one generated from a baseline model CXR-RePaiR, and one written by a human radiologist taken from MIMIC-CXR.

Study Design / Methodology:

We recruited 4 practicing radiologists to annotate these reports. After giving informed consent, the radiologists received a 30-minute orientation and instructions on how to complete their assigned annotation task. Their instructions are available at [10].

To each radiologist, we presented one frontal chest X-ray and one report for each of the 60 studies. Each report was randomly and independently chosen from the X-REM, CXR-RePaiR, or MIMIC-CXR source. The radiologist was blinded to the source of the report.

Then, the radiologist was asked to annotate the report for each image, marking each line of the report with one of 5 error categories. In increasing order of severity, the categories are 'No error', 'Not actionable', 'Actionable nonurgent error', 'Urgent error', and 'Emergent error'. In addition, the radiologist was asked to correct each error using deletion, substitution, or insertion of a line. Note: a generated comparison to any prior study was considered a non-actionable error.
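Because the five categories are ordered by severity, analyses such as "worst error in a report" reduce to taking a maximum over an ordinal encoding. The following is a minimal sketch; the numeric values 0-4 are our own illustrative encoding, not part of the dataset specification.

```python
# Illustrative ordinal encoding of the five severity categories (0 = best, 4 = worst).
SEVERITY = {
    "No error": 0,
    "Not actionable": 1,
    "Actionable nonurgent error": 2,
    "Urgent error": 3,
    "Emergent error": 4,
}

def report_max_severity(line_labels):
    """Return the most severe category among a report's line-level labels."""
    return max(line_labels, key=lambda label: SEVERITY[label])

# Example: a report whose worst line-level error is an urgent one.
print(report_max_severity(["No error", "Urgent error", "Not actionable"]))
```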

Data Analysis:

We also provide a data analysis Jupyter notebook that computes properties and insights from this dataset, such as average and maximum report error severity, inter-annotator agreement, and statistical tests between categories of reports. The notebook requires only a few common packages: Pandas, NumPy, SciPy, and Matplotlib.

Data Description

ReFiSco-v0 contains two files, which are also provided in this GitHub repository [11] for version control:

  1. refisco-v0.csv
  2. Data_Analysis.ipynb

1. refisco-v0.csv contains the radiologists’ error annotations. Each radiology report is broken down into lines, and consecutive rows of the CSV with the same study id describe lines that constitute one report annotation. Each row of the CSV contains the following columns:

  • id: the study id from MIMIC-CXR
  • indication: the indication for the study corresponding to the row's id if available
  • impression_original: one line of the impression from the report from this study, unedited 
  • impression_edited: a radiologist's revision of the original line
  • score: a radiologist's severity score of the revision
  • annotator: the annotator id
  • source: the report source (either the name of the report-generation model or 'expert' which represents a human-written report)
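Since consecutive rows with the same study id make up one report annotation, reconstructing full reports is a simple group-and-join over the columns above. A sketch is shown below; the inline DataFrame is a stand-in with made-up values (the real data is loaded with `pd.read_csv("refisco-v0.csv")`), and grouping by annotator as well as study id assumes each (study, annotator) pair identifies one report annotation.

```python
import pandas as pd

# Stand-in rows mimicking refisco-v0.csv (values are illustrative, not real data).
# In practice: df = pd.read_csv("refisco-v0.csv")
df = pd.DataFrame({
    "id": ["s1", "s1", "s2"],
    "indication": ["Cough", "Cough", "Fever"],
    "impression_original": ["No acute process.", "Lungs clear.", "Effusion noted."],
    "impression_edited": ["No acute process.", "Lungs are clear.", "Small effusion noted."],
    "score": ["No error", "Not actionable", "Urgent error"],
    "annotator": [1, 1, 2],
    "source": ["X-REM", "X-REM", "expert"],
})

# Consecutive rows sharing a study id (and annotator) form one report annotation;
# join the per-line edits back into a full edited report.
reports = (
    df.groupby(["id", "annotator", "source"], sort=False)["impression_edited"]
      .apply(lambda lines: "\n".join(lines))
      .reset_index(name="edited_report")
)
print(reports)
```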

2. Data_Analysis.ipynb is an example Jupyter notebook that loads refisco-v0.csv and performs data analysis on the annotated errors. Future users of this dataset can take inspiration from code snippets in Data_Analysis.ipynb to study report error severity, inter-annotator agreement, statistical tests between categories of reports, or other quantities of interest.
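As an indication of the kind of analysis the notebook performs, the sketch below computes per-report maximum error severity and compares severity distributions between two report sources with a Mann-Whitney U test. The ordinal encoding of the severity categories and the inline toy data are our own illustrative assumptions; the real table is loaded with `pd.read_csv("refisco-v0.csv")`, and the statistical test is only meaningful with realistic sample sizes.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Illustrative ordinal encoding of the severity categories (0 = best, 4 = worst).
severity = {
    "No error": 0, "Not actionable": 1, "Actionable nonurgent error": 2,
    "Urgent error": 3, "Emergent error": 4,
}

# Toy stand-in for refisco-v0.csv; in practice: df = pd.read_csv("refisco-v0.csv")
df = pd.DataFrame({
    "id": ["s1", "s1", "s2", "s2", "s3"],
    "score": ["No error", "Urgent error", "No error", "No error", "Not actionable"],
    "annotator": [1, 1, 2, 2, 3],
    "source": ["X-REM", "X-REM", "expert", "expert", "CXR-RePaiR"],
})
df["severity"] = df["score"].map(severity)

# Worst error per report annotation (one report = one study/annotator pair).
per_report = df.groupby(["id", "annotator", "source"])["severity"].max().reset_index()
print(per_report)

# Compare severity distributions between two sources
# (with real data, each group would contain many reports).
xrem = per_report.loc[per_report["source"] == "X-REM", "severity"]
expert = per_report.loc[per_report["source"] == "expert", "severity"]
stat, p = mannwhitneyu(xrem, expert, alternative="two-sided")
print(stat, p)
```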

Usage Notes

This dataset was originally introduced in "Multimodal Image-Text Matching Improves Retrieval-based Chest X-Ray Report Generation" (MIDL 2023) [1], where it was used to provide additional human evaluation for the X-REM model to supplement automatic metrics. As an example, we provide `Data_Analysis.ipynb` which contains the code for the data analysis that was done in the paper.

Others interested in studying types of errors and error severity in generated reports can further analyze this dataset. However, potential users should note that the dataset is small and can provide initial but by no means comprehensive insight into the types of mistakes that human experts observe in radiology reports.


Ethics

Our research was conducted with IRB approval (IRB22-0364, “A clinically based evaluation system for chest radiograph AI-generated reports”). The collection of this dataset provides a framework for in-depth human evaluation of generated radiology reports, as opposed to relying on computational metrics such as the BLEU score or F1 score. We hope to provide a step towards safely using generated reports in clinical use cases. However, v0 is a preliminary study, with a small subset of MIMIC-CXR reports and a limited number of participating radiologists. Further progress could be made in standardizing and aligning different radiologists' opinions.

Conflicts of Interest

We have no conflicts of interest.


References

  1. Jeong, J., Tian, K., Li, A., Hartung, S., Adithan, S., Behzadi, F., Calle, J., Osayande, D., Pohlen, M., & Rajpurkar, P. (2023). Multimodal Image-Text Matching Improves Retrieval-based Chest X-Ray Report Generation. In Medical Imaging with Deep Learning. Retrieved from
  2. Chen, Z., Song, Y., Chang, T.-H., & Wan, X. (2020). Generating radiology reports via memory-driven transformer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1439-1449). Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.112. Retrieved from
  3. Miura, Y., Zhang, Y., Tsai, E., Langlotz, C., & Jurafsky, D. (2021). Improving factual completeness and consistency of image-to-text radiology report generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 5288-5304). Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.416. Retrieved from
  4. Endo, M., Krishnan, R., Krishna, V., Ng, A. Y., & Rajpurkar, P. (2021). Retrieval-based chest x-ray report generation using a pre-trained contrastive language-image model. In S. Roy, S. Pfohl, E. Rocheteau, G. A. Tadesse, L. Oala, F. Falck, Y. Zhou, L. Shen, G. Zamzmi, P. Mugambi, A. Zirikly, M. B. A. McDermott, & E. Alsentzer (Eds.), Proceedings of Machine Learning for Health (Vol. 158, pp. 209-219). PMLR. Retrieved from
  5. Babar Z, van Laarhoven T, Marchiori E. Encoder-decoder models for chest X-ray report generation perform no better than unconditioned baselines. Plos one. 2021 Nov 29;16(11):e0259639.
  6. Seastedt KP, Moukheiber D, Mahindre SA, Thammineni C, Rosen DT, Watkins AA, Hashimoto DA, Hoang CD, Kpodonu J, Celi LA. A scoping review of artificial intelligence applications in thoracic surgery. European Journal of Cardio-Thoracic Surgery. 2022 Feb 1;61(2):239-48.
  7. Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E. P., Fonseca, E. K. U. N., Lee, H. M. H., Abad, Z. S. H., Ng, A. Y., Langlotz, C. P., Venugopal, V. K., & Rajpurkar, P. (2022). Evaluating progress in automatic chest x-ray radiology report generation. medRxiv. doi: 10.1101/2022.08.30.22279318. Retrieved from
  8. Johnson, A. E. W., Pollard, T. J., Greenbaum, N. R., Lungren, M. P., Deng, C.-Y., Peng, Y., Lu, Z., Mark, R. G., Berkowitz, S. J., & Horng, S. (2019). Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. Retrieved from
  9. Nazer LH, Zatarah R, Waldrip S, Ke JX, Moukheiber M, Khanna AK, Hicklen RS, Moukheiber L, Moukheiber D, Ma H, Mathur P. Bias in artificial intelligence algorithms and recommendations for mitigation. PLOS Digital Health. 2023 Jun 22;2(6):e0000278.
  10. Radiologist Instructions: [Accessed on 7/11/2023]
  11. GitHub Link: [Accessed on 7/11/2023]

Parent Projects
ReFiSco: Report Fix and Score Dataset for Radiology Report Generation was derived from parent projects; please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
