Database Open Access

Radiology Report Generation Models Evaluation Dataset For Chest X-rays (RadEvalX)

Amos Rubin Calamida, Farhad Nooralahzadeh, Morteza Rohanian, Mizuho Nishio, Koji Fujimoto, Michael Krauthammer

Published: June 18, 2024. Version: 1.0.0


When using this resource, please cite:
Calamida, A. R., Nooralahzadeh, F., Rohanian, M., Nishio, M., Fujimoto, K., & Krauthammer, M. (2024). Radiology Report Generation Models Evaluation Dataset For Chest X-rays (RadEvalX) (version 1.0.0). PhysioNet. https://doi.org/10.13026/tp88-q278.

Additionally, please cite the original publication:

Calamida, A., Nooralahzadeh, F., Rohanian, M., Fujimoto, K., Nishio, M., & Krauthammer, M. (2023). Radiology-Aware Model-Based Evaluation Metric for Report Generation. arXiv:2311.16764.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

The Radiology Report Generation Models Evaluation Dataset For Chest X-rays (RadEvalX) is publicly available and was developed similarly to the ReXVal dataset. Like ReXVal, RadEvalX focuses on radiologist evaluations of errors found in automatically generated radiology reports. The dataset includes annotations from two board-certified radiologists, who identified clinically significant and clinically insignificant errors across eight error categories. The evaluations were performed on candidate radiology reports compared against the ground-truth reports from the IU-Xray dataset. For each of the 100 studies and corresponding ground-truth reports, the dataset contains one report generated from the corresponding X-ray image using the M2Tr model. The radiologists then annotated these reports. The primary purpose of this dataset is to assess the correlation between automated metrics and human radiologists' scoring, explore the limitations of automated metrics, and develop a model-based automated metric. This dataset has been created to support further research in medical artificial intelligence (AI), particularly in the field of radiology.


Background

Evaluation metrics are essential to assess the performance of Natural Language Generation (NLG) systems. Although traditional metrics are widely used due to their simplicity, they have limitations in their correlation with human judgments, leading to the need for newer evaluation metrics [1, 2, 3]. However, the literature has not widely adopted newer metrics due to poor explainability and lack of benchmarking [4]. In the medical image report generation domain, several new metrics have been developed, including medical abnormality terminology detection [5], MeSH accuracy [6], medical image report quality index [7], and anatomical relevance score [8]. These metrics aim to establish more relevant evaluation measures than traditional metrics such as BLEU. However, despite the existence of these alternatives, newer publications still rely on traditional metrics, leading to less meaningful evaluations of specialized tasks [9].
Radiology reports are narratives that should accurately reflect essential properties of the entities depicted in the scan. These reports consist of multiple sentences describing, among other things, the position and severity of abnormalities, with concluding remarks summarizing the most prominent observations. The task of radiology report generation is challenging due to its unique characteristics and the need for accurate clinical descriptions [10]. However, current metrics like BLEU do not capture these specific properties, highlighting the need for domain-specific metrics that consider the unique requirements of radiology reports.
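
To make this limitation concrete, the short sketch below (our illustration, not taken from the cited works; it assumes NLTK is installed and uses invented example sentences) scores two candidate sentences against a reference with NLTK's BLEU implementation. The candidate that drops a negation, and thus reverses the clinical meaning, can score as high as or higher than a clinically faithful candidate that merely omits one word.

```python
# Illustration only: n-gram overlap does not capture clinical meaning.
# The example sentences are invented for demonstration purposes.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "no evidence of pneumothorax or pleural effusion".split()
faithful = "no evidence of pneumothorax or effusion".split()               # omits one word
negation_dropped = "evidence of pneumothorax or pleural effusion".split()  # opposite meaning

smooth = SmoothingFunction().method1
for name, candidate in [("faithful", faithful), ("negation dropped", negation_dropped)]:
    score = sentence_bleu([reference], candidate, smoothing_function=smooth)
    print(f"{name}: BLEU = {score:.3f}")
```

Because the negation-dropped candidate shares almost all its n-grams with the reference, BLEU rewards it despite the reversed finding, which is exactly the kind of error a radiology-aware metric should penalize.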

At a high level of abstraction, we attempt to answer the following main research questions in our paper [11]:

  • Can an existing successful metric model architecture be adapted and optimized to develop a novel radiology-specific metric for evaluating the quality and accuracy of automatically generated radiology reports?
  •  To what extent does the integration of radiology-aware knowledge impact the precision and dependability of the assessment metric in evaluating the efficacy and accuracy of automatically generated radiology reports?

To this end, we propose an automated metric for assessing radiology report generation models. It builds on existing metrics designed for other domains, including model-based metrics like COMET (Crosslingual Optimized Metric for Evaluation of Translation) [12] and traditional metrics like SPIDEr (Semantic Propositional Image Description Evaluation) [13] or BLEU [14], by incorporating a radiology-specific knowledge graph called RadGraph [15].

The contributions of our work [11] are outlined below:

  • We design an evaluation model tailored explicitly for assessing radiology reports generated by generative models. By incorporating domain-specific knowledge from RadGraph, a radiology-aware knowledge graph, we aim to enhance the accuracy and relevance of the assessment.
  •  We evaluate the proposed strategy by applying it to a set of radiology reports generated by two models. We use the IU X-ray dataset of ground truth radiology reports and compare the automated scores obtained using our framework with the scores of other established metrics.
  •  We perform an error analysis study with radiology experts that examines the discrepancies between the generated and the ground-truth reports using the RadEvalX and ReXVal [16] datasets. This analysis allows us to further assess the quality of our metric against human judgment.

Our work focuses on developing a novel evaluation metric for assessing the quality and precision of automatically generated radiology reports. We propose an evaluation model incorporating domain-specific knowledge from a radiology-aware knowledge graph. We train this evaluation model using two corpora, the Best Match corpus and the Top 10% corpus, which contain pairs of ground-truth reports that are similar in terms of their RadGraph representation. We evaluate the performance of the model on a test set and compare it to other established metrics such as BLEU, BERTScore, CheXbert, RadGraph F1, and RadCliQ, and we find that our model performs well and correlates highly with these metrics. Additionally, when using the ReXVal [16] dataset of human annotations to measure alignment with human judgment, we find a high correlation that even surpasses RadCliQ for most report pairs. In our own human annotation study (i.e., the RadEvalX dataset), we did not find a similarly high correlation with our annotators; still, when comparing the other metrics' agreement with the same human scores, our metric performed better in some cases. The data and insights shared in our work aim to contribute to the advancement of evaluation metrics in the context of radiology report generation. More details are available in [11].
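
As a rough illustration of the pairing idea behind the Best Match corpus (this is not the paper's implementation; the entity sets are invented and the plain Jaccard similarity is our own simplification), one could match each report to the ground-truth report whose RadGraph entities overlap most:

```python
# Simplified illustration of pairing reports by RadGraph entity overlap.
# Not the paper's code: the entity sets below are invented, and similarity is a
# plain Jaccard measure over (entity text, RadGraph label) tuples.

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical RadGraph-style entity sets keyed by report id.
radgraph_entities = {
    "r1": {("opacity", "OBS-DP"), ("lung", "ANAT-DP")},
    "r2": {("opacity", "OBS-DP"), ("lung", "ANAT-DP"), ("effusion", "OBS-DA")},
    "r3": {("cardiomegaly", "OBS-DP"), ("heart", "ANAT-DP")},
}

# "Best Match"-style pairing: keep the most similar other report for each report.
for rid, ents in radgraph_entities.items():
    partner, sim = max(
        ((other, jaccard(ents, other_ents))
         for other, other_ents in radgraph_entities.items() if other != rid),
        key=lambda pair: pair[1],
    )
    print(f"{rid} -> {partner} (Jaccard = {sim:.2f})")
```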


Methods

To further investigate the alignment of automated evaluation metrics with radiologists, we created a balanced dataset of 100 reports from the IU-Xray dataset [17] and corresponding reports generated using M2Tr [18] for human annotation. The final dataset comprised 80 abnormal and 20 normal reports. We were inspired by the work of [19, 16], in which the authors asked a radiologist to count the number of clinically significant and insignificant errors observed in the predicted report for each pair of prediction and ground truth and to assign them to error categories.

Data

We created a balanced dataset of 100 reports for human annotation from an initial set of 590 reports generated using M2Tr [18]. To balance the dataset, the reports were categorized into low, average, and high groups based on the 0.33 quantiles of the RadCliQ metric score, and 150 reports were then randomly sampled from each group. The reports were further filtered to separate normal and abnormal categories, excluding those labeled as normal in the 'mesh-0' column and removing reports with empty 'mesh-1' values. The remaining abnormal reports were then filtered on the 'IMPRESSION' column, removing those containing specific phrases associated with normal reports, such as "no acute findings." The complete list of filters can be found in [11]. The resulting dataset comprised 80 abnormal and 20 normal reports.
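
A minimal sketch of this kind of selection pipeline is shown below, assuming a pandas DataFrame with columns named 'RadCliQ', 'mesh-0', 'mesh-1', and 'IMPRESSION'; the exact column names, label values, and the complete filter list are those documented in [11], not the ones hard-coded here.

```python
# Sketch of the quantile-based grouping and normal/abnormal filtering described
# above. Column names and label values are assumptions for illustration; see [11]
# for the exact filters used to build RadEvalX.
import pandas as pd

def sample_by_radcliq(df: pd.DataFrame, per_group: int = 150, seed: int = 0) -> pd.DataFrame:
    """Split reports into low/average/high groups at the 0.33 and 0.66 quantiles
    of the RadCliQ score and sample up to `per_group` reports from each group."""
    low_q, high_q = df["RadCliQ"].quantile([0.33, 0.66])
    groups = pd.cut(df["RadCliQ"],
                    bins=[-float("inf"), low_q, high_q, float("inf")],
                    labels=["low", "average", "high"])
    return (df.groupby(groups, observed=True, group_keys=False)
              .apply(lambda g: g.sample(min(per_group, len(g)), random_state=seed)))

def split_normal_abnormal(df: pd.DataFrame):
    """Separate normal reports from abnormal ones using the MeSH columns and the
    impression text, mirroring the filtering described in the text."""
    is_normal = df["mesh-0"].str.lower().eq("normal")
    abnormal = df[~is_normal & df["mesh-1"].fillna("").ne("")]
    abnormal = abnormal[~abnormal["IMPRESSION"].str.contains("no acute findings",
                                                             case=False, na=False)]
    return df[is_normal], abnormal
```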

Participants

Two board-certified radiologists annotated the report pairs of model-generated and ground-truth reports. The radiologists were given the instructions provided in [16] before their annotation task. The ground-truth reports were given to the radiologists without the corresponding X-ray images.

Task

Following [19, 16], we asked the radiologists to count the number of clinically significant and insignificant errors observed in the predicted report for each pair of prediction and ground truth and to assign each error to one of the following categories (categories marked with '†' were added by us):

  •  False prediction of finding
  •  Omission of finding
  • Incorrect location/position of finding
  • Incorrect severity of the finding
  • Mention of comparison that is not present in the reference impression
  • Omission of comparison describing a change from a previous study
  • Mention of uncertainty that is not present in the reference †
  • Omission of uncertainty that is present in the reference †

To carry out the study, the two board-certified radiologists first independently identified and extracted the positive findings from the ground-truth reports. The positive findings were then classified into significant and insignificant ones. A comparison was made between the findings extracted from the ground-truth reports and those in the generated reports, and the number of errors in each of the eight predefined categories was counted based on this comparison. Finally, the two radiologists discussed their annotations and reached a consensus for each report.
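
For readers who want to work with these annotations programmatically, the following is a minimal sketch of one consensus record, with illustrative field and category names (the actual released column names are described under Data Description below):

```python
# Minimal sketch of a consensus annotation record. Field and category names are
# illustrative; the released CSV column names are described in Data Description.
from dataclasses import dataclass, field
from typing import Dict

ERROR_CATEGORIES = [
    "false_prediction_of_finding",
    "omission_of_finding",
    "incorrect_location_of_finding",
    "incorrect_severity_of_finding",
    "spurious_comparison",
    "omitted_comparison",
    "spurious_uncertainty",   # added category (†)
    "omitted_uncertainty",    # added category (†)
]

def _zero_counts() -> Dict[str, int]:
    return {category: 0 for category in ERROR_CATEGORIES}

@dataclass
class ConsensusAnnotation:
    report_id: str
    ground_truth: str
    generated: str
    # Agreed error counts per category, kept separately for clinically
    # significant and clinically insignificant errors.
    significant: Dict[str, int] = field(default_factory=_zero_counts)
    insignificant: Dict[str, int] = field(default_factory=_zero_counts)

    def total_errors(self) -> int:
        return sum(self.significant.values()) + sum(self.insignificant.values())
```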


Data Description

The dataset contains the consensus of two radiologists' evaluations of clinically significant and clinically insignificant errors, under eight error categories, for radiology reports generated by the M2Tr model [18] against ground-truth reports from the IU-Xray dataset [17].

The dataset is organized as follows:

  1. RadEval_clinically_significant_errors.csv: Each row corresponds to the consensus annotation of the two radiologists for one report, covering errors judged clinically significant. It contains a "report_id" identical to that of the origin dataset (i.e., IU-Xray [17]), the ground-truth report (column "ground_truth"), and the report generated by the M2Tr model (column "M2Tr-Generation"). The remaining eight columns correspond to the error categories defined in the Methods section (numbered 1 to 8); each holds the number of errors the radiologists counted in the generated report for that category.
  2. RadEval-clinically_insignificant_errors.csv: Same structure as RadEval_clinically_significant_errors.csv, but the error counts refer to errors judged clinically insignificant.
  3. metrcis_scores_m2tr.csv: For each report pair in the dataset, we compute the values of established metrics such as BLEU, BERTScore, CheXbert, RadGraph F1, and RadCliQ and report them in this file (a loading example follows this list).
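
As an example of the dataset's intended use, i.e., checking how well automated metrics align with the radiologists' error counts, the sketch below loads the three files, sums the consensus error counts per report, and correlates each metric with that total. The metric column names and the presence of a shared "report_id" column in metrcis_scores_m2tr.csv are assumptions here and should be checked against the actual file headers.

```python
# Sketch: correlate automated metric scores with the radiologists' total error
# counts. Metric column names and the "report_id" join key are assumptions;
# inspect the CSV headers before running.
import pandas as pd
from scipy.stats import kendalltau

sig = pd.read_csv("RadEval_clinically_significant_errors.csv")
insig = pd.read_csv("RadEval-clinically_insignificant_errors.csv")
metrics = pd.read_csv("metrcis_scores_m2tr.csv")

# Everything except the id and report-text columns holds per-category error counts.
meta_cols = {"report_id", "ground_truth", "M2Tr-Generation"}
error_cols = [c for c in sig.columns if c not in meta_cols]

# Total consensus errors per report = significant + insignificant counts.
totals = (sig.set_index("report_id")[error_cols].sum(axis=1)
          .add(insig.set_index("report_id")[error_cols].sum(axis=1), fill_value=0)
          .rename("total_errors"))

merged = metrics.merge(totals.reset_index(), on="report_id")
for metric in ["BLEU", "BERTScore", "CheXbert", "RadGraph F1", "RadCliQ"]:
    if metric in merged.columns:  # column names may differ in the released file
        tau, p_value = kendalltau(merged[metric], merged["total_errors"])
        print(f"{metric}: Kendall tau = {tau:.3f} (p = {p_value:.3g})")
```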

Usage Notes

This dataset builds on the IU-Xray dataset [17]; any use of this dataset must cite the IU-Xray paper [17].

This dataset has been used in [11] to investigate the alignment of automated evaluation metrics with radiologists. Note that, unlike the ReXVal dataset [16], this dataset was created using radiology reports generated by the M2Tr model [18] and can therefore serve as a representative of generated reports that differ significantly from their ground-truth reports.

One limitation of our study is that different radiologists evaluating the reports often gave different scores, even though we sought to make the evaluation scheme objective and consistent. This variability among radiologists is a common issue when using subjective ratings from clinicians. It suggests that our evaluation scheme may have limitations and that evaluating radiology reports objectively can be challenging.


Release Notes

This is the first public release of the dataset.


Ethics

We exclusively utilized publicly available datasets (i.e., IU-Xray [17]) that are anonymized and de-identified, addressing privacy concerns. 


Conflicts of Interest

The authors declare no competing non-financial interests.


References

  1. Kathrin Blagec, Georg Dorffner, Milad Moradi, Simon Ott, and Matthias Samwald (2022). “A global analysis of metrics used for measuring performance in natural language processing”. In Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP, pages 52–63, Dublin, Ireland. Association for Computational Linguistics
  2. Ananya B. Sai, Akash Kumar Mohankumar, and Mitesh M. Khapra (2022). “A survey of evaluation metrics used for NLG systems”. ACM Comput. Surv., 55(2)
  3. Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser (2017). “Why we need new evaluation metrics for NLG”. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252, Copenhagen, Denmark. Association for Computational Linguistics
  4. Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, and Steffen Eger (2022). “Towards explainable evaluation metrics for natural language generation”. ArXiv, abs/2203.11131
  5. Christy Y. Li, Xiaodan Liang, Zhiting Hu, and Eric P. Xing (2018). “Hybrid retrieval-generation reinforced agent for medical image report generation”. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 1537–1547, Red Hook, NY, USA. Curran Associates Inc.
  6. Xin Huang, Fengqi Yan, Wei Xu, and Maozhen Li (2019). “Multi-attention and incorporating background information model for chest x-ray image report generation”. IEEE Access, 7:154808–154817
  7. Yixiao Zhang, Xiaosong Wang, Ziyue Xu, Qihang Yu, Alan Yuille, and Daguang Xu (2020). “When radiology report generation meets knowledge graph”. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):12910–12917
  8. Mohammad Alsharid, Harshita Sharma, Lior Drukker, Pierre Chatelain, Aris T. Papageorghiou, and J. Alison Noble (2019). “Captioning ultrasound images automatically”. In Medical Image Computing and Computer Assisted Intervention – MICCAI, pages 338–346, Cham. Springer International Publishing
  9. Pablo Messina, Pablo Pino, Denis Parra, Alvaro Soto, Cecilia Besa, Sergio Uribe, Marcelo Andía, Cristian Tejos, Claudia Prieto, and Daniel Capurro (2022). “A survey on deep learning and explainability for automatic report generation from medical images”. ACM Comput. Surv., 54(10s).
  10. Curtis P Langlotz (2015). “Radiology report: a guide to thoughtful communication for radiologists and other medical professionals”. Independent Publishing Platform, San Bernardino, CA.
  11. Amos Calamida and Farhad Nooralahzadeh and Morteza Rohanian and Koji Fujimoto and Mizuho Nishio and Michael Krauthammer (2023). “Radiology-Aware Model-Based Evaluation Metric for Report Generation”. ArXiv 2311.16764
  12. Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie (2020). “COMET: A neural framework for MT evaluation”. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
  13. Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy (2017). “Improved image captioning via policy gradient optimization of SPIDEr”. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE.
  14. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002). “BLEU: A method for automatic evaluation of machine translation”. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL ’02, page 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics
  15. Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven QH Truong, Du Nguyen Duong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P Lungren, Andrew Y Ng, Curtis P Langlotz, and Pranav Rajpurkar (2021). “RadGraph: Extracting clinical entities and relations from radiology reports”. PhysioNet.
  16. Feiyang Yu, Mark Endo, Rayan Krishnan, Ian Pan, Andy Tsai, Eduardo Pontes Reis, Eduardo Kaiser Ururahy Nunes Fonseca, Henrique Lee, Zahra Shakeri, Andrew Ng, Curtis Langlotz, Vasantha Kumar Venugopal, and Pranav Rajpurkar (2023). “Radiology Report Expert Evaluation (ReXVal) Dataset (version 1.0.0)”. PhysioNet
  17. Demner-Fushman, D., Kohli, M. D., Rosenman, M. B., Shooshan, S. E., Rodriguez, L., Antani, S., Thoma, G. R., & McDonald, C. J. (2016). Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association : JAMIA, 23(2), 304–310. https://doi.org/10.1093/jamia/ocv080
  18. Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara (2020). “Meshed-Memory Transformer for Image Captioning”. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
  19. Feiyang Yu, Mark Endo, Rayan Krishnan, Ian Pan, Andy Tsai, Eduardo Pontes Reis, Eduardo Kaiser Ururahy Nunes Fonseca, Henrique Min Ho Lee, Zahra Shakeri Hossein Abad, Andrew Y. Ng, Curtis P. Langlotz, Vasantha Kumar Venugopal, and Pranav Rajpurkar (2022). “Evaluating progress in automatic chest x-ray radiology report generation”. medRxiv

Access

Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.

License (for files):
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License

Discovery

DOI (version 1.0.0):
https://doi.org/10.13026/tp88-q278

DOI (latest version):
https://doi.org/10.13026/72dp-f846


Files

Total uncompressed size: 143.9 KB.

Name | Size | Modified
LICENSE.txt | 0 B | 2024-06-05
RadEval-clinically_insignificant_errors.csv | 40.6 KB | 2024-01-10
RadEval_clinically_significant_errors.csv | 40.6 KB | 2024-01-10
SHA256SUMS.txt | 382 B | 2024-06-18
metrcis_scores_m2tr.csv | 62.3 KB | 2024-01-10