Database Credentialed Access

# VinDr-CXR: An open dataset of chest X-rays with radiologist annotations

Published: June 22, 2021. Version: 1.0.0

Nguyen, H. Q., Pham, H. H., Tuan Linh, L., Dao, M., & Khanh, L. (2021). VinDr-CXR: An open dataset of chest X-rays with radiologist annotations (version 1.0.0). PhysioNet. https://doi.org/10.13026/3akn-b287.

Ha Q. Nguyen, Khanh Lam, Linh T. Le, Hieu H. Pham, Dat Q. Tran, Dung B. Nguyen, Dung D. Le, Chi M. Pham, Hang T. T. Tong, Diep H. Dinh, Cuong D. Do, Luu T. Doan, Cuong N. Nguyen, Binh T. Nguyen, Que V. Nguyen, Au D. Hoang, Hien N. Phan, Anh T. Nguyen, Phuong H. Ho, Dat T. Ngo, Nghia T. Nguyen, Nhan T. Nguyen, Minh Dao, Van Vu. "VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations." arXiv preprint arXiv:2012.15029 (2020).

Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

## Abstract

We describe here a dataset of more than 100,000 chest X-ray scans that were retrospectively collected from two major hospitals in Vietnam. Out of this raw data, we release 18,000 images that were manually annotated by a total of 17 experienced radiologists with 22 local labels of rectangles surrounding abnormalities and 6 global labels of suspected diseases. The released dataset is divided into a training set of 15,000 and a test set of 3,000. Each scan in the training set was independently labeled by 3 radiologists, while each scan in the test set was labeled by the consensus of 5 radiologists. All images are in DICOM format and the labels from training and test sets are made publicly available.

## Background

Building high-quality datasets of annotated images is costly and time-consuming due to several constraints: (1) medical data are hard to retrieve from hospitals or medical centers; (2) manual annotation by physicians is both time-consuming and expensive; (3) the annotation of medical images requires a consensus of several expert readers to overcome human error; and (4) there is a lack of efficient labeling frameworks for managing and annotating large-scale medical datasets.

Most existing chest radiograph (also known as the chest x-ray or CXR) datasets [3,8,10,11,12] depend on automated rule-based labelers that either use keyword matching [3,10] or an NLP model [17] to extract disease labels from free-text radiology reports. These tools can produce labels on a large scale but, at the same time, introduce a high rate of inconsistency, uncertainty, and errors [13, 18]. These noisy labels may lead to the deviation of deep learning-based algorithms from reported performances when evaluated in a real-world setting [19]. Furthermore, the report-based approaches only associate a CXR image with one or several labels in a predefined list of findings and diagnoses without identifying their locations.

There are a few CXR datasets that include annotated locations of abnormalities but they are either too small for training deep learning models or not detailed enough. The interpretation of a CXR is not all about image-level classification; it is even more important, from the perspective of a radiologist, to localize the abnormalities on the image. This partly explains why the applications of computer-aided detection (CAD) systems for CXR in clinical practice are still very limited.

In an effort to provide a large CXR dataset with high-quality labels for the research community, we have built the VinDr-CXR dataset from more than 100,000 raw images in DICOM format that were retrospectively collected from the Hospital 108 (H108) and the Hanoi Medical University Hospital (HMUH), two of the largest hospitals in Vietnam. The published dataset consists of 18,000 postero-anterior (PA) view CXR scans that come with both the localization of critical findings and the classification of common thoracic diseases. These images were annotated by a group of 17 radiologists with at least 8 years of experience for the presence of 22 critical findings (local labels) and 6 diagnoses (global labels); each finding is localized with a bounding box. The local and global labels correspond to the “Findings” and “Impressions” sections, respectively, of a standard radiology report.

We divide the dataset into two parts: the training set of 15,000 scans and the test set of 3,000 scans. Each image in the training set was independently labeled by 3 radiologists, while the annotation of each image in the test set was even more carefully treated and obtained from the consensus of 5 radiologists. The labeling process was performed via an in-house system called VinDr Lab (https://vindr.ai/vindr-lab), which was built on top of a Picture Archiving and Communication System (PACS). All DICOM images and the labels of the training and test sets are released.

VinDr-CXR, to the best of our knowledge, is currently the largest public CXR dataset with radiologist-generated annotations in both training and test sets. We believe the dataset will accelerate the development and evaluation of new machine learning models for both localization and classification of thoracic lesions and diseases on CXR scans.

## Methods

The building of the VinDr-CXR dataset is divided into three main steps: (1) data collection, (2) data filtering, and (3) data labeling. Between 2018 and 2020, we retrospectively collected more than 100,000 CXRs in DICOM format from local PACS servers of two hospitals in Vietnam, the HMUH and H108. Imaging data were acquired from a wide diversity of scanners from well-known medical equipment manufacturers, including Philips, GE, Fujifilm, Siemens, Toshiba, Canon, and Samsung. The ethical clearance of this study was approved by the Institutional Review Board (IRB) of the HMUH and H108 before the study started. The need for obtaining informed patient consent was waived because this retrospective study did not impact clinical care or workflow at these two hospitals and all patient-identifiable information in the data has been removed.

### Data de-identification

To protect patients’ privacy [20], all personally identifiable information associated with the images has been removed or replaced with random values. Specifically, we ran a Python script that removes all DICOM tags containing protected health information (PHI) [21], such as the patient’s name, date of birth, and ID, and the acquisition date and time. We only retained a limited number of DICOM attributes that are necessary for processing raw images. The full list of DICOM tags that were retained for loading and processing raw images is provided in the supplemental file (supplemental_file_DICOM_tags.pdf). Next, a simple algorithm was implemented to automatically remove textual information appearing on the image data (i.e., pixel annotations that could include patients’ identifiable information). The resulting images were then manually verified to make sure all text was removed before they were digitally sent out of the hospitals’ systems.
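
The tag-removal step described above can be illustrated with a minimal allow-list sketch. This is not the authors' script: the header is modeled as a plain dictionary rather than a real DICOM object, and the `RETAINED_TAGS` set is a hypothetical example (the actual retained attributes are those listed in supplemental_file_DICOM_tags.pdf).

```python
# Hypothetical allow-list for illustration only; the real list of retained
# attributes is given in supplemental_file_DICOM_tags.pdf and may differ.
RETAINED_TAGS = {
    "Rows",
    "Columns",
    "BitsStored",
    "PhotometricInterpretation",
    "PixelSpacing",
}


def deidentify(header: dict) -> dict:
    """Keep only allow-listed attributes; every other attribute is dropped."""
    return {tag: value for tag, value in header.items() if tag in RETAINED_TAGS}


# A made-up header, modeled as a plain dict keyed by DICOM keyword.
header = {
    "PatientName": "DOE^JANE",
    "PatientBirthDate": "19700101",
    "PatientID": "12345",
    "AcquisitionDate": "20190101",
    "Rows": 3072,
    "Columns": 3072,
    "PhotometricInterpretation": "MONOCHROME2",
}

clean = deidentify(header)
print(sorted(clean))
```

An allow-list has the property that any attribute not explicitly named is removed by default, which is one way to read "we only retained a limited number of DICOM attributes".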

All DICOM metadata was parsed and manually reviewed to ensure that all individually identifiable health information of the patients has been removed to meet the U.S. HIPAA [22], the European GDPR [23], as well as the local privacy laws [20]. Pixel values of all CXR scans were also carefully examined. All images were manually reviewed case-by-case by a team of 10 human readers. During this review process, a small number of images containing private textual information that had not been removed by our algorithm were excluded from the dataset. The manual review process also helped identify and discard out-of-distribution samples that the CNN-based classifier used for data filtering was not able to detect.

To control the quality of the labeling process, we developed a set of rules underlying VinDr Lab for automatic verification of radiologist-generated labels. These rules prevent annotators from making mechanical mistakes, such as forgetting to choose global labels or marking lesions on the image while choosing “No finding” as the global label. To ensure complete blinding among annotators, the images were randomly shuffled before being assigned to each of them.
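
The two example rules mentioned above could be expressed roughly as follows. This is a sketch under assumptions, not the VinDr Lab implementation: the function name `check_annotation` and the data shapes (a set of global label names and a list of box records) are hypothetical.

```python
def check_annotation(global_labels: set, boxes: list) -> list:
    """Return a list of rule violations for one radiologist's annotation.

    global_labels: names of the global labels the annotator selected.
    boxes: bounding-box records the annotator drew (any non-empty list
           means at least one lesion was marked).
    """
    errors = []
    # Rule 1: the annotator must choose at least one global label.
    if not global_labels:
        errors.append("no global label selected")
    # Rule 2: "No finding" is incompatible with marked lesions.
    if "No finding" in global_labels and boxes:
        errors.append('"No finding" selected but lesions are marked')
    return errors


print(check_annotation(set(), []))
print(check_annotation({"No finding"}, [{"class_name": "Nodule/Mass"}]))
print(check_annotation({"Pneumonia"}, [{"class_name": "Consolidation"}]))
```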

### Data filtering

The collected raw data consisted mostly of adult postero-anterior (PA) and antero-posterior (AP) CXRs, but also included a significant number of outliers such as images of body parts other than the chest (due to mismatched DICOM tags), pediatric scans, low-quality images, and lateral CXRs. These outliers were automatically excluded from the dataset using a binary classifier, which is a lightweight convolutional neural network (CNN). The training procedure of this classifier is outside the scope of this project.

### Data labeling

The VinDr-CXR dataset was labeled for a total of 28 findings and diagnoses in adult cases: (1) Aortic enlargement, (2) Atelectasis, (3) Cardiomegaly, (4) Calcification, (5) Clavicle fracture, (6) Consolidation, (7) Edema, (8) Emphysema, (9) Enlarged PA, (10) Interstitial lung disease (ILD), (11) Infiltration, (12) Lung cavity, (13) Lung cyst, (14) Lung opacity, (15) Mediastinal shift, (16) Nodule/Mass, (17) Pulmonary fibrosis, (18) Pneumothorax, (19) Pleural thickening, (20) Pleural effusion, (21) Rib fracture, (22) Other lesion, (23) Lung tumor, (24) Pneumonia, (25) Tuberculosis, (26) Other diseases, (27) Chronic obstructive pulmonary disease (COPD), and (28) No finding. These labels were divided into 2 categories: local labels (1–22) and global labels (23–28). Note that no patient in the dataset had a confirmed COVID-19 diagnosis.
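
For readers who want the label taxonomy in machine-readable form, the list above can be written as two Python constants. The spellings follow the text of this section; the exact strings used in the released CSV files may differ, and the helper `label_category` is a hypothetical name introduced for illustration.

```python
# Local labels (1-22): findings localized with bounding boxes.
LOCAL_LABELS = [
    "Aortic enlargement", "Atelectasis", "Cardiomegaly", "Calcification",
    "Clavicle fracture", "Consolidation", "Edema", "Emphysema",
    "Enlarged PA", "Interstitial lung disease (ILD)", "Infiltration",
    "Lung cavity", "Lung cyst", "Lung opacity", "Mediastinal shift",
    "Nodule/Mass", "Pulmonary fibrosis", "Pneumothorax",
    "Pleural thickening", "Pleural effusion", "Rib fracture", "Other lesion",
]

# Global labels (23-28): image-level diagnostic impressions.
GLOBAL_LABELS = [
    "Lung tumor", "Pneumonia", "Tuberculosis", "Other diseases",
    "Chronic obstructive pulmonary disease (COPD)", "No finding",
]


def label_category(name: str) -> str:
    """Return 'local' or 'global' for a label name from the lists above."""
    if name in LOCAL_LABELS:
        return "local"
    if name in GLOBAL_LABELS:
        return "global"
    raise ValueError(f"unknown label: {name}")


print(len(LOCAL_LABELS), len(GLOBAL_LABELS))
print(label_category("Pneumothorax"), label_category("Tuberculosis"))
```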

The local labels should be marked with bounding boxes that localize the findings, while the global labels should reflect the diagnostic impression of the radiologist. This list of labels was suggested by a committee of the most experienced radiologists from the two hospitals. The selection of these labels took two factors into account: first, they are prevalent and second, they can be differentiated on CXRs. To facilitate the labeling process, we designed and built a web-based framework called VinDr Lab and had a team of 17 experienced radiologists remotely annotate the data. All the radiologists participating in the labeling process were certified in diagnostic radiology and received healthcare profession certificates from the Vietnamese Ministry of Health.

A set of 18,000 CXRs was randomly chosen from the filtered data, of which 15,000 scans serve as the training set and the remaining 3,000 form the test set. Each sample in the training set was assigned to 3 radiologists for blind annotation. Additionally, all of the participating radiologists were blinded to relevant clinical information. For the test set, 5 radiologists were involved in a two-stage labeling process. During the first stage, each image was independently annotated by 3 radiologists. In the second stage, 2 other radiologists, who have a higher level of experience, reviewed the annotations of the 3 previous annotators and communicated with each other in order to decide the final labels. The disagreements among initial annotators were carefully discussed and resolved by the 2 reviewers. Finally, the consensus of their opinions served as the ground truth reference.
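
The sampling and assignment scheme for the training set can be sketched as below. Several details are assumptions made only for illustration: the size of the filtered pool (20,000 here), the placeholder image and radiologist identifiers, and the choice to draw 3 distinct radiologists uniformly at random per image. The text above does not specify how radiologists were matched to images.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Placeholder pool of filtered images; the real pool size is not stated.
filtered_ids = [f"img_{i:06d}" for i in range(20_000)]

# Randomly choose 18,000 images, then split them 15,000 / 3,000.
chosen = rng.sample(filtered_ids, 18_000)
train_ids, test_ids = chosen[:15_000], chosen[15_000:]

# 17 radiologists with placeholder identifiers R1..R17.
radiologists = [f"R{i}" for i in range(1, 18)]

# Assumption: each training image goes to 3 distinct radiologists.
assignment = {img: rng.sample(radiologists, 3) for img in train_ids}

print(len(train_ids), len(test_ids))
```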

Once the labeling was completed, the labels of the 18,000 CXRs were exported in JavaScript Object Notation (JSON) format. We then parsed their contents and organized the annotations in a single comma-separated values (CSV) file that contains labels, bounding box coordinates, and their corresponding image IDs. For the training set, each sample comes with the annotations of three different radiologists. For the test set, we only provide the consensus labels of the five radiologists. We have released all images together with the labels of the training set and the test set.

## Data Description

The data is organized into three folders, one for training (15,000 studies [normal: 10,606 studies, abnormal: 4,394 studies]), one for testing (3,000 studies [normal: 2,052 studies, abnormal: 948 studies]) and the other one for annotations.
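
As a quick consistency check, the counts quoted above add up to the stated split sizes, and the fraction of abnormal studies in each split follows directly. The dictionary layout is just a convenience for this sketch.

```python
# Study counts as quoted in the paragraph above.
splits = {
    "train": {"normal": 10_606, "abnormal": 4_394},
    "test": {"normal": 2_052, "abnormal": 948},
}

totals = {}
for name, counts in splits.items():
    total = counts["normal"] + counts["abnormal"]
    totals[name] = total
    print(f"{name}: {total} studies, {counts['abnormal'] / total:.1%} abnormal")
# train: 15000 studies, 29.3% abnormal
# test: 3000 studies, 31.6% abnormal
```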

### Overview

Each image has a unique, anonymous identifier which was encoded from the value of the Service Object Pair (SOP) Instance Unique Identifier (UID) provided by the DICOM tag (0008,0018). The encoding process was supported by the Python hashlib module (see https://github.com/vinbigdata-medical/vindr-cxr/blob/main/de-identification/anonymize.py).

The radiologists’ local annotations of the training set were provided in a CSV file, annotations_train.csv. Each row of the table represents a bounding box with the following attributes: image ID (image_id), radiologist ID (rad_id), label’s name (class_name), and bounding box coordinates (x_min, y_min, x_max, y_max). Here, rad_id encodes the identities of the 17 radiologists, (x_min, y_min) are the coordinates of the box’s upper-left corner, and (x_max, y_max) are the coordinates of the lower-right corner.

Meanwhile, the image-level labels of the training set were stored in a different CSV file, image_labels_train.csv, with the following fields: image ID (image_id), radiologist ID (rad_ID), and labels (labels) for both the findings and diagnoses. Specifically, each image ID goes with a vector of multiple labels (labels) corresponding to different pathologies, in which positive ones were encoded with “1” and negative ones were encoded with “0”.

Similarly, the bounding-box annotations and the image-level labels of the test set were recorded in annotations_test.csv and image_labels_test.csv, respectively. The only difference is that each row in the CSV files of the test set was not associated with a radiologist ID.
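
The identifier encoding mentioned at the start of this overview might look like the sketch below. The 32-character hexadecimal file names in this dataset are consistent with an MD5 digest, but the exact hash function and any salting are assumptions here; the linked anonymize.py is the authoritative reference. The function name `encode_uid` and the example UID are made up.

```python
import hashlib


def encode_uid(sop_instance_uid: str) -> str:
    """Map a SOP Instance UID to an anonymous hexadecimal identifier.

    Assumption: an unsalted MD5 digest, chosen only because it yields
    32 hex characters like the file names in this dataset.
    """
    return hashlib.md5(sop_instance_uid.encode("utf-8")).hexdigest()


# A made-up UID, not taken from any real study.
image_id = encode_uid("1.2.840.113619.2.55.3.1234567890")
print(image_id, len(image_id))
```

Because the mapping is deterministic, the same UID always yields the same identifier, which lets the CSV rows and the DICOM file names refer to each other without exposing the original UID.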

### Folder structure

DICOM images and annotations are divided into three separate folders. The structure of the dataset is as follows:


Train
|
|
----
|-- 000434271f63a053c4128a0ba6352c7f.dicom
|-- 00053190460d56c53cc3e57321387478.dicom
|-- 0005e8e3701dfb1dd93d53e2ff537b6e.dicom
|-- 0006e0a85696f6bb578e84fafa9a5607.dicom
|-- 0007d316f756b3fa0baea2ff514ce945.dicom
|-- 000ae00eb3942d27e0b97903dd563a6e.dicom
|-- 000d68e42b71d3eac10ccc077aba07c1.dicom
|
.
.
.

Test
|
|
----
|-- 002a34c58c5b758217ed1f584ccbcfe9.dicom
|-- 004f33259ee4aef671c2b95d54e4be68.dicom
|-- 008bdde2af2462e86fd373a445d0f4cd.dicom
|-- 009bc039326338823ca3aa84381f17f1.dicom
|-- 00a2145de1886cb9eb88869c85d74080.dicom
|-- 00b7e6bfa4dc1fe9ddd0ce74743e38c2.dicom
|-- 011295e0bcdc7636569ab73bfdcc4450.dicom
|
.
.
.

Annotations
|
|
------
|-- annotations_test.csv
|-- annotations_train.csv
|-- image_labels_test.csv
|-- image_labels_train.csv
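
A minimal sketch of reading the bounding-box annotations with the Python standard library follows. The column names are those given in the Overview for annotations_train.csv; the rows, the radiologist identifiers, and the use of an in-memory string instead of the real file are illustrative assumptions.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using the documented columns of annotations_train.csv.
sample = io.StringIO(
    "image_id,rad_id,class_name,x_min,y_min,x_max,y_max\n"
    "000434271f63a053c4128a0ba6352c7f,R1,Cardiomegaly,100,200,400,500\n"
    "000434271f63a053c4128a0ba6352c7f,R1,Pleural effusion,50,600,250,900\n"
    "000434271f63a053c4128a0ba6352c7f,R2,Cardiomegaly,110,190,410,520\n"
)

# Group boxes by (image, radiologist), since each training image carries
# annotations from several radiologists.
boxes = defaultdict(list)
for row in csv.DictReader(sample):
    coords = tuple(float(row[k]) for k in ("x_min", "y_min", "x_max", "y_max"))
    boxes[(row["image_id"], row["rad_id"])].append((row["class_name"], coords))


def box_area(coords):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = coords
    return (x_max - x_min) * (y_max - y_min)


for (image_id, rad_id), items in boxes.items():
    for class_name, coords in items:
        print(image_id[:8], rad_id, class_name, box_area(coords))
```

To use the real file, replace `sample` with an open file handle, for example `open("annotations_train.csv", newline="")` relative to the Annotations folder.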


## Usage Notes

The VinDr-CXR dataset was created for the purpose of developing and evaluating algorithms for detecting and localizing anomalies in CXR scans. Part of the dataset was previously used in a CXR analysis competition on the Kaggle platform (https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection), which focused on the detection and segmentation of thoracic diseases. The dataset has also been used in a study about federated learning [24].

One limitation of this dataset is that it contains only CXR scans of adult patients. The dataset is therefore not suitable for developing and evaluating algorithms for the detection of CXR pathologies in pediatric patients.

## Release Notes

This is the first public release (v1.0.0) of the VinDr-CXR dataset.

## Acknowledgements

The authors would like to acknowledge the Hanoi Medical University Hospital and the Hospital 108 for providing us access to their image databases and for agreeing to make the VinDr-CXR dataset publicly available. We are especially thankful to all of our collaborators, including radiologists, physicians, and technicians, who participated in the data collection and labeling process.

## Conflicts of Interest

Vingroup Big Data Institute (VinBigdata) supported the creation of this resource. Ha Quy Nguyen, Hieu Huy Pham and Minh Dao are currently employed by VinBigdata. VinBigdata did not profit from the work done in this project.

## References

1. Rajpurkar, P. et al. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017)
2. Rajpurkar, P. et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Medicine 15, e1002686, https://doi.org/10.1371/journal.pmed.1002686 (2018)
3. Irvin, J. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 590–597 (2019)
5. Rajpurkar, P. et al. CheXpedition: Investigating generalization challenges for translation of chest X-ray algorithms to the clinical setting. arXiv preprint arXiv:2002.11379 (2020)
6. Tang, Y.-X. et al. Automated abnormality classification of chest radiographs using deep convolutional neural networks. npj Digit. Medicine 3, 1–8, https://doi.org/10.1038/s41746-020-0273-z (2020)
7. Pham, H. H., Le, T. T., Tran, D. Q., Ngo, D. T. & Nguyen, H. Q. Interpreting chest X-rays via CNNs that exploit hierarchical disease dependencies and uncertainty labels. arXiv preprint arXiv:1911.06475 (2020)
8. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444, https://doi.org/10.1038/nature14539 (2015)
9. Razzak, M. I., Naz, S. & Zaib, A. Deep learning for medical image processing: Overview, challenges and the future. In Classification in BioApps, 323–350, https://doi.org/10.1007/978-3-319-65981-7_12 (Springer, 2018)
10. Wang, X. et al. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2097–2106, https://doi.org/10.1109/CVPR.2017.369 (2017)
11. Bustos, A., Pertusa, A., Salinas, J.-M. & de la Iglesia-Vayá, M. Padchest: A large chest X-ray image dataset with multi-label annotated reports. arXiv preprint arXiv:1901.07441 (2019)
12. Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317, https://doi.org/10.1038/s41597-019-0322-0 (2019)
13. Oakden-Rayner, L. Exploring the ChestXray14 dataset: problems. https://lukeoakdenrayner.wordpress.com/2017/12/18/the-chestxray14-dataset-problems/ (2017). (Online; accessed 04 May 2020)
14. Shiraishi, J. et al. Development of a digital image database for chest radiographs with and without a lung nodule: Receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules. Am. J. Roentgenol. 174, 71–74, https://doi.org/10.2214/ajr.174.1.1740071 (2000)
15. Demner-Fushman, D. et al. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Informatics Assoc. 23, 304–310, https://doi.org/10.1093/jamia/ocv080 (2016)
16. Jaeger, S. et al. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Medicine Surg. 4, 475–477, https://dx.doi.org/10.3978%2Fj.issn.2223-4292.2014.11.20 (2014)
17. Smit, A. et al. CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv preprint arXiv:2004.09167 (2020)
18. Oakden-Rayner, L. Exploring large-scale public medical image datasets. Acad. Radiol. 27, 106–112, https://doi.org/10.1016/j.acra.2019.10.006 (2020). Special Issue: Artificial Intelligence
19. Nagendran, M. et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ 368, https://doi.org/10.1136/bmj.m689 (2020)
20. Vietnamese National Assembly. Regulation 40/2009/QH12 (Law on Medical Examination and Treatment). http://vbpl.vn/hanoi/Pages/vbpqen-toanvan.aspx?ItemID=10482 (2009). (Online; accessed 11 December 2020)
21. Isola, S. & Al Khalili, Y. Protected Health Information (PHI). https://www.ncbi.nlm.nih.gov/books/NBK553131/ (2019)
22. US Department of Health and Human Services. Summary of the HIPAA privacy rule. https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html (2003)
23. European Parliament and Council of European Union. Regulation (EU) 2016/679 (General Data Protection Regulation). https://gdpr-info.eu/ (2016). (Online; accessed 11 December 2020)
24. Yuan, Zhuoning, et al. Federated deep AUC maximization for heterogeneous data with a constant communication complexity. arXiv preprint arXiv:2102.04635 (2021).

##### Access

Access Policy:
Only credentialed users who sign the DUA can access the files.