Database Restricted Access
VinDr-PCXR: An open, large-scale pediatric chest X-ray dataset for interpretation of common thoracic diseases
Hieu Huy Pham, Tien Thanh Tran, Ha Quy Nguyen
Published: March 21, 2022. Version: 1.0.0
When using this resource, please cite:
Pham, H. H., Tran, T. T., & Nguyen, H. Q. (2022). VinDr-PCXR: An open, large-scale pediatric chest X-ray dataset for interpretation of common thoracic diseases (version 1.0.0). PhysioNet. https://doi.org/10.13026/k8qc-na36.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Computer-aided diagnosis systems in adult chest radiography (CXR) have recently achieved great success thanks to the availability of large-scale, annotated datasets and the advent of high-performance supervised learning algorithms. However, the development of diagnostic models for detecting and diagnosing pediatric diseases in CXR scans has lagged behind because high-quality physician-annotated datasets are lacking. To overcome this challenge, we introduce and release a new pediatric CXR dataset of 9,125 studies that were retrospectively collected from a major pediatric hospital in Vietnam between 2020 and 2021. Each scan was manually annotated by an experienced radiologist for the presence of 36 critical findings and 15 diseases. Each abnormal finding was localized with a rectangular bounding box on the image. To the best of our knowledge, this is the first and largest pediatric CXR dataset containing lesion-level labels and image-level labels for multiple findings and diseases. For algorithm development, the dataset is divided into a training set of 7,728 and a test set of 1,397.
Common thoracic diseases cause several hundred thousand deaths every year among children under five years old [1, 2]. The chest radiograph, or CXR, is one of the most commonly requested radiographic examinations in the assessment of pediatric patients. Interpreting pediatric CXR scans often requires a specialist in pediatric diagnostic imaging with in-depth knowledge of the radiological signs of different lung conditions. Computer-aided diagnosis (CAD) systems for identifying lung abnormalities in adult CXRs have recently achieved great success thanks to the availability of large labeled datasets [4, 5, 6, 7, 8]. Unfortunately, the creation of pediatric CXR datasets remains underexplored, and the number of benchmark pediatric CXR datasets is limited. This is the main obstacle to developing new machine learning-based CAD systems for pediatric CXR and transferring them into clinical practice.
In an effort to provide a large-scale pediatric CXR dataset with high-quality annotations for the research community, we have built the VinDr-PCXR dataset in DICOM format, retrospectively collected from a major pediatric hospital in Vietnam from 2020 to 2021. The dataset consists of 9,125 posteroanterior (PA) view CXR scans of patients younger than ten years and comes with both the localization of critical findings and the classification of common thoracic diseases. Compared to previous work, the VinDr-PCXR dataset has two main advantages. First, the dataset is labeled for multiple findings and diseases. Second, the dataset provides bounding box annotations at the lesion level, which is useful for developing explainable AI models for CXR interpretation.
Data collection was conducted at the Phu Tho Obstetric & Pediatric Hospital (PTOPH) between 2020 and 2021. The study received ethical clearance from the Institutional Review Board (IRB) of the PTOPH. The need for obtaining informed patient consent was waived because this retrospective study did not impact clinical care or workflow at the hospital, and all patient-identifiable information in the data has been removed. We retrospectively collected more than 10,000 CXRs in DICOM format from a local picture archiving and communication system (PACS) at PTOPH.
This study follows the HIPAA Privacy Rule to protect individually identifiable health information in the DICOM images. To this end, we removed or replaced with random values all personally identifiable information associated with the images via a two-stage de-identification process. In the first stage, a Python script was used to remove all DICOM tags containing protected health information (PHI), such as the patient's name, date of birth, patient ID, and acquisition time and date. In the second stage, we manually removed all textual information appearing in the image data (i.e., pixel annotations that could include identifiable patient information).
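The first de-identification stage can be illustrated with a minimal sketch. The authors' script and exact tag list are not given, so the code below works on a plain dictionary standing in for a DICOM header, and the names `PHI_REMOVE`, `PHI_REPLACE`, and `deidentify_header` are hypothetical. It only shows the "remove or replace with random values" logic, not real DICOM file handling.

```python
import secrets

# Hypothetical PHI attribute keywords; the authors' actual tag list is not published here.
PHI_REMOVE = {"PatientName", "PatientBirthDate", "AcquisitionDate", "AcquisitionTime"}
PHI_REPLACE = {"PatientID"}


def deidentify_header(header):
    """Return a copy of `header` with PHI fields dropped or randomized."""
    cleaned = {}
    for keyword, value in header.items():
        if keyword in PHI_REMOVE:
            continue  # drop the attribute entirely
        if keyword in PHI_REPLACE:
            # replace with a random 32-character hexadecimal string
            cleaned[keyword] = secrets.token_hex(16)
        else:
            cleaned[keyword] = value
    return cleaned


# Made-up header values for illustration only.
example = {
    "PatientName": "DOE^JANE",
    "PatientID": "12345",
    "PatientBirthDate": "20150101",
    "Modality": "CR",
}
cleaned = deidentify_header(example)
```

In practice the same logic would be applied to the attributes of each DICOM file with a DICOM library, and the second, manual stage would still be needed for text burned into the pixel data.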
The collected raw data included a significant number of outliers, including CXRs of adult patients, body parts other than the chest (abdomen, spine, and others), low-quality images, and lateral CXRs. To filter the large number of CXR scans, we trained a lightweight convolutional neural network (CNN) to remove outliers automatically. A manual verification was then performed to ensure that all outliers had been removed.
The VinDr-PCXR dataset was labeled for a total of 36 findings and 15 diagnoses. These labels were divided into two categories: local labels (1-36) and global labels (37-51). The selection of these labels took into account two factors: first, they are prevalent, and second, they can be differentiated on CXRs. To facilitate the labeling process, we designed and built a web-based framework called VinDr-Lab and had a team of 3 experienced radiologists remotely annotate the data. All the radiologists participating in the labeling process were certified in diagnostic radiology and received healthcare profession certificates from the Vietnamese Ministry of Health. Each sample in the training set was assigned to one radiologist for annotation. A total of 9,125 CXRs, randomly selected from the filtered data, were labeled, of which 7,728 scans serve as the training set and the remaining 1,397 scans form the test set.
The images were split into two folders: one for training and one for testing. The value of the SOP Instance UID provided by the DICOM tag (0008,0018) was encoded into a unique, anonymous identifier for each image. The annotations of the training set were provided in a CSV file, annotations_train.csv. Each row of the table represents a bounding box with the following attributes: image ID (image_id), radiologist ID (rad_id), label name (class_name), and bounding box coordinates (x_min, y_min, x_max, y_max). The identities of the 3 radiologists are encoded by rad_id, the coordinates of the box's upper left corner are (x_min, y_min), and the coordinates of the box's lower right corner are (x_max, y_max). The image-level labels of the training set were stored in a different CSV file, image_labels_train.csv, with the following fields: image ID (image_id), radiologist ID (rad_ID), and labels (labels) for both the findings and diagnoses. Each image ID is associated with a vector of multiple labels corresponding to different pathologies, with positive pathologies encoded as "1" and negative pathologies encoded as "0". Similarly, the test set's bounding-box annotations and image-level labels were saved in the files annotations_test.csv and image_labels_test.csv, respectively.
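As a rough illustration of the bounding-box file layout described above, the following sketch groups boxes by image using only the Python standard library. The sample rows are made up, the function name `load_boxes` is hypothetical, and the handling of rows with empty coordinates (for example, an image with no localized finding) is an assumption that should be checked against the actual annotations_train.csv.

```python
import csv
import io

# Made-up rows in the column layout described in the text; not real data.
SAMPLE = """image_id,rad_id,class_name,x_min,y_min,x_max,y_max
img_a,R1,Consolidation,100.0,150.0,300.0,400.0
img_a,R1,Other opacity,50.0,60.0,120.0,140.0
img_b,R2,No finding,,,,
"""

COORD_FIELDS = ("x_min", "y_min", "x_max", "y_max")


def load_boxes(handle):
    """Map each image_id to a list of (class_name, (x_min, y_min, x_max, y_max))."""
    boxes = {}
    for row in csv.DictReader(handle):
        image_boxes = boxes.setdefault(row["image_id"], [])
        raw = [row[name] for name in COORD_FIELDS]
        if any(value == "" for value in raw):
            # Assumption: rows without coordinates carry no box.
            continue
        coords = tuple(float(value) for value in raw)
        image_boxes.append((row["class_name"], coords))
    return boxes


boxes = load_boxes(io.StringIO(SAMPLE))
```

To read the real file, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle, for instance `open("annotations_train.csv", newline="")`, after confirming that the header names match.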
The primary uses for which the VinDr-PCXR dataset was conceptualized include:
- Developing and validating a predictive model for the classification of common thoracic diseases in pediatric patients.
- Developing and validating a predictive model for the localization of multiple abnormal findings via bounding boxes on the pediatric chest X-ray scans.
- The dataset has also been used previously in a study on diagnosing multiple diseases in pediatric chest radiographs using deep learning.
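For the localization use case, predicted boxes are commonly compared with annotated boxes using intersection over union (IoU). The source does not specify an evaluation metric, so the sketch below is one conventional choice rather than the authors' protocol. It assumes boxes in the (x_min, y_min, x_max, y_max) layout used by the annotation files.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection

    # Guard against degenerate boxes with zero total area.
    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes overlapping in a 5x5 region: 25 / (100 + 100 - 25).
overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))
```

A prediction is then typically counted as a match when its IoU with an annotated box of the same class exceeds a chosen threshold; the threshold itself is a design choice left to the user.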
The released dataset has limitations that still need to be addressed in the future, including:
- The dataset does not contain clinical information associated with the DICOM images, which is important for the interpretation of CXRs in pediatric patients.
- The number of examples for rare diseases or findings is limited. Hence, training supervised learning algorithms, which require large-scale annotated datasets, on the VinDr-PCXR dataset to diagnose these rare diseases and findings is not reliable.
Another limitation of this dataset is that it contains only CXR scans of pediatric patients. The dataset is therefore not suitable for developing and evaluating algorithms for the detection of CXR pathologies in adult patients.
This is the first public release (v1.0) of the VinDr-PCXR dataset.
The authors declare no ethics concerns. Release of the deidentified data was approved by the Institutional Review Board of Phu Tho Obstetric & Pediatric Hospital (PTOPH).
The authors would like to acknowledge the Phu Tho Obstetric & Pediatric Hospital (PTOPH) for providing us access to their image databases and for agreeing to make the dataset publicly available.
Conflicts of Interest
VinBigData JSC supported the creation of this resource. Ha Quy Nguyen and Thanh T. Tran are currently employed by VinBigData. VinBigData JSC did not profit from the work done in this project.
- GBD 2015 LRI Collaborators. Estimates of the global, regional, and national morbidity, mortality, and aetiologies of lower respiratory tract infections in 195 countries: a systematic analysis for the Global Burden of Disease Study 2015. The Lancet Infect. Dis. 17, 1133–1161 (2017).
- Wardlaw, T. M., Johansson, E. W., Hodge, M., World Health Organization & United Nations Children's Fund (UNICEF). Pneumonia: the forgotten killer of children (2006).
- Hart, A. & Lee, E. Y. Pediatric chest disorders: Practical imaging approach to diagnosis. Dis. Chest, Breast, Hear. Vessel. 2019-2022, 107–125 (2019).
- Wang, X. et al. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2097–2106, https://doi.org/10.1109/CVPR.2017.369 (2017).
- Bustos, A., Pertusa, A., Salinas, J.-M. & de la Iglesia-Vayá, M. PadChest: A large chest X-ray image dataset with multi-label annotated reports. arXiv preprint arXiv:1901.07441 (2019).
- Irvin, J. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 590–597 (2019).
- Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317, https://doi.org/10.1038/s41597-019-0322-0 (2019).
- Nguyen, H. Q. et al. VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations. arXiv preprint arXiv:2012.15029 (2020).
- US Department of Health and Human Services. Summary of the HIPAA privacy rule. https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html (2003).
- Isola, S. & Al Khalili, Y. Protected Health Information (PHI). https://www.ncbi.nlm.nih.gov/books/NBK553131/ (2019).
- Pham, H. H., Do, D. V. & Nguyen, H. Q. DICOM imaging router: An open deep learning framework for classification of body parts from DICOM X-ray scans. arXiv preprint arXiv:2108.06490 (2021).
- Tran, T. T. et al. Learning to automatically diagnose multiple diseases in pediatric chest radiographs using deep convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshop (2021).
Only registered users who sign the specified data use agreement can access the files.
License (for files):
PhysioNet Restricted Health Data License 1.5.0
Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0