Database Credentialed Access

BRAX, a Brazilian labeled chest X-ray dataset

Published: June 17, 2022. Version: 1.1.0

Reis, E. P., Paiva, J., Bueno da Silva, M. C., Sousa Ribeiro, G. A., Fornasiero Paiva, V., Bulgarelli, L., Lee, H., dos Santos, P. V., Brito, V., Amaral, L., Beraldo, G., Haidar Filho, J. N., Teles, G., Szarf, G., Pollard, T., Johnson, A., Celi, L. A., & Amaro, E. (2022). BRAX, a Brazilian labeled chest X-ray dataset (version 1.1.0). PhysioNet. https://doi.org/10.13026/grwk-yh18.

Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

The Brazilian labeled chest x-ray dataset (BRAX) is an automatically labeled dataset designed to assist researchers in the validation of machine learning models. The dataset contains 24,959 chest radiography studies from patients presenting to a large general Brazilian hospital. A total of 40,967 images are available in the BRAX dataset. All images have been verified by trained radiologists and de-identified to protect patient privacy. Fourteen labels were derived from free-text radiology reports written in Brazilian Portuguese using Natural Language Processing.

Background

Chest radiographs account for a large share of the imaging studies performed in hospitals worldwide [1]. Because of intensive work routines and the need for fast diagnoses, they are often evaluated by the requesting physicians, who are trained to interpret chest radiographs but lack the expertise of thoracic radiologists [2,3]. Moreover, the demand for specialized evaluation of x-rays usually exceeds the number of available radiologists [4]. Machine Learning (ML) algorithms that support clinical decisions have become increasingly popular in several radiology contexts: workflow optimization [5], detection of relevant imaging alterations to support disease diagnosis [6], and automated generation of radiology reports [7,8]. These solutions can be especially useful in under-resourced regions and communities with a shortage of radiologists [9]. However, developing ML solutions for radiology requires high-quality annotations and a larger number of datasets to train and validate algorithms. Geographic diversity, which accounts for demographic and phenotypic variation, is also particularly important to the generalizability of AI models [10].

Various initiatives have been developed in recent years, mainly including data from high-income countries, with reports written in English [10]. This is extremely relevant since Natural Language Processing (NLP) algorithms are heavily dependent on the language – i.e. the majority of NLP algorithms used for extraction of labels only work for English-based datasets (e.g., CheXpert [9] and MIMIC-CXR [11]). Fourteen labels were derived through NLP from free-text radiology reports written in Brazilian Portuguese. The NLP solution was largely based on the CheXpert labeler, adapted to detect negation and uncertainty in Portuguese [12].

We hope this dataset helps reduce the under-representation of populations in the pool of chest radiograph datasets available for developing clinical decision support models.

Methods

Data collection

All data was obtained from Hospital Israelita Albert Einstein (HIAE). Images were extracted from PACS (Picture Archiving and Communication System). All chest radiography studies with available reports in the institutional PACS were considered for inclusion. Studies that contained radiographs with burned-in sensitive data (i.e. patient name, patient identity, and image display specifications) or indication of rare prostheses that could facilitate patient identification were excluded.

Anonymization procedure

DICOM header anonymization was performed with an algorithm developed in-house, based on a previously described procedure [13,14] and following the rules of the MIRC ClinicalTrialProcessor (CTP) DICOM Anonymizer. The application removed DICOM metadata that could be used to identify patients without compromising the relevant clinical information. As an extra conservative step, we also removed all free-text fields contained in the header. The fields StudyDate, SeriesDate, AcquisitionDate and ContentDate were replaced with fictitious dates by a hashing procedure that retains the original time intervals between study acquisitions, so that chronological information is not lost.
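The idea of fictitious dates that keep the intervals between studies can be illustrated with a short sketch. This is not the in-house algorithm, whose details are not given here. It assumes one fixed offset per patient, derived from a keyed hash of the patient identifier; the function names, the secret key and the maximum offset are all hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical values for illustration only; not taken from the BRAX pipeline.
SECRET_KEY = b"replace-with-a-private-key"
MAX_SHIFT_DAYS = 3650


def patient_offset(patient_id):
    """Derive a fixed, patient-specific day offset from a keyed hash."""
    digest = hmac.new(SECRET_KEY, patient_id.encode("utf-8"), hashlib.sha256).digest()
    days = int.from_bytes(digest[:4], "big") % MAX_SHIFT_DAYS + 1
    return timedelta(days=days)


def shift_date(patient_id, original):
    """Return a fictitious date; all dates of one patient move by the same offset."""
    return original - patient_offset(patient_id)


first = date(2019, 3, 1)
second = date(2019, 3, 15)
shifted_first = shift_date("patient-A", first)
shifted_second = shift_date("patient-A", second)

# The interval between the two studies is unchanged by the shift.
interval_preserved = (shifted_second - shifted_first) == (second - first)
```

Because the offset depends only on the patient identifier, the order of and spacing between a patient's studies survive, while the calendar dates themselves no longer match the originals.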

Images were reviewed by a board-certified radiologist (E.P.R.) with over 2 years of experience to identify burned-in sensitive data. The images were also double-checked by 5 other radiologists with up to 2 years of experience (M.C.B.S, H.M.H.L, G.L.B, V.M.B, and L.T.W.A), so that each chest radiograph was reviewed by two radiologists, increasing confidence in the application of the exclusion criteria.

Automated labeling of the radiology reports

Labels were extracted from free-text radiology reports using natural language processing. We adapted NegEx [16] for Portuguese by translating a list of English negation and uncertainty triggers, initially applying Google Translator [17] to speed up the process before human verification. New triggers were also added based on technical knowledge of Brazilian Portuguese expressions related to “negation” and “uncertainty” commonly found in radiographic reports. Both the verification of the translated list and the inclusion of new triggers were carried out by three radiologists who are native speakers of Brazilian Portuguese. Subsequently, we used the CheXpert Label Extraction Algorithm [9] to derive labels from either the findings section or the final section of the report (if neither impression nor findings sections were present). We considered 14 labels – Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged cardiomediastinum, Fracture, Lung lesion, Lung opacity, No findings, Pleural effusion, Pleural other, Pneumonia, Pneumothorax, and Support devices – representing the most common chest radiographic observations, in line with previous studies [9,11].
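To give a feel for trigger-based negation and uncertainty detection, here is a heavily simplified sketch in the spirit of NegEx. It is not the labeler used for BRAX: the trigger lists below are a handful of illustrative Portuguese expressions chosen for this example, not the verified lists described above, and the function name `classify_mention` is hypothetical. The sketch only looks at triggers that precede the finding within one sentence.

```python
import re

# Illustrative triggers only; the verified BRAX trigger lists are not reproduced here.
NEGATION_TRIGGERS = ["sem", "ausência de", "não há"]
UNCERTAINTY_TRIGGERS = ["possível", "provável", "não se pode excluir"]


def _has_trigger(text, triggers):
    """Check whether any trigger occurs in the text as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(t)}\b", text) for t in triggers)


def classify_mention(sentence, finding):
    """Return 1 (positive), 0 (negated), -1 (uncertain), or None if not mentioned."""
    text = sentence.lower()
    position = text.find(finding.lower())
    if position == -1:
        return None
    preceding = text[:position]
    # Uncertainty is checked first so that a phrase such as
    # "não se pode excluir" is not mistaken for a plain negation.
    if _has_trigger(preceding, UNCERTAINTY_TRIGGERS):
        return -1
    if _has_trigger(preceding, NEGATION_TRIGGERS):
        return 0
    return 1


examples = [
    ("Sem sinais de pneumotórax.", "pneumotórax"),
    ("Possível consolidação no lobo inferior direito.", "consolidação"),
    ("Derrame pleural à esquerda.", "derrame pleural"),
]
results = [classify_mention(sentence, finding) for sentence, finding in examples]
```

A full NegEx-style implementation also handles triggers that follow the finding and terms that end a trigger's scope; both are omitted here for brevity.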

Data Description

The BRAX dataset provides 40,967 images from 24,959 imaging studies of 19,351 patients presenting to the Hospital Israelita Albert Einstein. An overview of the released dataset folder structure is provided below. PatientID is a unique identifier for a single patient, and the same patient can have multiple studies. A study is a collection of images associated with a single report and is identified by a unique AccessionNumber. Radiograph images in different view positions (usually frontal or lateral views) may be stored in separate series or in the same series, depending on the modality and on how the DICOMs were generated during acquisition.

Overview

BRAX contains:

• Anonymized_DICOMs folder: all DICOM images organized in the sub-folders according to the patient identifier, studies, series and images (see the section Folder Structure)
• images folder: the same structure as the Anonymized_DICOMs folder but containing PNG files instead of DICOM files
• master_spreadsheet.csv: the main dataset table containing the identifiers for each image, its labels and associated metadata. Columns are detailed below.

Columns

• DicomPath: Path to the DICOM images. As part of the de-identification procedure, the DICOMs were assigned randomly generated ID numbers.
• PngPath: Path to the PNG images.
• PatientID: Patient's identifier. As part of the de-identification procedure, the patient IDs were created from randomly generated numbers.
• PatientSex: Patient's sex. Enumerated Values: “M” for male; “F” for female; “O” for other.
• PatientAge: Age of the patient, provided in 5-year age groups. Patients aged 85 or over are classified as "85 or more".
• AccessionNumber: A DICOM identifier of the Study. As part of the de-identification procedure, the AccessionNumber was randomly generated.
• StudyDate: Fictitious date of the study.
• Labels: The following columns hold the labels. Each is coded "1" for positive, "0" for negative and "-1" for uncertain.
  • No Finding: value is 1 if no other label is present, except for Support Devices.
  • Enlarged Cardiomediastinum
  • Cardiomegaly
  • Lung Lesion
  • Lung Opacity
  • Edema
  • Consolidation
  • Pneumonia
  • Atelectasis
  • Pneumothorax
  • Pleural Effusion
  • Pleural Other
  • Fracture
  • Support Devices
• ViewPosition: Radiographic view associated with Patient Position. Defined Terms: AP - Anterior/Posterior; PA - Posterior/Anterior; LL - Left Lateral; RL - Right Lateral; RLD - Right Lateral Decubitus; LLD - Left Lateral Decubitus; RLO - Right Lateral Oblique; LLO - Left Lateral Oblique.
• Rows: Size (number of pixels) in the vertical axis of the image matrix.
• Columns: Size (number of pixels) in the horizontal axis of the image matrix.
• Manufacturer: Index of the manufacturer of the imaging equipment. The manufacturer's name is coded as an integer to conceal the actual manufacturer while still allowing future research on possible biases related to vendor or machine settings.
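The columns above can be read with Python's standard `csv` module. The following is a minimal sketch that runs on a few synthetic rows with made-up identifiers and only three of the fourteen label columns. Treating an empty label cell as `None` is an assumption, since the description above defines only the codes 1, 0 and -1, and the file encoding in the commented `open` call is likewise assumed.

```python
import csv
import io

LABEL_COLUMNS = ["No Finding", "Cardiomegaly", "Pneumonia"]  # subset of the 14 labels
LABEL_MEANING = {1: "positive", 0: "negative", -1: "uncertain"}

# Synthetic rows for illustration; identifiers, paths and values are made up.
SAMPLE_CSV = """PatientID,AccessionNumber,PngPath,No Finding,Cardiomegaly,Pneumonia
p1,a1,images/p1/a1/s1/img1.png,1,0,0
p1,a2,images/p1/a2/s1/img2.png,0,1,-1
p2,a3,images/p2/a3/s1/img3.png,0,,1
"""


def parse_label(raw):
    """Convert a CSV cell to 1, 0 or -1; an empty cell becomes None (assumption)."""
    raw = raw.strip()
    return int(float(raw)) if raw else None


def load_rows(handle):
    """Read the spreadsheet and convert the label columns to integers."""
    rows = []
    for row in csv.DictReader(handle):
        for column in LABEL_COLUMNS:
            row[column] = parse_label(row[column])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
# With the real file (encoding assumed):
# rows = load_rows(open("master_spreadsheet.csv", newline="", encoding="utf-8"))

# Studies whose report was labeled as uncertain for pneumonia.
uncertain_pneumonia = [r["AccessionNumber"] for r in rows if r["Pneumonia"] == -1]

# A patient can have several studies, each identified by its AccessionNumber.
studies_per_patient = {}
for r in rows:
    studies_per_patient.setdefault(r["PatientID"], set()).add(r["AccessionNumber"])
```

Since one patient can contribute several studies and images, splitting data by PatientID rather than by row is a common way to keep the same patient out of both training and validation sets.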

Folder Structure

Image files are provided in individual folders. An example of the folder structure for a single patient's images is as follows:

Anonymized_DICOMs
│
└── id_00082e3a-ec11c281-24a79518-35d3cc78-22432fb1
    │
    ├── Study_09342613.22970294.40563343.35634289.53163857
    │   │
    │   ├── Series_34523850.21768222.07508551.49190893.14603932
    │   │   └── image-48219538-15808688-10728535-52591088-74513595.dcm
    │   │
    │   └── Series_46177599.95157937.50203011.63555832.78161828
    │       └── image-16153862-94805167-26028517-34518684-13054667.dcm
    │
    └── Study_51027964.83117427.20948980.39828954.71003607
        │
        ├── Series_57104384.74837822.26263330.97688944.88328246
        │   └── image-48651870-23127024-63651831-17193122-94277772.dcm
        │
        └── Series_72993604.79060724.14705971.37953714.05369399
            └── image-08788867-77959894-95405066-47915205-10581326.dcm

The example above shows a single patient. The patient folder name starts with "id" followed by the value of the "PatientID" DICOM tag. This patient has two radiographic studies; each study folder name starts with "Study" followed by the value of the "StudyInstanceUID" DICOM tag. Each study has one or more series folders, whose names start with "Series" followed by the value of the "SeriesInstanceUID" DICOM tag. Each series folder holds one or more x-ray DICOM files, whose names start with "image" followed by the value of the "SOPInstanceUID" DICOM tag. All identifiers were randomly generated, and their order has no relation to the chronological order of the actual studies.
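The naming scheme makes it possible to recover the four identifiers from a file path alone. The sketch below shows one way to do this; the function name `parse_dicom_path` is hypothetical, and the code assumes forward-slash paths shaped like the example tree, which may differ from how paths are written in the DicomPath column.

```python
from pathlib import PurePosixPath


def _strip_prefix(name, prefix):
    """Remove an expected prefix, failing loudly if the name has another shape."""
    if not name.startswith(prefix):
        raise ValueError(f"expected {name!r} to start with {prefix!r}")
    return name[len(prefix):]


def parse_dicom_path(path):
    """Split .../id_<patient>/Study_<uid>/Series_<uid>/image-<uid>.dcm into its parts."""
    parts = PurePosixPath(path).parts
    if len(parts) < 4:
        raise ValueError(f"path too short: {path!r}")
    patient, study, series, image = parts[-4:]
    return {
        "PatientID": _strip_prefix(patient, "id_"),
        "StudyInstanceUID": _strip_prefix(study, "Study_"),
        "SeriesInstanceUID": _strip_prefix(series, "Series_"),
        # .stem drops only the final ".dcm" extension of the file name.
        "SOPInstanceUID": _strip_prefix(PurePosixPath(image).stem, "image-"),
    }


example_path = (
    "Anonymized_DICOMs/"
    "id_00082e3a-ec11c281-24a79518-35d3cc78-22432fb1/"
    "Study_09342613.22970294.40563343.35634289.53163857/"
    "Series_34523850.21768222.07508551.49190893.14603932/"
    "image-48219538-15808688-10728535-52591088-74513595.dcm"
)
info = parse_dicom_path(example_path)
```

Here `info["PatientID"]` holds the patient identifier without the "id_" prefix, and the other three keys hold the corresponding randomly generated UIDs.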

Usage Notes

This is the first Brazilian chest x-ray dataset, and future releases may provide a larger volume of data. Free-text reports are not provided in the current version. We hope this dataset helps reduce the under-representation of populations in the pool of chest radiograph datasets available for developing clinical decision support models. Best practice guidelines should be followed when analyzing the data.

Release Notes

v1.0.0: Initial release.

v1.1.0: Includes an updated version of master_spreadsheet.csv to fix a CSV formatting issue.

For future releases we wish to include data related to social determinants of health, such as the neighborhood where the patient lives. Inclusion of sensitive attributes in a way that does not significantly increase the risk of re-identification is important in order to tackle biases that are known to disproportionately impact marginalized populations.

Ethics

The project was approved by the Institutional Review Board of Hospital Israelita Albert Einstein (#35503420.8.0000.0071). The requirement for individual patient consent was waived. The study database was anonymized, with all identifiable patient information removed.

Acknowledgements

The creation of this dataset was funded by the MIT-Brazil TVML Seed Fund award (project "Developing a Publicly Accessible Brazilian Dataset of Chest X-Rays").

Conflicts of Interest

The authors have no conflicts of interest to declare.

References

2. Singh R, Kalra MK, Nitiwarangkul C, Patti JA, Homayounieh F, Padole A, et al. Deep learning in chest radiography: Detection of findings and presence of change. PloS One. 2018;13(10):e0204155.
3. Putha P, Tadepalli M, Reddy B, Raj T, Chiramal JA, Govil S, et al. Can Artificial Intelligence Reliably Report Chest X-Rays?: Radiologist Validation of an Algorithm trained on 2.3 Million X-Rays. ArXiv180707455 Cs [Internet]. 2019 Jun 4; Available from: http://arxiv.org/abs/1807.07455
4. Dall T, Reynolds R, Jones K, Chakrabarti R, Iacobucci W. The Complexities of Physician Supply and Demand: Projections from 2017 to 2032. Assoc Am Med Coll. 2019;86–86.
5. Letourneau-Guillon L, Camirand D, Guilbert F, Forghani R. Artificial Intelligence Applications for Workflow, Process Optimization and Predictive Analytics. Neuroimaging Clin N Am. 2020;30(4):e1–15.
6. Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis [Internet]. The Lancet Digital Health. 2019. p. e271–97. Available from: https://www.thelancet.com/journals/landig/article/PIIS2589-7500(19)30123-2/fulltext
7. Monshi MMA, Poon J, Chung V. Deep learning in generating radiology reports: A survey [Internet]. Artificial Intelligence in Medicine. 2020. Available from: https://www.sciencedirect.com/science/article/pii/S0933365719302635
8. Babar Z, van Laarhoven T, Zanzotto FM, Marchiori E. Evaluating diagnostic content of AI-generated radiology reports of chest X-rays [Internet]. Artificial Intelligence in Medicine. 2021. Available from: https://www.sciencedirect.com/science/article/pii/S0933365721000683
9. Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. 33rd AAAI Conf Artif Intell AAAI 2019 31st Innov Appl Artif Intell Conf IAAI 2019 9th AAAI Symp Educ Adv Artif Intell EAAI 2019. 2019 Sep;590–7.
10. Kaushal A, Altman R, Langlotz C. Geographic distribution of US cohorts used to train deep learning algorithms [Internet]. JAMA - Journal of the American Medical Association. 2020. p. 1212–3.
11. Johnson AEW, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng C ying, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data. 2019 Sep;6(1):317–317.
12. Portuguese language. In: Wikipedia [Internet]. 2021. Available from: https://en.wikipedia.org/w/index.php?title=Portuguese_language&oldid=1045527814
13. Mayo RC, Leung J. Artificial intelligence and deep learning – Radiology’s next frontier? Clin Imaging. 2018;49:87–8.
14. National Electrical Manufacturers Association. PS3.15 [Internet]. Digital imaging and communications in medicine (DICOM) PS3.15 2020b - Security and System Management Profiles. 2020. Available from: http://dicom.nema.org/medical/dicom/current/output/html/part15.html
15. Lowekamp BC, Chen DT, Ibanez L, Blezek D. The Design of SimpleITK. Front Neuroinformatics. 2013 Dec;0:45.
16. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries. J Biomed Inform. 2001 Oct;34(5):301–10.
17. Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. 2016 Sep; Available from: https://arxiv.org/abs/1609.08144v2

Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
