Database Credentialed Access

Chest X-ray Dataset with Lung Segmentation

Wimukthi Indeewara, Mahela Hennayake, Kasun Rathnayake, Thanuja Ambegoda, Dulani Meedeniya

Published: Feb. 8, 2023. Version: 1.0.0

When using this resource, please cite:
Indeewara, W., Hennayake, M., Rathnayake, K., Ambegoda, T., & Meedeniya, D. (2023). Chest X-ray Dataset with Lung Segmentation (version 1.0.0). PhysioNet.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Chest X-ray (CXR) images are prominent among medical images and are commonly used in emergency diagnosis and treatment of cardiac and respiratory diseases. Though robust solutions are available for medical diagnosis, the validation of artificial intelligence (AI) in radiology is still questionable. Segmentation is pivotal for chest radiographs because it helps improve existing AI-based medical diagnosis processes. We provide the CXLSeg dataset: Chest X-ray with Lung Segmentation, a comparatively large dataset of segmented chest radiographs based on MIMIC-CXR, a popular CXR image dataset. The dataset contains segmentation results for 243,324 frontal-view images of the MIMIC-CXR dataset and the corresponding masks. Additionally, this dataset can be utilized for computer vision-related deep learning tasks such as medical image classification, semantic segmentation, and medical report generation. Models using segmented images yield better results since they focus only on the features related to the important areas of the image. Thus, images of this dataset can be fed into any visual feature extraction process associated with the original MIMIC-CXR dataset and can enhance the results of published or novel investigations. Furthermore, masks provided by this dataset can be used to train segmentation models when combined with the MIMIC-CXR-JPG dataset. The SA-UNet model achieved a Dice similarity coefficient of 96.80% and an IoU of 91.97% for lung segmentation using CXLSeg.


X-ray imaging exposes the body to a small dose of ionizing radiation, producing radiographs that depict the interior of the body [1]. Segmentation of such X-ray images is important since it aids in detecting lung cancers by isolating only the region of interest [2]. Semantic segmentation is one of the popular deep learning techniques that aid computer vision-related medical image analysis. Robust deep learning techniques have enabled several automatic segmentation methods that can be used to generate more precise segmented images [3, 4, 5]. Moreover, medical image segmentation enables accurate analysis of anatomical data by isolating only the most important part of the medical image [6]. However, examining these radiographs manually is a cumbersome task, and an automated system with advanced deep learning techniques is highly preferred.

Among the limited chest X-ray datasets, Shenzhen and Montgomery [7, 8] are two of the most widely used for image segmentation tasks. Compared to the Shenzhen dataset, the Montgomery dataset has a larger lung area in the provided images. In both datasets, images are provided in PNG format. In addition, the Montgomery dataset provides the manually lung-segmented CXR images that were used to train the associated segmentation model [9]. JSRT (Japanese Society of Radiological Technology) [10] is another comparatively small chest X-ray dataset with 247 CXR images of 2048 x 2048 resolution. The JSRT database was created by the Japanese Society of Radiological Technology in cooperation with the Japanese Radiological Society (JRS) in 1998. Another similar dataset was introduced by V7-labs, a software company in London, England, that provides toolkits for training data engines. Their V7-labs COVID-19 X-ray dataset [11] initially contained 1602 normal images, 4250 pneumonia images, and 439 Covid-19 images, but its lateral X-rays do not contain lung segmentations.

In all the datasets mentioned above, a text file is provided with the patient’s age, gender, and abnormality of each image. Compared to the above datasets, our proposed dataset contains 243,324 frontal segmented chest X-ray images, which makes it the largest segmented chest X-ray dataset available with medical reports for medical image segmentation tasks. We used a combination of dice loss and cross-entropy loss to calculate the loss of the segmentation model [12]. We compared different U-Net variants and selected SA-UNet [13], which yielded the best CXR image segmentation results. The SA-UNet model yielded scores of 96.03%, 91.97%, 96.05%, and 96.23% for dice coefficient, IoU (intersection over union), recall, and precision, respectively.

Moreover, the segmentation model reached a minimum loss of 12.59%. The CXLSeg dataset can also be utilized as a lung segmentation dataset for biomedical image segmentation models using the masks provided and the original MIMIC-CXR-JPG dataset. Additionally, the proposed CXLSeg dataset can be used to obtain better performance in medical report generation, since visual feature extraction is more effective and the corresponding preprocessed reports [14] for each segmented image are provided along with the dataset.
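
The metrics and the combined loss mentioned above can be illustrated with a minimal NumPy sketch. It assumes binary (lung versus background) masks, binary cross-entropy as the cross-entropy term, and an equal weighting of the two loss terms; none of these details are stated in the text, and the function names and the smoothing constant EPS are illustrative only.

```python
import numpy as np

EPS = 1e-7  # small constant to avoid division by zero and log(0); assumed value


def dice_coefficient(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for masks with values in [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    intersection = np.sum(pred * target)
    return (2.0 * intersection + EPS) / (pred.sum() + target.sum() + EPS)


def iou(pred, target):
    """Intersection over union for masks with values in [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    intersection = np.sum(pred * target)
    union = pred.sum() + target.sum() - intersection
    return (intersection + EPS) / (union + EPS)


def binary_cross_entropy(prob, target):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    prob = np.clip(np.asarray(prob, dtype=np.float64), EPS, 1.0 - EPS)
    target = np.asarray(target, dtype=np.float64)
    losses = -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))
    return float(np.mean(losses))


def combined_loss(prob, target):
    """Dice loss plus cross-entropy, equally weighted (the weighting is an assumption)."""
    return (1.0 - dice_coefficient(prob, target)) + binary_cross_entropy(prob, target)


# Toy 2x2 example: one of the two predicted lung pixels is correct.
pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
print(dice_coefficient(pred, target), iou(pred, target))
```

In the toy example the Dice coefficient (2/3) is higher than the IoU (1/2), which matches the general relationship Dice = 2·IoU / (1 + IoU) and explains why the reported Dice scores exceed the reported IoU.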


U-Net models are widely used in image segmentation tasks due to their unique architecture and pixel-based loss weighting scheme [15, 16, 17, 13]. When it comes to available segmented CXR datasets, many of them contain only a small number of images, and class imbalance is a common issue. To overcome this, data augmentation techniques are applied to improve the results. In this case, the SA-UNet [13] architecture was used for segmentation because it is more efficient with augmented data. The model was trained using the Adam optimizer with a learning rate of 0.00005 for 20 epochs.

The SA-UNet model was trained using a combined dataset of three popular CXR segmentation datasets: the Shenzhen dataset, the Montgomery dataset, and the V7-Labs Covid-19 dataset. Combining the three datasets gives the model a larger training set and therefore better results. Augmentation steps such as flipping, rotating, zooming, cropping, and scaling were used to address class imbalance [18]. After augmenting the combined dataset, there were 3940 images of Tuberculosis, 3619 images of Covid-19, 3987 images of Pneumonia, and 3850 images of Normal, for a total of 15,396 images. This combined dataset was split in a ratio of 80:10:10 for training, testing, and validation, respectively. CLAHE [19], a variant of adaptive histogram equalization, was used to optimize contrast, and the images were resized to 224x224 before being fed into the model.
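
As a quick check of the numbers above, the following sketch sums the per-class counts and performs an 80:10:10 split. It is only an illustration: the text does not say whether the split was shuffled or stratified by class, so the shuffling, the seed, and the function name split_80_10_10 are assumptions.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle items and split them 80:10:10 into train, test, and validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n = len(shuffled)
    n_train = n * 8 // 10  # integer arithmetic avoids floating-point rounding
    n_test = n // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]  # remainder goes to validation
    return train, test, val


# Per-class image counts after augmentation, as reported in the text.
class_counts = {
    "Tuberculosis": 3940,
    "Covid-19": 3619,
    "Pneumonia": 3987,
    "Normal": 3850,
}
total = sum(class_counts.values())
train, test, val = split_80_10_10(range(total))
print(total, len(train), len(test), len(val))  # 15396 12316 1539 1541
```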

The trained model was saved and later loaded to segment CXR images from the MIMIC-CXR dataset to obtain the CXLSeg dataset. The MIMIC-CXR dataset contained 243,334 frontal CXR images, but 10 images were removed after a sanity check due to missing medical reports. Thus, 243,324 frontal CXR images were segmented using the loaded model, and morphological techniques, including erosion and dilation, were used to further improve the generated mask. The segmented image was obtained by performing a bitwise AND operation on the original image and the generated mask. Both the segmented image and the mask were saved in JPEG format, in a folder structure similar to that of the original MIMIC-CXR dataset, after being resized to 224x224, which significantly reduced the size of the dataset and made it more portable.
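
The post-processing described above can be sketched in plain NumPy. This is a simplified illustration, not the pipeline actually used: the 3x3 structuring element, the erosion-then-dilation order, and the zero-padded border handling are all assumptions, and the helper names are hypothetical.

```python
import numpy as np


def _shifted_views(mask):
    """Yield the nine 3x3-neighbourhood shifts of a zero-padded boolean mask."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    height, width = mask.shape
    for dy in range(3):
        for dx in range(3):
            yield padded[dy:dy + height, dx:dx + width]


def erode(mask):
    """Binary erosion: a pixel survives only if its whole 3x3 neighbourhood is set."""
    out = np.ones(np.shape(mask), dtype=bool)
    for view in _shifted_views(mask):
        out &= view
    return out


def dilate(mask):
    """Binary dilation: a pixel is set if any pixel in its 3x3 neighbourhood is set."""
    out = np.zeros(np.shape(mask), dtype=bool)
    for view in _shifted_views(mask):
        out |= view
    return out


def apply_mask(image, mask):
    """Bitwise AND of a uint8 image with a 0/255 mask, keeping only masked pixels."""
    mask_u8 = np.where(np.asarray(mask, dtype=bool), 255, 0).astype(np.uint8)
    return np.bitwise_and(np.asarray(image, dtype=np.uint8), mask_u8)


# Toy example: a 3x3 "lung" region plus one isolated noise pixel.
raw_mask = np.zeros((7, 7), dtype=bool)
raw_mask[2:5, 2:5] = True
raw_mask[0, 6] = True
clean_mask = dilate(erode(raw_mask))        # erosion removes the noise pixel
image = np.full((7, 7), 200, dtype=np.uint8)
segmented = apply_mask(image, clean_mask)   # pixels outside the mask become 0
```

Erosion followed by dilation removes small isolated false positives while restoring the size of the larger regions; the reverse order would instead fill small holes.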

Data Description

CXLSeg mainly contains:

  • A set of 10 folders and 6500 sub-folders containing the JPG images of the segmented X-rays and the corresponding masks for each individual patient’s frontal-view X-ray. (The layout is nearly identical to that of MIMIC-CXR [20].)
  • CXLSeg-metadata.csv file, which provides metadata for each image. It includes the fields Dicom id, Subject id, Study id, Reports, View Position, Performed Procedure Step Description, Rows, Columns, Study Date, Study Time, Procedure Code Sequence Code Meaning, View Code Sequence Code Meaning, and Patient Orientation Code Sequence Code Meaning.
  • CXLSeg-segmented.csv file, which provides the DicomPath (the image path for each study) and the labels derived from the CheXpert labeler and the NegBio labeler.
  • CXLSeg-split.csv file, which contains the train, test, and validation splits in the ratio 80:10:10.
  • CXLSeg-mask.csv file, which provides the mask path for the frontal view images in each study.

Images and masks for each study are structured as follows:

└── p10
    └── p10000032
        ├── s50414267
        │   ├── 02aa804e-bde0afdd-112c0b34-7bc16630-4e384014.jpg
        │   └── 02aa804e-bde0afdd-112c0b34-7bc16630-4e384014-mask.jpg
        ├── s53189527
        │   ├── 2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab.jpg
        │   └── 2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab-mask.jpg
        └── s53911762
            ├── 68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714.jpg
            ├── 68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714-mask.jpg
            ├── fffabebf-74fd3a1f-673b6b41-96ec0ac9-2ab69818.jpg
            └── fffabebf-74fd3a1f-673b6b41-96ec0ac9-2ab69818-mask.jpg

Each patient has a unique id that is preceded by the character ‘p’. In the above example, p10000032 is the patient’s unique id. Since it starts with ‘p10’, that patient’s studies are located inside the p10 folder. One patient can have multiple studies, and all studies for that patient are contained in the patient’s folder. This structure is similar to that of the MIMIC-CXR dataset.

Standard columns available in all the CSV files

The columns below are present in all the CSV files for the user’s convenience.

  • dicom_id - An identifier for the DICOM file. The stem of each JPG image filename is equal to the dicom_id.
  • subject_id - An integer unique for an individual patient.
  • study_id - An integer unique for an individual study (i.e., an individual radiology report with one or more images associated with it).

The DicomPath in CXLSeg-segmented.csv (image path) or the DicomPath in CXLSeg-mask.csv (mask path) can be derived as follows:

  • For the segmented image - subject_id/study_id/dicom_id.jpg
  • For the mask - subject_id/study_id/dicom_id-mask.jpg
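
A small sketch of this derivation is given below. It follows the directory tree shown earlier (top-level folder, then ‘p’ + subject_id, then ‘s’ + study_id); whether the DicomPath column stores paths in exactly this form is an assumption, and the function name cxlseg_relative_path is hypothetical.

```python
from pathlib import PurePosixPath


def cxlseg_relative_path(subject_id, study_id, dicom_id, mask=False):
    """Build the relative path of a segmented image (or its mask) from the ids."""
    subject_dir = f"p{subject_id}"   # e.g. "p10000032"
    top_level_dir = subject_dir[:3]  # e.g. "p10", taken from the tree shown above
    study_dir = f"s{study_id}"       # e.g. "s50414267"
    suffix = "-mask" if mask else ""
    return PurePosixPath(top_level_dir, subject_dir, study_dir, f"{dicom_id}{suffix}.jpg")


dicom_id = "02aa804e-bde0afdd-112c0b34-7bc16630-4e384014"
print(cxlseg_relative_path(10000032, 50414267, dicom_id))
print(cxlseg_relative_path(10000032, 50414267, dicom_id, mask=True))
```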

Metadata CSV

The metadata CSV file closely follows the metadata CSV file of the MIMIC-CXR dataset. In addition to the MIMIC-CXR metadata columns, the columns study_id, subject_id, and Reports are added to this CSV file.

  • Reports - Medical report for the corresponding image.
  • PerformedProcedureStepDescription - The type of study performed ("CHEST (PA AND LAT)", "CHEST (PORTABLE AP)", etc.)
  • ViewPosition - The orientation in which the chest radiograph was taken ("AP", "PA", "LATERAL", etc.)
  • Rows - The height of the image in pixels.
  • Columns - The width of the image in pixels.
  • StudyDate - An anonymized date for the radiographic study. All images from the same study will have the same date and time. Dates are anonymized, but chronologically consistent for each patient. Intervals between two scans have not been modified during de-identification.
  • StudyTime - The time of the study in hours, minutes, seconds, and fractional seconds. The time of the study was not modified during de-identification.
  • ProcedureCodeSequence_CodeMeaning - The human-readable description of the coded procedure (e.g., "CHEST (PA AND LAT)"). Descriptions follow Simon-Leeming codes [21].
  • ViewCodeSequence_CodeMeaning - The human-readable description of the coded view orientation for the image (e.g., "postero-anterior", "antero-posterior", "lateral").
  • PatientOrientationCodeSequence_CodeMeaning - The human-readable description of the patient orientation during the image acquisition. Three values are possible: "Erect", "Recumbent", or a null value (missing).
  • DicomPath - Path to the segmented frontal image of each study.

Split CSV

This CSV is nearly identical to the MIMIC-CXR split CSV, but here the data is split into train, test, and validation sets in the ratio 80:10:10.

  • split - a string field indicating the data split for this file, one of 'train', 'validate', or 'test'.
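
One way to use the split column is to group image identifiers by split, as in the sketch below. It reads an in-memory stand-in for CXLSeg-split.csv; the rows, ids, and the function name group_by_split are made up for illustration, and only the column names come from the description above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for CXLSeg-split.csv; the ids and rows are not real data.
SAMPLE_SPLIT_CSV = """dicom_id,subject_id,study_id,split
img-a,1,11,train
img-b,1,12,train
img-c,2,21,validate
img-d,3,31,test
"""


def group_by_split(csv_file):
    """Return a dict mapping each split name to the list of dicom_ids in it."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["split"]].append(row["dicom_id"])
    return dict(groups)


splits = group_by_split(io.StringIO(SAMPLE_SPLIT_CSV))
print(splits["train"])
```

With the real file, the same function could be called on an open file handle, for example `open("CXLSeg-split.csv", newline="")`.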

Segmented CSV

This CSV contains the DicomPath and the labels derived from the CheXpert labeler and the NegBio labeler [22, 23]. The CheXpert labels present are:

  • Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged Cardiomediastinum, Fracture, Lung Lesion, Lung Opacity, Pleural Effusion, Pneumonia, Pneumothorax, Pleural Other, Support Devices, No Finding
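
For classification, these labels are often converted into a binary multi-label vector. The sketch below assumes that the label columns follow the usual CheXpert labeler encoding (1.0 positive, 0.0 negative, -1.0 uncertain, blank when not mentioned); this page does not state the encoding used in CXLSeg-segmented.csv, so treat it as an assumption. Mapping uncertain and blank values to 0 is one possible policy, not necessarily the one used by the authors.

```python
# The 14 label names, in the order listed above.
LABELS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "Pleural Effusion", "Pneumonia", "Pneumothorax", "Pleural Other",
    "Support Devices", "No Finding",
]


def to_binary_vector(row, uncertain_as=0):
    """Convert one CSV row (dict of strings) into a 0/1 vector over LABELS.

    Assumed encoding: "1.0" positive, "0.0" negative, "-1.0" uncertain,
    empty string when the finding is not mentioned.
    """
    vector = []
    for name in LABELS:
        raw = row.get(name, "")
        if raw is None or raw == "":
            vector.append(0)             # not mentioned: treated as negative
        elif float(raw) == 1.0:
            vector.append(1)
        elif float(raw) == -1.0:
            vector.append(uncertain_as)  # policy choice for uncertain labels
        else:
            vector.append(0)
    return vector


example_row = {"Edema": "1.0", "Pneumonia": "-1.0", "No Finding": ""}
print(to_binary_vector(example_row))
```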

Mask CSV

This CSV can be used to train a segmentation model together with the original images of the MIMIC-CXR dataset.

  • DicomPath - mask path of each frontal image.

Usage Notes

This dataset can mainly be used to optimize visual feature extraction from lung X-ray images and, together with the MIMIC-CXR dataset, to train new segmentation models using the masks provided. Additionally, it can be utilized for medical report generation tasks using the provided CSV files, which include medical reports. Since a split CSV file is provided with the dataset, different models can be compared against each other on identical training, validation, and test data. For convenience, we have provided a code repository [24] to help users utilize the dataset for classification, segmentation, and report generation tasks more efficiently. The proposed dataset still contains dataset bias similar to that of the original MIMIC-CXR dataset. In addition, the segmented images were not examined by professionals to verify that the lungs were properly segmented.

Release Notes

This is an extended version of MIMIC-CXR 2.0.0, which provides JPG-formatted lung-segmented image files, the masks used for segmentation, a script to customize the dataset, and free-text radiology reports.


This dataset is derived from MIMIC-CXR-JPG and exists under the same IRB.


We acknowledge the contributors of the MIMIC-CXR and MIMIC-CXR-JPG datasets for granting permission to use the dataset for research purposes.

Conflicts of Interest

We declare no competing interests.


  1. S. Sudalaimuthu, M. Joy Thomas, S. Senthil Kumar, and V. Vinod Kumar. Effects of x-ray radiation on solid insulating materials. In 2006 IEEE Conference on Electrical Insulation and Dielectric Phenomena, pages 489–492, 2006.
  2. R. D. S. Portela, J. R. G. Pereira, M. G. F. Costa, and C. F. F. Costa Filho. Lung region segmentation in chest x-ray images using deep convolutional neural networks. In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine Biology Society (EMBC), pages 1246–1249, 2020.
  3. Dulani Meedeniya, Hashara Kumarasinghe, Shammi Kolonne, Chamodi Fernando, Isabel De la Torre Díez, and Gonçalo Marques. Chest x-ray analysis empowered with deep learning: A systematic review. Applied Soft Computing, 126:109319, 2022. doi: //
  4. Hashara Kumarasinghe, Shammi Kolonne, Chamodi Fernando, and Dulani Meedeniya. U-net based chest x-ray segmentation with ensemble classification for covid-19 and pneumonia. International Journal of Online and Biomedical Engineering (iJOE), 18(07):161–175, 2022.
  5. Masoumeh Dorri Giv. Lung segmentation using active shape model to detect the disease from chest radiography. Journal of Biomedical Physics and Engineering, 11(06), December 2021. doi:
  6. Risheng Wang, Tao Lei, Ruixia Cui, Bingtao Zhang, Hongying Meng, and Asoke K. Nandi. Medical image segmentation using deep learning: A survey. IET Image Processing, 16(5):1243–1267, January 2022.
  7. Stefan Jaeger, Sema Candemir, Sameer Antani, Yì-Xiáng J Wáng, Pu-Xuan Lu, and George Thoma. Two public chest x-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Med. Surg., 4(6):475–477, December 2014.
  8. Tianhao Xu and Zhenming Yuan. Convolution neural network with coordinate attention for the automatic detection of pulmonary tuberculosis images on chest x-rays. IEEE Access, 10:86710–86717, 2022. doi:
  9. Yurong Chen, Hui Zhang, Yaonan Wang, Lizhu Liu, Q. M. Jonathan Wu, and Yimin Yang. Tae-seg: Generalized lung segmentation via tilewise autoencoder enhanced network. IEEE Transactions on Instrumentation and Measurement, 71:1–13, 2022. doi: TIM.2022.3217870.
  10. Junji Shiraishi, Shigehiko Katsuragawa, Junpei Ikezoe, Tsuneo Matsumoto, Takeshi Kobayashi, Ken-ichi Komatsu, Mitate Matsui, Hiroshi Fujita, Yoshie Kodera, and Kunio Doi. Development of a digital image database for chest radiographs with and without a lung nodule. American Journal of Roentgenology, 174(1):71–74, 2000. doi:
  11. Xuehai He, Xingyi Yang, Shanghang Zhang, Jinyu Zhao, Yichen Zhang, Eric Xing, and Pengtao Xie. Sample-efficient deep learning for covid-19 diagnosis based on ct scans. medrxiv, 2020.
  12. J. H. Moltz, A. Hänsch, B. Lassen-Schmidt, B. Haas, A. Genghi, J. Schreier, T. Morgas, and J. Klein. Learning a loss function for segmentation: A feasibility study. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pages 357–360, 2020. doi: https://
  13. Changlu Guo, Márton Szemenyei, Yugen Yi, Wenle Wang, Buer Chen, and Changqi Fan. Sa-unet: Spatial attention u-net for retinal vessel segmentation, 2020. doi: ARXIV.2004.03696.
  14. Alistair Johnson, Tom Pollard, Seth Berkowitz, Nathaniel Greenbaum, Matthew Lungren, Chihying Deng, Roger Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6:317, 12 2019. doi: https://doi.org/10.1038/s41597-019-0322-0.
  15. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation, 2015. doi:
  16. Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. UNet++: redesigning skip connections to exploit multiscale features in image segmentation, 2019. doi:
  17. Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, Ben Glocker, and Daniel Rueckert. Attention u-net: Learning where to look for the pancreas, 2018. doi: https://doi.org/10.48550/arXiv.1804.03999.
  18. Ida Arvidsson, Niels Christian Overgaard, Kalle Åström, and Anders Heyden. Comparison of different augmentation techniques for improved generalization performance for gleason grading. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pages 923–927, 2019. doi:
  19. Garima Yadav, Saurabh Maheshwari, and Anjali Agarwal. Contrast limited adaptive histogram equalization based enhancement for real time video system. In 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pages 2392–2397, 2014. doi:
  20. Alistair E. W. Johnson, Tom J. Pollard, Nathaniel R. Greenbaum, Matthew P. Lungren, Chihying Deng, Yifan Peng, Zhiyong Lu, Roger G. Mark, Seth J. Berkowitz, and Steven Horng. Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs, 2019. doi: https: //
  21. Morris Simon, Brian W. Leeming, Howard L. Bleich, Barney Reiffen, Jim Byrd, Donald Blair, and David Shimm. Computerized radiology reporting using coded language. Radiology, 113(2):343–349, November 1974. doi:
  22. Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison, 2019. doi: 07031.
  23. Yifan Peng, Xiaosong Wang, Le Lu, Mohammadhadi Bagheri, Ronald Summers, and Zhiyong Lu. Negbio: a high-performance tool for negation and uncertainty detection in radiology reports, 2017. doi:
  24. Wimukthi Nimalsiri, Mahela Hennayake, Kasun Rathnayake, Thanuja D. Ambegoda, and Dulani Meedeniya. CXLSeg Dataset: Chest X-ray with Lung Segmentation, 12 2022. doi: https: //

Parent Projects
Chest X-ray Dataset with Lung Segmentation was derived from: Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
