Database Credentialed Access
Generalized Image Embeddings for the MIMIC Chest X-Ray dataset
Published: Feb. 22, 2023. Version: 1.0
When using this resource, please cite:
Sellergren, A., Kiraly, A., Pollard, T., Weng, W., Liu, Y., Uddin, A., & Chen, C. (2023). Generalized Image Embeddings for the MIMIC Chest X-Ray dataset (version 1.0). PhysioNet. https://doi.org/10.13026/pxc2-vx69.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
This database contains image embeddings computed from the DICOM images of the MIMIC Chest X-Ray (MIMIC-CXR) database, version 2.0.0, using the approach described in “Simplified Transfer Learning for Chest Radiography Models Using Less Data”. The embeddings are compact, information-rich numerical vectors that can be used for classification tasks. Because the embeddings already encode the relevant image features, training a classifier on them requires fewer compute resources and less data than training on the full CXR images to reach the same classification performance.
The MIMIC Chest X-ray (MIMIC-CXR) Database is a large publicly available dataset of chest radiographs in DICOM format with free-text radiology reports [1,2]. The dataset comprises over 370 thousand images corresponding to over 227 thousand radiographic studies performed at the Beth Israel Deaconess Medical Center in Boston, MA, USA. While MIMIC-CXR is a valuable resource for research, imaging data can present challenges for analysis in terms of data volume and computing requirements.
The CXR Foundation service captures the key information in a chest X-ray (CXR) as a vector of floating-point values referred to as an embedding. This information-rich embedding is extracted from the feature space of a deep learning model trained to classify abnormalities within the image. The embeddings can be used to train a fully connected network for classification, an approach that has been shown to outperform training on the full CXR images with the same amount of data [4, 5]. Training requires significantly fewer computing resources, making specialized hardware such as a GPU unnecessary.
We applied the CXR Foundation service to the MIMIC-CXR database to generate image embeddings for use by the research community. These embeddings are a compact and information-rich representation of the source images. They can be used to train classification models with performance on par with models trained on the source images, while using minimal resources.
The data was prepared by extracting CXR images from version 2.0.0 of the MIMIC-CXR DICOM dataset into a cloud bucket and computing embeddings for the images using the CXR Foundation service, version 1.0 [3]. The CXR Foundation model uses the EfficientNet-L2 architecture [6]. It was trained on 821,544 CXRs from India and the US using abnormal vs. normal labels (i.e., whether the image contained any kind of abnormality) and the Supervised Contrastive loss [7]. The abnormal vs. normal labels were obtained from more granular labels (e.g., pneumothorax, fracture) as well as regular expressions applied to radiology reports [8].
Specifically, we applied version 1.0 of the open source CXR Foundation Python toolkit to MIMIC-CXR (version 2.0.0) [1,3]. Each DICOM image is parsed by the CXR Foundation API via an Apache Beam pipeline to generate the final set of embeddings, using the following command:
python -m run_inference --input_path "gs://your/cloud/bucket/inputs/" --output_path "gs://your/cloud/bucket/outputs/" --embeddings_project gh-rad-validation-cxrembd-deid --endpoint_id 6695981832690728960 --input_file_type='dicom'
The input_path can be either a local directory or a Google Cloud Storage bucket containing the MIMIC DICOMs. The output path will contain one TFRecord file per image, named after the image id in the DICOM file name.
One TFRecord file is written per DICOM image and named after the dicom_id of the DICOM file, which can be used to link the embedding to other features in the MIMIC database. Hence, for any given <dicom_id>, the file is <dicom_id>.tfrecord. TFRecord is a simple format for storing a sequence of binary records [9]. Each file contains a serialized TFExample whose feature keys are described below.
The data consists of one 4,096-dimensional numerical vector per CXR image. Each vector is stored in its own TFRecord file, and the image id of the CXR DICOM file is used as the name of the corresponding TFRecord file. Each TFRecord file contains a serialized TFExample with the following feature keys:
image/id: the path including the image ID of the original DICOM file.
embedding: the vector of 4,096 float values containing the embedding.
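Because each file is named <dicom_id>.tfrecord, the file name alone is enough to join an embedding to other MIMIC tables. The following is a minimal sketch of that join using only the Python standard library. The helper names, the example identifiers, and the metadata columns (subject_id, study_id) are illustrative assumptions, not part of the released data or the CXR Foundation toolkit.

```python
import csv
import io
from pathlib import Path


def dicom_id_from_path(path):
    """Return the dicom_id encoded in a '<dicom_id>.tfrecord' file name."""
    p = Path(path)
    if p.suffix != ".tfrecord":
        raise ValueError(f"not a TFRecord file: {path}")
    return p.stem


def link_to_metadata(tfrecord_paths, metadata_rows, id_column="dicom_id"):
    """Pair each TFRecord path with the metadata row that shares its dicom_id."""
    by_id = {row[id_column]: row for row in metadata_rows}
    linked = {}
    for path in tfrecord_paths:
        dicom_id = dicom_id_from_path(path)
        if dicom_id in by_id:
            linked[dicom_id] = {"tfrecord": str(path), **by_id[dicom_id]}
    return linked


# Made-up identifiers and columns, for illustration only (assumption).
metadata_csv = io.StringIO(
    "dicom_id,subject_id,study_id\n"
    "aaaa-1111,10000001,50000001\n"
    "bbbb-2222,10000002,50000002\n"
)
rows = list(csv.DictReader(metadata_csv))
paths = [
    "./data/outputs/aaaa-1111.tfrecord",
    "./data/outputs/cccc-3333.tfrecord",  # no matching metadata row
]
linked = link_to_metadata(paths, rows)
print(linked)
```

Files without a matching metadata row are skipped rather than raising an error. Whether that is the right behavior depends on how strictly the two sources are expected to agree.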
The embedding is a representation of the CXR image that encodes abnormalities within the image. Images with similar abnormalities have embeddings with similar values, whereas images with different abnormalities, or a normal and an abnormal image, have more distant embeddings.
The embeddings can be used to train X-ray classification models. Training models on the embeddings requires significantly fewer computing resources than training on images, making specialized hardware such as a GPU unnecessary. For example, we have provided Google Colaboratory notebooks that show how to train a model for a classification task given per-image labels.
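One way to make "similar" and "distant" concrete is to compare two embedding vectors with a similarity measure. The sketch below uses cosine similarity, which is an assumption on our part: the source does not name a specific metric. The vectors are random stand-ins of length 4,096, not real embeddings.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)
dim = 4096  # length of one embedding vector

# Synthetic vectors: 'near' is a slightly perturbed copy of 'base',
# 'far' is drawn independently of it.
base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
near = [x + rng.gauss(0.0, 0.1) for x in base]
far = [rng.gauss(0.0, 1.0) for _ in range(dim)]

sim_near = cosine_similarity(base, near)
sim_far = cosine_similarity(base, far)
print(f"near: {sim_near:.3f}, far: {sim_far:.3f}")
```

With real data, the two inputs would be the embedding vectors read from two TFRecord files. A value close to 1 indicates vectors pointing in nearly the same direction, and a value near 0 indicates unrelated directions.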
The following provides Python code examples to access the embeddings and to train a model using them. Full examples and code details can be found in the Colab notebook.
Example: Accessing the embedding data from a TFRecord.
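    import glob
    import tensorflow as tf

    raw_dataset = tf.data.TFRecordDataset(glob.glob('./data/outputs/*.tfrecord'))
    for raw_record in raw_dataset.take(1):
        # Each record is a serialized tf.train.Example.
        example = tf.train.Example()
        example.ParseFromString(raw_record.numpy())
        # Feature holding the 4,096 embedding values.
        embedding_vector = example.features.feature['embedding']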
Example: Training a fully connected Neural Network with 2 hidden layers using the embeddings and labels associated with the data for classification.
    python -m train \
      --train_split_name train \
      --tune_split_name tune \
      --labels_csv ./data/labels.csv \
      --head_name AIRSPACE_OPACITY \
      --data_dir ./data/outputs/ \
      --num_epochs 100
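To illustrate what a fully connected network with 2 hidden layers on top of embeddings looks like, here is a minimal NumPy sketch trained with plain gradient descent. It is not the train module invoked above. The data are synthetic stand-ins (64 dimensions instead of 4,096), and the layer sizes, learning rate, and epoch count are arbitrary assumptions chosen to keep the example fast.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for embeddings and binary labels (not MIMIC data).
n, dim = 400, 64  # real embeddings have 4,096 dimensions
w_true = rng.normal(size=dim)
x = rng.normal(size=(n, dim))
y = (x @ w_true > 0).astype(np.float64).reshape(-1, 1)

# Two hidden layers (32 and 16 units) followed by one sigmoid output unit.
sizes = [dim, 32, 16, 1]
weights = [
    rng.normal(scale=np.sqrt(2.0 / m), size=(m, k))
    for m, k in zip(sizes[:-1], sizes[1:])
]
biases = [np.zeros((1, k)) for k in sizes[1:]]


def forward(inputs):
    """Return the activations of every layer, input included."""
    acts = [inputs]
    for i, (w, b) in enumerate(zip(weights, biases)):
        z = acts[-1] @ w + b
        is_output = i == len(weights) - 1
        acts.append(1.0 / (1.0 + np.exp(-z)) if is_output else np.maximum(z, 0.0))
    return acts


def cross_entropy(p, target):
    """Mean binary cross-entropy, clipped for numerical safety."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))


losses = []
learning_rate = 0.05
for epoch in range(300):
    acts = forward(x)
    losses.append(cross_entropy(acts[-1], y))
    # Gradient of the mean cross-entropy w.r.t. the output pre-activation.
    delta = (acts[-1] - y) / n
    for i in reversed(range(len(weights))):
        grad_w = acts[i].T @ delta
        grad_b = delta.sum(axis=0, keepdims=True)
        if i > 0:
            # Backpropagate through the ReLU that produced acts[i].
            delta = (delta @ weights[i].T) * (acts[i] > 0)
        weights[i] -= learning_rate * grad_w
        biases[i] -= learning_rate * grad_b

accuracy = float(np.mean((forward(x)[-1] > 0.5) == (y > 0.5)))
print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}, train accuracy {accuracy:.2f}")
```

Because the inputs are fixed-length vectors rather than images, the whole model is a handful of small matrix multiplications, which is why training of this kind runs comfortably on a CPU.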
Outside of PhysioNet/MIMIC, users can access CXR Foundation, the model used to generate these embeddings, by submitting an application via our online form [10]. The model can be used to generate embeddings in the same format for other sets of chest X-ray images. The generated embeddings can be used directly with any classifier built on the embeddings shared here, or added to them to train classifiers with more data.
The CXR Foundation model was trained using only data from the US and India and may not generalize well to data from other countries, patient populations, conditions, or manufacturers not used in training. These limitations may apply to subsets of MIMIC that are not represented in this training data.
This is the first release.
Data consists of extracted image embeddings from the existing MIMIC dataset. The CXR Foundation API does not retain a copy of the images it receives nor does it retain a copy of the embeddings it creates.
The authors would like to acknowledge Arnav Agharwal, Eric Wu, and Chen Xie for their efforts in making the CXR Foundation service and this embeddings data available.
Conflicts of Interest
The authors of this data work for Google LLC.
1. Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng, S. (2019). MIMIC-CXR Database (version 2.0.0). PhysioNet. https://doi.org/10.13026/C2JT1Q.
2. Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data 6, 317 (2019). https://doi.org/10.1038/s41597-019-0322-0
3. CXR Foundation code on GitHub. https://github.com/Google-Health/imaging-research/tree/master/cxr-foundation [Accessed: 12 January 2022]
4. Simplified Transfer Learning for Chest Radiography Model Development [Internet]. [cited 2022 Dec 15]. Available from: https://ai.googleblog.com/2022/07/simplified-transfer-learning-for-chest.html
5. Sellergren AB, Chen C, Nabulsi Z, Li Y, Maschinot A, Sarna A, et al. Simplified Transfer Learning for Chest Radiography Models Using Less Data. Radiology. 2022 Nov;305(2):454–65.
6. Xie Q, Luong MT, Hovy E, Le QV. Self-Training With Noisy Student Improves ImageNet Classification [Internet]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020. Available from: http://dx.doi.org/10.1109/cvpr42600.2020.01070
7. Khosla, Prannay, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. "Supervised contrastive learning." Advances in neural information processing systems 33 (2020): 18661-18673.
8. Nabulsi Z, Sellergren A, Jamshy S, Lau C, Santos E, Kiraly AP, et al. Deep learning for distinguishing normal versus abnormal chest radiographs and generalization to two unseen diseases tuberculosis and COVID-19. Sci Rep. 2021 Sep 1;11(1):15523.
9. TensorFlow documentation on TFRecord: https://www.tensorflow.org/tutorials/load_data/tfrecord [Accessed: 12 January 2022]
10. CXR Foundation Access Form [Internet]. Google Docs. [cited 2022 Dec 15]. Available from: https://docs.google.com/forms/d/e/1FAIpQLSek0P-JSwSfonIiZJlz7gOTbL0lugsDug0FUnMhS1zVzpEKlg/viewform
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
CITI Data or Specimens Only Research