Database Credentialed Access
Published: Aug. 2, 2019. Version: 1.0.0
Johnson, A., Pollard, T., Mark, R., Berkowitz, S., Horng, S. (2019). MIMIC-CXR Database. PhysioNet. doi:10.13026/C2JT1Q

Please include the standard citation for PhysioNet:
Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals (2003). Circulation. 101(23):e215-e220.
The MIMIC Chest X-ray (MIMIC-CXR) Database v1.0.0 is a large publicly available dataset of chest radiographs with structured labels. The dataset contains 371,920 images corresponding to 224,548 radiographic studies performed at the Beth Israel Deaconess Medical Center in Boston, MA. The dataset is de-identified to satisfy the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) Safe Harbor requirements. Protected health information (PHI) has been removed. The dataset is intended to support a wide body of research in medicine including image understanding, natural language processing, and decision support.
Chest radiography is a common imaging modality used to assess the thorax and is the most common medical imaging study in the world. Chest radiographs are used to identify acute and chronic cardiopulmonary conditions, to verify that devices such as pacemakers, central lines, and tubes are correctly positioned, and to assist in related medical workups. In the U.S., the number of radiologists as a percentage of the physician workforce is decreasing, and the geographic distribution of radiologists favors larger, more urban counties. Delays and backlogs in timely medical imaging interpretation have demonstrably reduced care quality in large health organizations such as the U.K. National Health Service and the U.S. Department of Veterans Affairs. The situation is even worse in resource-poor areas, where radiology services are extremely scarce. As of 2015, only 11 radiologists served the 12 million people of Rwanda, while the entire country of Liberia, with a population of four million, had only two practicing radiologists. Accurate automated analysis of radiographs has the potential to improve the efficiency of radiologist workflow and to extend expertise to under-served regions.
The creation of MIMIC-CXR required handling three distinct data modalities: electronic health record data, images (chest radiographs), and natural language (free-text reports). These three modalities were processed approximately independently and ultimately combined to create the database.
Electronic health record
The BIDMC operates a locally built electronic health record (EHR) to store and process clinical data. A collection of images associated with a single report is referred to as a study. We queried the BIDMC EHR for chest x-ray studies made in the emergency department between 2011 - 2016, and extracted the set of patient identifiers associated with these studies. We subsequently extracted all chest x-ray studies for this set of patients between 2011 - 2016.

For anonymization purposes, two sets of random identifiers were generated. First, a random identifier was generated for each patient in the range 10,000,000 - 19,999,999, which we refer to as the subject_id. Each patient was also assigned a date shift which mapped their first index admission year to a year between 2100 - 2200. This ensures anonymity of the data while preserving the relative chronology of patient information, which is crucial for appropriate processing of the data. Second, each report was associated with a single unique identifier. We generated a random identifier for each study in the range 50,000,000 - 59,999,999. We refer to the anonymized study identifier as the study_id. As multiple images may be associated with the same study (e.g. one frontal and one lateral image), multiple images in MIMIC-CXR may share the same study_id.
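The identifier scheme above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the actual generation code: the function names are hypothetical, and only the identifier ranges and the chronology-preserving date shift come from the description above.

```python
import random

def assign_anonymous_ids(patient_ids, report_ids, seed=0):
    """Hypothetical sketch: draw unique random identifiers in the
    ranges described above for patients and studies."""
    rng = random.Random(seed)
    # Unique random subject_ids in 10,000,000 - 19,999,999.
    subject_ids = dict(zip(patient_ids,
                           rng.sample(range(10_000_000, 20_000_000),
                                      len(patient_ids))))
    # Unique random study_ids in 50,000,000 - 59,999,999.
    study_ids = dict(zip(report_ids,
                         rng.sample(range(50_000_000, 60_000_000),
                                    len(report_ids))))
    return subject_ids, study_ids

def date_shift(first_admission_year, seed=0):
    # Map the patient's first index admission year into 2100 - 2200; adding
    # the same offset to all of that patient's dates preserves chronology.
    rng = random.Random(seed)
    return rng.randint(2100, 2200) - first_admission_year
```

Because a single per-patient offset is reused for every date, intervals between a patient's studies are unchanged even though the absolute dates are anonymized.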
Images

Chest radiographs were sourced from the hospital picture archiving and communication system (PACS) in Digital Imaging and Communications in Medicine (DICOM) format. DICOM is a common format for medical images which facilitates interoperability of many distinct medical devices. Put simply, the DICOM format contains structured metadata associated with one or more images, and the DICOM standard stipulates strict rules around the structure of this information.
The acquired DICOM images contained PHI which required removal for conformance with HIPAA. Images sometimes contain "burned in" annotations: areas where pixel values have been modified after image acquisition in order to display text. Annotations contain relevant information including the image orientation, the anatomical position of the subject, the timestamp of image capture, and so on. The resulting image, with textual annotations encoded within the pixels themselves, is then transferred from the modality to the PACS. Since the annotations are applied at the modality level, it is impossible to recover the original image without annotations.
Due to the burned-in annotations, image pixel values required de-identification. A custom algorithm was developed which removed dates and patient identifiers but retained radiologically relevant information such as orientation. The algorithm applied an ensemble of image preprocessing and optical character recognition approaches to detect text within an image. Images were binarized to enhance the contrast of the text with the background. Three thresholds were used to binarize the image: one based on the maximum pixel intensity, one based on the minimum pixel intensity, and one using a fixed pixel value frequently used by the modality when adding text. Optical character recognition was performed using the tesseract library v3.05.02. Text was classified as PHI using a set of custom regular expressions which aimed to be conservative in removal of text and to allow for errors in the optical character recognition. If a body of text was suspected to be PHI, all pixel values in a bounding box encompassing the PHI were set to black.
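A minimal sketch of this pipeline is below, assuming NumPy arrays for pixel data. The PHI pattern shown and the function names are illustrative stand-ins (the actual regular expressions are custom and unpublished), and the OCR step, represented here by a list of pre-computed text detections, would in practice be performed by tesseract on the binarized images.

```python
import re

import numpy as np

# Hypothetical PHI pattern; the actual regular expressions are custom
# and intentionally conservative.
PHI_PATTERNS = [re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")]  # e.g. burned-in dates

def binarize(image, fixed_value=255):
    """Return the three binarizations described above: one keyed to the
    maximum intensity, one to the minimum, and one to a fixed pixel value
    often used by the modality for burned-in text."""
    return [(image == image.max()).astype(np.uint8),
            (image == image.min()).astype(np.uint8),
            (image == fixed_value).astype(np.uint8)]

def redact(image, detections):
    """Black out bounding boxes for OCR detections matching a PHI pattern.
    `detections` is a list of (text, (row0, row1, col0, col1)) tuples, as
    an OCR engine such as tesseract would produce."""
    out = image.copy()
    for text, (r0, r1, c0, c1) in detections:
        if any(p.search(text) for p in PHI_PATTERNS):
            out[r0:r1, c0:c1] = 0  # set suspected PHI pixels to black
    return out
```

Setting the whole bounding box to black, rather than attempting any inpainting, matches the conservative approach described above.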
Subsequent to pixel value de-identification, we manually reviewed 6,900 radiographs for PHI. Each image was reviewed by two independent annotators. 180 images were flagged for a secondary consensus review, none of which ultimately contained PHI. The most common causes for annotators to request consensus review were: (1) existence of a support device such as a pacemaker, (2) text identifying an in-hospital location (e.g. "MICU"), and (3) obscure text relating to radiograph technique (e.g. "prt rr slot 11").
Labeling of the radiology reports
Radiology reports at the source hospital are semi-structured, with radiologists documenting their interpretations in titled sections. The structure of these reports is generally consistent through the use of standardized documentation templates, though it can drift over time as the templates change. There can also be some inter-reporter variability, as the structure of the reports is not enforced by the user interface and can be overridden by the user.
The two primary sections of interest are findings, a natural language description of the important aspects in the image, and impression, a short summary of the most immediately relevant findings. Labels for the images were derived from the impression section, the findings section (if impression was not present), or the final section of the report (if neither an impression nor a findings section was present). Of the total 227,943 reports, 82.4% had an impression section, 12.5% had a findings section, and 5.1% had neither.
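The fallback logic for choosing which text to label can be sketched as follows. The parsing details are illustrative assumptions, in particular that section headers appear in upper case followed by a colon at the start of a line:

```python
import re

def sections_for_labeling(report):
    """Pick the text to label: the impression section if present, else the
    findings section, else the final section of the report. Assumes headers
    of the form 'IMPRESSION:' / 'FINDINGS:' starting a line."""
    # Split the report just before each upper-case header.
    parts = re.split(r"\n(?=[A-Z ]+:)", report)
    titled = {}
    for part in parts:
        m = re.match(r"([A-Z ]+):\s*(.*)", part, flags=re.S)
        if m:
            titled[m.group(1).strip().lower()] = m.group(2).strip()
    if "impression" in titled:
        return titled["impression"]
    if "findings" in titled:
        return titled["findings"]
    return parts[-1].strip()  # fall back to the last section
```

In practice the real reports are noisier than this sketch assumes (headers drift across templates, as noted above), so any production parser needs a larger set of header variants.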
Labels were determined using the open source CheXpert labeler. CheXpert is a rule-based classifier which proceeds in three stages: (1) extraction, (2) classification, and (3) aggregation. In the extraction stage, all mentions of a label are identified, including alternate spellings, synonyms, and abbreviations (e.g. for pneumothorax, the words "pneumothoraces" and "ptx" would also be captured). Mentions are then classified as positive, uncertain, or negative using local context. Finally, aggregation is necessary as there may be multiple mentions of a label. Priority is given to positive mentions, followed by uncertain mentions, and lastly negative mentions. If a positive mention exists, the label is positive. If there is no positive mention but an uncertain mention exists, even alongside negative mentions, the label is uncertain. These stages are used to define all labels except "No Finding", which is only positive if all other labels except "Support Devices" are negative or unmentioned. More detail is provided in the CheXpert article.
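A toy version of the three stages might look like the following. The phrase lists here are tiny illustrative stand-ins for CheXpert's actual vocabularies and context rules, which are far richer:

```python
# Stage vocabularies (illustrative only; not CheXpert's actual phrase lists).
MENTIONS = {"pneumothorax": ["pneumothorax", "pneumothoraces", "ptx"]}
NEGATIONS = ["no ", "without ", "resolved"]
UNCERTAIN = ["possible", "may ", "cannot exclude"]

def classify_mention(sentence):
    # Stage 2: classify one mention using local context.
    s = sentence.lower()
    if any(u in s for u in UNCERTAIN):
        return "uncertain"
    if any(n in s for n in NEGATIONS):
        return "negative"
    return "positive"

def label(report, finding="pneumothorax"):
    # Stage 1: extraction - find sentences mentioning the finding.
    mentions = [s for s in report.lower().split(".")
                if any(alt in s for alt in MENTIONS[finding])]
    if not mentions:
        return None  # unmentioned
    classes = {classify_mention(s) for s in mentions}
    # Stage 3: aggregation - positive beats uncertain beats negative.
    for c in ("positive", "uncertain", "negative"):
        if c in classes:
            return c
```

For example, a report containing both "No pneumothorax" and "Possible ptx at the apex" aggregates to uncertain, since uncertain outranks negative.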
MIMIC-CXR v1.0.0 contains:
- A set of 10 files with training set images (each approximately 55 GB) and one file with validation set images (4.2 GB)
- train.csv.gz - a compressed file listing all images in the training set along with useful metadata and labels generated by the CheXpert labeler
- valid.csv.gz - as above, for the validation set
Images are provided in 10 individual archive files created with tar. The organization of the files within each tarball is as follows:
- Each tar file is of the form train_p0[0-9].tar, e.g. train_p00.tar
- These files contain images grouped by the leading digits of the patient identifier, e.g. train_p10.tar contains all patients p10000000 - p10999999
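Because the archives are grouped by patient, individual patients can be located without extracting a whole tarball. The member-path layout assumed in this sketch (a patient directory containing study directories) is illustrative, as are the example identifiers:

```python
import tarfile

def patient_prefix(member_name):
    """Return the patient directory (e.g. 'p10000000') from a member path,
    assuming an illustrative layout 'p<subject_id>/s<study_id>/<image>.jpg'."""
    return member_name.split("/")[0]

def members_for_patient(tar_path, subject):
    """List archive members for one patient without extracting the tarball."""
    with tarfile.open(tar_path) as tf:
        return [m.name for m in tf.getmembers()
                if patient_prefix(m.name) == subject]
```

Scanning members this way reads only the tar headers, which is much cheaper than unpacking 55 GB of images to find one patient.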
The train.csv.gz and valid.csv.gz files are compressed comma-separated value files. The first two columns are:
- path - A relative path to the image described by this row
- view - A simplified view position for the patient: 'frontal', 'lateral', or 'other'.
The remaining columns are labels assigned by the CheXpert labeler:
- No Finding
- Enlarged Cardiomediastinum
- Airspace Opacity
- Lung Lesion
- Pleural Effusion
- Pleural Other
- Support Devices
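Reading the compressed label files requires only the Python standard library. The label encoding assumed in this sketch, '1.0' for positive, '0.0' for negative, '-1.0' for uncertain, and blank for unmentioned, follows the CheXpert conventions and should be verified against the files themselves:

```python
import csv
import gzip

def load_labels(csv_gz_path):
    """Read train.csv.gz / valid.csv.gz into a list of row dictionaries,
    keyed by the column headers (path, view, and the label columns)."""
    with gzip.open(csv_gz_path, "rt", newline="") as f:
        return list(csv.DictReader(f))

def frontal_positives(rows, finding):
    """Paths of frontal images labeled positive for `finding` (assumes the
    CheXpert-style '1.0' encoding for positive labels)."""
    return [r["path"] for r in rows
            if r["view"] == "frontal" and r.get(finding) == "1.0"]
```

For example, `frontal_positives(load_labels("train.csv.gz"), "Pleural Effusion")` would collect the relative paths of frontal radiographs with a positive pleural effusion label.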
Use of the dataset is free to all researchers after signing of a data use agreement which stipulates, among other items, that (1) the user will not share the data, (2) the user will make no attempt to reidentify individuals, and (3) any publication which makes use of the data will also make the relevant code available.
This is the first public release of MIMIC-CXR, which includes (1) JPG formatted image files, and (2) structured labels extracted with a publicly available natural language processing tool.
We would like to acknowledge the Stanford Machine Learning Group and the Stanford AIMI Center for their help in running the CheXpert labeler and for their insight into the work; in particular we would like to thank Jeremy Irvin, Pranav Rajpurkar, and Matthew Lungren. We would also like to acknowledge the Beth Israel Deaconess Medical Center for their continued collaboration.
This work was supported by grant NIH-R01-EB017205 from the National Institutes of Health. The MIT Laboratory for Computational Physiology received funding from Philips Healthcare to create the database described in this paper.
Conflicts of Interest
Philips Healthcare supported the creation of this resource.
- Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
- Ray Smith. An overview of the tesseract OCR engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), volume 2, pages 629–633. IEEE, 2007.
- Farah S Ali, Samantha G Harrington, Stephen B Kennedy, and Sarwat Hussain. Diagnostic radiology in Liberia: a country report. Journal of Global Radiology, 1(2):6, 2015.
- David A Rosman, Jean Jacques Nshizirungu, Emmanuel Rudakemwa, Crispin Moshi, Jean de Dieu Tuyisenge, Etienne Uwimana, and Louise Kalisa. Imaging in the land of 1000 hills: Rwanda radiology country report. Journal of Global Radiology, 1(1):5, 2015.
- Sarah Bastawrous and Benjamin Carney. Improving patient safety: avoiding unread imaging exams in the national VA enterprise electronic health record. Journal of Digital Imaging, 30(3):309–313, 2017.
- Abi Rimmer. Radiologist shortage leaves patient care at risk, warns royal college. BMJ: British Medical Journal (Online), 359, 2017.
- Andrew B Rosenkrantz, Wenyi Wang, Danny R Hughes, and Richard Duszak Jr. A county-level analysis of the US radiologist workforce: physician supply and subspecialty characteristics. Journal of the American College of Radiology, 15(4):601–606, 2018.
- Andrew B Rosenkrantz, Danny R Hughes, and Richard Duszak Jr. The US radiologist workforce: an analysis of temporal and geographic variation by using large national datasets. Radiology, 279(1):175–184, 2015.
Only PhysioNet credentialed users who sign the specified DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0