Database Restricted Access

CheXchoNet: A Chest Radiograph Dataset with Gold Standard Echocardiography Labels

Pierre Elias, Shreyas Bhave

Published: March 20, 2024. Version: 1.0.0

When using this resource, please cite:
Elias, P., & Bhave, S. (2024). CheXchoNet: A Chest Radiograph Dataset with Gold Standard Echocardiography Labels (version 1.0.0). PhysioNet.

Additionally, please cite the original publication:

Bhave S, Rodriguez V, Poterucha T, Mutasa S, Aberle D, Capaccione KM et al. Deep learning to detect left ventricular structural abnormalities in chest X-rays. European Heart Journal 2024; 10.1093/eurheartj/ehad782

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Existing chest radiograph datasets, such as CheXpert and ChestX-ray14, have driven the development of new machine learning approaches to achieve expert or near-expert level performance on a variety of tasks. The primary focus of models developed using these datasets has been to replicate human-level performance by training on labels computationally extracted from radiology reports. We propose a different paradigm: pair an existing diagnostic test with labels from a more accurate, higher-fidelity diagnostic test. This approach asks whether data from a cheaper, lower-fidelity diagnostic test contain enough information to detect pathologies defined by more accurate, gold standard labels. In the context of chest X-rays, a good example is the radiologic comment of cardiomegaly, a catch-all term for an abnormally enlarged heart. Cardiomegaly is known to be poorly predictive of cardiac disease and does not trigger meaningful clinical action. Instead, we can pair chest X-rays with gold standard structural heart disease labels derived from echocardiograms conducted on the same patients. This resource contains 71,589 unique chest X-rays from 24,689 different patients paired with key echocardiography measurements indicative of left ventricular hypertrophy and dilated left ventricle, pathologies which occur during early stage heart failure. The dataset also includes the relative times of the chest X-rays, the age and sex of the patient at the time of recording, and related metadata. It is intended as a resource for the community to build novel approaches to detect clinically actionable labels.


Early detection of structural changes in the heart is crucial for improving outcomes in heart failure patients, but initial signs can be nonspecific, leading to delays in diagnosis [1,2]. Millions of individuals with left ventricular (LV) abnormalities remain undiagnosed, which worsens prognosis [3]. Detecting heart failure earlier, ideally before symptoms manifest, is a key goal in cardiology. While echocardiography is the primary diagnostic tool for LV abnormalities, it's often conducted on a limited patient population with high pre-test probability [4]. In contrast, chest X-rays (CXRs) are cheaper and more widely performed but haven't been extensively used for cardiac pathology detection. Thus, CXRs could serve as a valuable data source to develop scalable screening tools using machine learning methods. The majority of existing datasets pair CXRs with radiology reports, or labels extracted from them. We introduce the first large-scale dataset of its kind, which pairs each chest X-ray with corresponding gold standard information from echocardiograms conducted on the same patient, enabling more specific clinically actionable detection tasks. 


Data Source and Cohort Construction 

All patients who had both a CXR and an echocardiogram conducted at Columbia University Irving Medical Center (CUIMC) from January 2013 to August 2018 were identified. CXRs in their full resolution were extracted in DICOM format and filtered to only include posteroanterior (PA) films. All portable anteroposterior (AP) films were excluded to prevent label leakage, since a model could learn to associate portable films with patients who are more likely to have cardiac pathology. CXR metadata were used to identify demographic information including age and sex. The echocardiograms were accessed through the Syngo Dynamics system (Siemens Healthineers, Malvern, PA). For each echocardiogram, the following continuous measures, taken from the parasternal long-axis view, were extracted using our enterprise data warehouse that stores finalized reports: interventricular septal thickness at end diastole (IVSd), left ventricular internal diameter at end diastole (LVIDd) and left ventricular posterior wall distance at end diastole (LVPWd). Each of these continuous measurements corresponds to a structure of the left ventricle. After extraction, only CXRs for patients with at least one echocardiogram conducted within 12 months (before or after the CXR) were retained in the final dataset. Using the echocardiographic measurements, labels for IVSd, LVIDd, and LVPWd were assigned to each CXR. For any CXR with multiple echocardiograms conducted within 12 months, the maximum of each echocardiographic measurement across all studies was taken. The final dataset contains 71,589 unique CXRs conducted on 24,689 different patients.
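The pairing rule described above (keep a CXR only if the patient has an echocardiogram within 12 months, then take the maximum of each measurement across those echocardiograms) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Cxr` and `Echo` record types, the field names, the use of integer day offsets, and the choice of 365 days for "12 months" are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumption: "within 12 months" is approximated as 365 days either side.
WINDOW_DAYS = 365


@dataclass
class Echo:
    patient_id: str
    day: int        # study day on any consistent per-patient timeline
    ivsd: float     # cm
    lvidd: float    # cm
    lvpwd: float    # cm


@dataclass
class Cxr:
    cxr_id: str
    patient_id: str
    day: int


def label_cxrs(cxrs, echos, window_days=WINDOW_DAYS):
    """Return {cxr_id: {"ivsd", "lvidd", "lvpwd"}} for CXRs with a nearby echo."""
    echos_by_patient = {}
    for echo in echos:
        echos_by_patient.setdefault(echo.patient_id, []).append(echo)

    labels = {}
    for cxr in cxrs:
        # Echocardiograms before or after the CXR both count.
        nearby = [
            e for e in echos_by_patient.get(cxr.patient_id, [])
            if abs(e.day - cxr.day) <= window_days
        ]
        if not nearby:
            continue  # CXR is dropped: no echocardiogram inside the window
        labels[cxr.cxr_id] = {
            "ivsd": max(e.ivsd for e in nearby),
            "lvidd": max(e.lvidd for e in nearby),
            "lvpwd": max(e.lvpwd for e in nearby),
        }
    return labels
```

Note that the three maxima are taken independently, so the labels attached to one CXR may come from different echocardiograms.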


IVSd, LVIDd and LVPWd are continuous labels measured in centimeters. They represent measurements of different structures in the left ventricle of the heart: LVIDd measures the overall size of the left ventricle, LVPWd measures the thickness of the posterior wall of the left ventricle, and IVSd measures the septal wall thickness (i.e., the wall of tissue that separates the heart's right and left sides). These measurements are used to diagnose left ventricular structural abnormalities. If they exceed the sex-specific normal ranges, a patient is diagnosed with a specific kind of left ventricular structural abnormality, namely severe left ventricular hypertrophy (SLVH) or dilated left ventricle (DLV). Using thresholds from the clinical guidelines [5], we converted these continuous measurements to binary disease diagnoses for SLVH, DLV and a Composite label (which represents the presence of either SLVH or DLV).
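The conversion from continuous measurements to binary labels can be sketched as below. The cut-off values in `EXAMPLE_THRESHOLDS` are placeholders chosen only so the example runs; they are not the guideline thresholds, which are not stated in this description. The sex coding ("M"/"F") and the rule that SLVH is flagged when either wall measurement exceeds its cut-off are likewise assumptions. Only the Composite definition (SLVH or DLV) comes directly from the text.

```python
# Placeholder cut-offs in centimeters, for illustration only.
# These are NOT the guideline values; substitute the published thresholds.
EXAMPLE_THRESHOLDS = {
    "M": {"wall_cm": 1.5, "lvidd_cm": 6.0},
    "F": {"wall_cm": 1.4, "lvidd_cm": 5.5},
}


def binarize_labels(ivsd, lvpwd, lvidd, sex, thresholds):
    """Map continuous echo measurements (cm) to binary SLVH/DLV/Composite labels."""
    cutoffs = thresholds[sex]
    # Assumption: either wall measurement above its cut-off flags SLVH.
    slvh = ivsd > cutoffs["wall_cm"] or lvpwd > cutoffs["wall_cm"]
    dlv = lvidd > cutoffs["lvidd_cm"]
    return {
        "slvh": int(slvh),
        "dlv": int(dlv),
        "composite": int(slvh or dlv),  # either pathology present
    }
```

Because the thresholds are passed in as a parameter, the same function can be reused with other cut-offs to derive alternative labels from the continuous values.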

Processing Steps

We applied several standard preprocessing procedures to make the data tractable in size and easy to use for downstream applications. CXRs were downsampled to 224-by-224 pixels using bicubic interpolation so that all images have the same size. To improve contrast, contrast-limited adaptive histogram equalization (CLAHE) was applied to each image. The images were then saved in JPEG format.

De-identification Steps

Each image has associated metadata, which we provide in a separate file. We ran a de-identification procedure on the metadata of each image. Patient identifiers are hashed to a 32-character sequence. A random anchor time is generated for each patient. Using this hidden anchor time, a time shift is provided in days so that the time between CXRs for the same patient may be computed. Finally, a de-identified file name is generated for the jpg file, consisting of the 32-character patient identifier and an 8-character file identifier. Ages are truncated to be between 18 and 90.
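The de-identification steps above can be sketched with the Python standard library. The specific choices here are assumptions, since the description does not name them: a keyed HMAC-SHA256 digest truncated to 32 hex characters, an underscore between the patient and file identifiers, a random hex string as the file identifier, and the range from which the anchor date is drawn. All function names are hypothetical.

```python
import hashlib
import hmac
import random
import secrets
from datetime import date, timedelta


def hash_patient_id(raw_id, secret_key):
    """Keyed hash of a patient identifier, truncated to 32 hex characters."""
    digest = hmac.new(secret_key, raw_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:32]


def random_anchor(rng):
    """Draw a hidden per-patient anchor date (the date range is an assumption)."""
    return date(1900, 1, 1) + timedelta(days=rng.randrange(0, 36500))


def days_from_anchor(study_date, anchor):
    """Time shift in days; differences between studies are preserved."""
    return (study_date - anchor).days


def make_file_name(patient_hash):
    """32-character patient identifier plus an 8-character file identifier."""
    return f"{patient_hash}_{secrets.token_hex(4)}.jpg"


def clamp_age(age):
    """Truncate ages to the interval [18, 90]."""
    return min(max(age, 18), 90)
```

Because every study of a patient is shifted by the same hidden anchor, subtracting two shifted values recovers the true interval between the studies without revealing either date.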

To ensure that no protected health information (PHI) was present in individual images, we manually reviewed the processed versions of the images. We confirmed that no PHI was present in any image, including images showing pacemakers or implantable cardioverter-defibrillators (ICDs).

Data Description

The data consists of a metadata file and a directory which contains all of the chest X-rays. 

The metadata file contains a row for each chest X-ray with corresponding information. It contains a patient identifier, a shifted time to compare the relative times between chest X-rays conducted on the same patient and the path to the filename of the corresponding jpg image. The file names for each jpg consist of the patient identifier concatenated with a unique file-specific hash. 

In addition, this metadata file contains columns for pixel spacing of the image, age/sex of patient, the continuous echocardiographic data (ivsd, lvpwd, lvidd), and binary labels for specific cardiac pathologies (slvh, dlv, composite slvh/dlv). Finally, there are labels for whether the patient has received a heart/lung transplant in the past, or has had a pacemaker/icd implanted. 
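As an illustration of how the shifted times can be used, the sketch below groups metadata rows by patient and computes the number of days between consecutive chest X-rays. The column names (`patient_id`, `time_offset_days`, `cxr_path`) and the assumption that the metadata file is a CSV are hypothetical; check the header of the released file and adjust the arguments accordingly.

```python
import csv
from collections import defaultdict


def load_rows(path):
    """Read a CSV metadata file into a list of dictionaries (format assumed)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def days_between_cxrs(rows, id_col="patient_id",
                      time_col="time_offset_days", path_col="cxr_path"):
    """Return {patient_id: [gap in days between consecutive CXRs, ...]}."""
    studies_by_patient = defaultdict(list)
    for row in rows:
        studies_by_patient[row[id_col]].append(
            (float(row[time_col]), row[path_col])
        )

    gaps = {}
    for patient_id, studies in studies_by_patient.items():
        studies.sort()  # order each patient's studies by shifted time
        gaps[patient_id] = [
            later[0] - earlier[0]
            for earlier, later in zip(studies, studies[1:])
        ]
    return gaps
```

Only differences within one patient are meaningful, because each patient has a separate hidden anchor time.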

Usage Notes

In the companion paper to this data release, we built a method to assess how well a deep learning model could detect three pathology labels (SLVH, DLV, Composite SLVH/DLV) and demonstrated that the model could achieve strong performance. Users of this data may train models using the same pathology labels as we did, or even create their own labels using the continuous echocardiography values. These values may be mapped to other pathologies as well, such as left ventricular hypertrophy (i.e., mild/moderate form of the disease). Please see the project website for additional information.

Release Notes

1.0.0 Initial release of the dataset.


The Institutional Review Board of Columbia University Medical Center granted approval for the project (IRB-AAAS7388). The need for individual patient consent was waived, as the project had no bearing on clinical care, and all relevant patient data had been previously collected. Consequently, the study posed minimal risk to the participants.

Conflicts of Interest

All authors report no conflicts of interest.


  1. d'Arcy, J. L., Coffey, S., Loudon, M. A., Kennedy, A., Pearson-Stuttard, J., Birks, J., ... & Prendergast, B. D. (2016). Large-scale community echocardiographic screening reveals a major burden of undiagnosed valvular heart disease in older people: the OxVALVE Population Cohort Study. European heart journal, 37(47), 3515-3522.
  2. Lang, R. M., Badano, L. P., Mor-Avi, V., Afilalo, J., Armstrong, A., Ernande, L., Flachskampf, F. A., Foster, E., Goldstein, S. A., Kuznetsova, T., Lancellotti, P., et al. (2015). Recommendations for cardiac chamber quantification by echocardiography in adults: an update from the American Society of Echocardiography and the European Association of Cardiovascular Imaging. European Heart Journal-Cardiovascular Imaging, 16(3), 233-271.
  3. Maron, M. S., Hellawell, J. L., Lucove, J. C., Farzaneh-Far, R., & Olivotto, I. (2016). Occurrence of clinically diagnosed hypertrophic cardiomyopathy in the United States. The American journal of cardiology, 117(10), 1651-1654.
  4. Alexander, K. M., Orav, J., Singh, A., Jacob, S. A., Menon, A., Padera, R. F., ... & Dorbala, S. (2018). Geographic disparities in reported US amyloidosis mortality from 1979 to 2015: potential underdetection of cardiac amyloidosis. JAMA cardiology, 3(9), 865-870.
  5. Cook, C. H., Praba, A. C., Beery, P. R., & Martin, L. C. (2002). Transthoracic echocardiography is not cost-effective in critically ill surgical patients. Journal of Trauma and Acute Care Surgery, 52(2), 280-284.


Access Policy:
Only registered users who sign the specified data use agreement can access the files.

License (for files):
PhysioNet Restricted Health Data License 1.5.0

Data Use Agreement:
PhysioNet Restricted Health Data Use Agreement 1.5.0
