Database Open Access
PTB-XL, a large publicly available electrocardiography dataset
Patrick Wagner , Nils Strodthoff , Ralf-Dieter Bousseljot , Wojciech Samek , Tobias Schaeffter
Published: April 24, 2020. Version: 1.0.1
When using this resource, please cite:
Wagner, P., Strodthoff, N., Bousseljot, R., Samek, W., & Schaeffter, T. (2020). PTB-XL, a large publicly available electrocardiography dataset (version 1.0.1). PhysioNet. https://doi.org/10.13026/x4td-x982.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Electrocardiography (ECG) is a key diagnostic tool to assess the cardiac condition of a patient. Automatic ECG interpretation algorithms used as diagnosis support systems promise substantial relief for medical personnel, if only because of the sheer number of ECGs that are routinely taken. However, the development of such algorithms requires large training datasets and clear benchmark procedures. In our opinion, neither aspect is covered satisfactorily by existing freely accessible ECG datasets.
The PTB-XL ECG dataset is a large dataset of 21837 clinical 12-lead ECGs of 10 seconds length from 18885 patients. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. The 71 different ECG statements conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements. To ensure comparability of machine learning algorithms trained on the dataset, we provide recommended splits into training and test sets. In combination with the extensive annotation, this turns the dataset into a rich resource for the training and evaluation of automatic ECG interpretation algorithms. The dataset is complemented by extensive metadata on demographics, infarction characteristics, likelihoods for diagnostic ECG statements, and annotated signal properties.
Background
The waveform data underlying the PTB-XL ECG dataset was collected with devices from Schiller AG over the course of nearly seven years between October 1989 and June 1996. With the acquisition of the original database from Schiller AG, the full usage rights were transferred to the PTB. The records were curated and converted into a structured database within a long-term project at the Physikalisch-Technische Bundesanstalt (PTB). The database was used in a number of publications, see e.g. [1,2], but access remained restricted until now. The Institutional Ethics Committee approved the publication of the anonymous data in an open-access database (PTB-2020-1). During the public release process in 2019, the existing database was streamlined with particular regard to usability and accessibility for the machine learning community. Waveform data and metadata were converted to open data formats that can easily be processed by standard software.
Methods
Data Acquisition
1. Raw signal data was recorded and stored in a proprietary compressed format. For all signals, we provide the standard set of 12 leads (I, II, III, AVL, AVR, AVF, V1, ..., V6) with reference electrodes on the right arm.
2. The corresponding general metadata (such as age, sex, weight and height) was collected in a database.
3. Each record was annotated with a report string (generated by a cardiologist or by automatic interpretation on the ECG device), which was converted into a standardized set of SCP-ECG statements (scp_codes). For most records, the heart’s axis (heart_axis) and the infarction stage (infarction_stadium1 and infarction_stadium2, if present) were also extracted.
4. A large fraction of the records was validated by a second cardiologist.
5. All records were validated by a technical expert focusing mainly on signal characteristics.
Data Preprocessing
ECGs and patients are identified by unique identifiers (ecg_id and patient_id). Personal information in the metadata, such as the names of validating cardiologists and nurses and the recording site (hospital etc.), was pseudonymized. The date of birth is given only as the age at the time of the ECG recording, where ages of more than 89 years appear in the range of 300 years in compliance with HIPAA standards. Furthermore, all ECG recording dates were shifted by a random offset for each patient. The ECG statements used for annotating the records follow the SCP-ECG standard [3].
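The age convention above matters for any statistic computed on the age column: a value above 89 is a masking placeholder, not a real age. The following is a minimal sketch, assuming ages have already been read as numbers; the function name and the sample values are made up for illustration.

```python
from statistics import median

# Ages above this threshold are masked in the metadata (see text above).
MAX_UNMASKED_AGE = 89


def split_masked_ages(ages):
    """Separate real ages from masked placeholder entries.

    Returns the list of unmasked ages and the number of masked entries.
    """
    real = [age for age in ages if age <= MAX_UNMASKED_AGE]
    n_masked = len(ages) - len(real)
    return real, n_masked


# Made-up example values; 300 stands in for a masked age.
ages = [54, 300, 71, 62, 300, 19]
real_ages, n_masked = split_masked_ages(ages)
print("median of unmasked ages:", median(real_ages))
print("masked entries:", n_masked)
```

Whether masked entries should be dropped or mapped to a category such as "90 or older" depends on the analysis; the sketch only makes them explicit.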
Data Description
In general, the dataset is organized as follows:
ptbxl
├── ptbxl_database.csv
├── scp_statements.csv
├── records100
│   ├── 00000
│   │   ├── 00001_lr.dat
│   │   ├── 00001_lr.hea
│   │   ├── ...
│   │   ├── 00999_lr.dat
│   │   └── 00999_lr.hea
│   ├── ...
│   └── 21000
│       ├── 21001_lr.dat
│       ├── 21001_lr.hea
│       ├── ...
│       ├── 21837_lr.dat
│       └── 21837_lr.hea
└── records500
    ├── 00000
    │   ├── 00001_hr.dat
    │   ├── 00001_hr.hea
    │   ├── ...
    │   ├── 00999_hr.dat
    │   └── 00999_hr.hea
    ├── ...
    └── 21000
        ├── 21001_hr.dat
        ├── 21001_hr.hea
        ├── ...
        ├── 21837_hr.dat
        └── 21837_hr.hea
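
The layout above suggests that a record path can be derived from its ecg_id. The sketch below assumes that records are grouped into folders of 1000 ids named after the zero-padded lower bound; this rule is inferred from the tree, and `record_path` is a hypothetical helper. The filename_hr and filename_lr columns of ptbxl_database.csv remain the authoritative source for paths.

```python
from pathlib import PurePosixPath


def record_path(ecg_id: int, sampling_rate: int = 100) -> str:
    """Build the relative record path (without file extension) for an ECG id.

    Assumption: folder name = ecg_id rounded down to a multiple of 1000,
    zero-padded to five digits, as the directory tree suggests.
    """
    if sampling_rate == 100:
        folder, suffix = "records100", "lr"
    elif sampling_rate == 500:
        folder, suffix = "records500", "hr"
    else:
        raise ValueError("only 100 Hz and 500 Hz versions are provided")
    group = (ecg_id // 1000) * 1000
    return str(PurePosixPath(folder) / f"{group:05d}" / f"{ecg_id:05d}_{suffix}")


print(record_path(1))           # records100/00000/00001_lr
print(record_path(21837, 500))  # records500/21000/21837_hr
```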
The dataset comprises 21837 clinical 12-lead ECG records of 10 seconds length from 18885 patients, of whom 52% are male and 48% are female, with ages covering the whole range from 0 to 95 years (median 62, interquartile range 22). The value of the dataset results from the comprehensive collection of many different co-occurring pathologies, but also from a large proportion of healthy control samples. The distribution of diagnoses is as follows, where for simplicity we restrict ourselves to diagnostic statements aggregated into superclasses (note: the sum of statements exceeds the number of records because a record can carry multiple labels):
#Records | Superclass | Description |
---|---|---|
9528 | NORM | Normal ECG |
5486 | MI | Myocardial Infarction |
5250 | STTC | ST/T Change |
4907 | CD | Conduction Disturbance |
2655 | HYP | Hypertrophy |
The waveform files are stored in WaveForm DataBase (WFDB) format with 16 bit precision at a resolution of 1μV/LSB and a sampling frequency of 500Hz (records500/). For the user’s convenience we also release downsampled versions of the waveform data at a sampling frequency of 100Hz (records100/).
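The figures above fix the expected shape of each record and the scale of the stored integers. The short sketch below works through that arithmetic; the constant and function names are illustrative. Readers such as the wfdb Python package usually return signals already scaled to physical units, so the manual conversion is shown only to make the stated resolution concrete.

```python
RECORD_SECONDS = 10     # record length stated in the description
N_LEADS = 12            # standard 12-lead ECG
MICROVOLT_PER_LSB = 1   # resolution stated above


def samples_per_lead(sampling_rate_hz: int) -> int:
    """Number of samples per lead for one 10-second record."""
    return RECORD_SECONDS * sampling_rate_hz


def raw_to_millivolt(raw_value: int) -> float:
    """Convert a stored integer sample to millivolts."""
    return raw_value * MICROVOLT_PER_LSB / 1000.0


for fs in (500, 100):
    print(fs, "Hz -> shape", (samples_per_lead(fs), N_LEADS))
print(raw_to_millivolt(1500), "mV")
```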
All relevant metadata is stored in ptbxl_database.csv with one row per record identified by ecg_id. It contains 28 columns that can be categorized into:
1. Identifiers: Each record is identified by a unique ecg_id. The corresponding patient is encoded via patient_id. The paths to the original record (500 Hz) and a downsampled version of the record (100 Hz) are stored in filename_hr and filename_lr.
2. General Metadata: demographic and recording metadata such as age, sex, height, weight, nurse, site, device and recording_date.
3. ECG statements: core components are scp_codes (SCP-ECG statements as a dictionary with entries of the form statement: likelihood, where likelihood is set to 0 if unknown) and report (report string). Additional fields are heart_axis, infarction_stadium1, infarction_stadium2, validated_by, second_opinion, initial_autogenerated_report and validated_by_human.
4. Signal Metadata: signal quality such as noise (static_noise and burst_noise), baseline drifts (baseline_drift) and other artifacts such as electrodes_problems. We also provide extra_beats for counting extra systoles and pacemaker for signal patterns indicating an active pacemaker.
5. Cross-validation Folds: recommended 10-fold train-test splits (strat_fold) obtained via stratified sampling while respecting patient assignments, i.e. all records of a particular patient were assigned to the same fold. Records in folds 9 and 10 underwent at least one human evaluation and are therefore of particularly high label quality. We therefore propose to use folds 1-8 as training set, fold 9 as validation set and fold 10 as test set.
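
The column descriptions above can be illustrated with a small sketch that decodes scp_codes and applies the proposed fold assignment. It uses only the standard library and a made-up four-row excerpt; the rows, `load_rows` and `split_by_fold` are assumptions for illustration, and only the column names come from the description.

```python
import ast
import csv
import io

# Tiny made-up excerpt using column names from ptbxl_database.csv.
SAMPLE_CSV = """ecg_id,patient_id,scp_codes,strat_fold
1,101,"{'NORM': 100.0, 'SR': 0.0}",3
2,102,"{'IMI': 80.0}",9
3,103,"{'NORM': 100.0}",10
4,101,"{'LVH': 50.0, 'SR': 0.0}",3
"""

TRAIN_FOLDS = range(1, 9)  # folds 1-8
VAL_FOLD = 9
TEST_FOLD = 10


def load_rows(handle):
    """Read metadata rows and decode the stringified scp_codes dictionary."""
    rows = []
    for row in csv.DictReader(handle):
        row["ecg_id"] = int(row["ecg_id"])
        row["strat_fold"] = int(row["strat_fold"])
        # literal_eval parses the dictionary text without executing code.
        row["scp_codes"] = ast.literal_eval(row["scp_codes"])
        rows.append(row)
    return rows


def split_by_fold(rows):
    """Assign each row to train/val/test according to strat_fold."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        fold = row["strat_fold"]
        if fold in TRAIN_FOLDS:
            splits["train"].append(row)
        elif fold == VAL_FOLD:
            splits["val"].append(row)
        elif fold == TEST_FOLD:
            splits["test"].append(row)
        else:
            raise ValueError(f"unexpected fold {fold}")
    return splits


rows = load_rows(io.StringIO(SAMPLE_CSV))
splits = split_by_fold(rows)
print({name: [r["ecg_id"] for r in part] for name, part in splits.items()})
```

For the real file, an open file handle (for example `open("ptbxl/ptbxl_database.csv", newline="")`) would take the place of the in-memory string. Because folds respect patient assignments, both records of the made-up patient 101 land in the same split.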
All information related to the used annotation scheme is stored in a dedicated scp_statements.csv that was enriched with mappings to other annotation standards such as AHA, aECGREFID, CDISC and DICOM. We provide additional side-information such as the category each statement can be assigned to (diagnostic, form and/or rhythm). For diagnostic statements, we also provide a proposed hierarchical organization into diagnostic_class and diagnostic_subclass.
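The hierarchy described above can be used to aggregate a record's statements into superclasses. The sketch below assumes the statement-to-diagnostic_class mapping has already been read from scp_statements.csv into a dictionary; the three entries shown are an illustrative excerpt, and `to_superclasses` is a hypothetical helper.

```python
# Illustrative excerpt of a statement -> diagnostic_class mapping.
# Statements without a diagnostic class (form or rhythm only) are absent.
DIAGNOSTIC_CLASS = {
    "NORM": "NORM",
    "IMI": "MI",
    "LVH": "HYP",
}


def to_superclasses(scp_codes):
    """Return the sorted, de-duplicated superclasses for one record."""
    return sorted(
        {DIAGNOSTIC_CLASS[code] for code in scp_codes if code in DIAGNOSTIC_CLASS}
    )


# A record may carry several labels; non-diagnostic codes are skipped.
print(to_superclasses({"IMI": 80.0, "LVH": 50.0, "SR": 0.0}))
```

Because one record can map to several superclasses, counts per superclass summed over all records can exceed the number of records.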
Usage Notes
In example_physionet.py we provide a minimal usage example that shows how to load waveform data (numpy-arrays X_train and X_test) and labels (y_train and y_test) making use of the proposed train-test split. For illustration, we use diagnostic subclass statements as labels based on the assignments in scp_statements.csv.
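The following is not the content of example_physionet.py but a minimal sketch of the waveform-loading step it describes. It assumes the third-party wfdb package, whose `rdsamp` function returns a signal array and a metadata dictionary; `load_signals` and its `reader` parameter are hypothetical and exist so that the reader can be swapped out.

```python
import os


def load_signals(record_paths, base_dir, reader=None):
    """Read the signal array of each record.

    record_paths: relative record paths as stored in filename_lr or
        filename_hr (no file extension).
    base_dir: directory containing the dataset.
    reader: callable returning (signal, metadata) for a record path;
        defaults to wfdb.rdsamp.
    """
    if reader is None:
        import wfdb  # third-party package, imported only when needed
        reader = wfdb.rdsamp
    signals = []
    for rel_path in record_paths:
        signal, _metadata = reader(os.path.join(base_dir, rel_path))
        signals.append(signal)
    return signals


# Usage sketch (requires the downloaded dataset and the wfdb package):
# signals = load_signals(["records100/00000/00001_lr"], "ptbxl")
```

Stacking the returned list with numpy would give one array per split, with shape (records, samples, leads).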
Release Notes
1.0.1 Fixed mismatching IDs between waveform data and metadata. Same content as the original release.
1.0.0 Initial release of the dataset.
Acknowledgements
We thank Dr. Lothar Schmitz for numerous annotations and providing medical expertise and Dr. Hans Koch for actuating and managing the preparation of the original database. This work was supported by the Bundesministerium für Bildung und Forschung (BMBF) through the Berlin Big Data Center under Grant 01IS14013A and the Berlin Center for Machine Learning under Grant 01IS18037I and by the EMPIR project 18HLT07 MedalCare. The EMPIR initiative is cofunded by the European Union's Horizon 2020 research and innovation program and the EMPIR Participating States.
Conflicts of Interest
References
- Bousseljot, R., Kreiseler, D. (2000). "Waveform recognition with 10,000 ECGs". Computers in Cardiology 27, 331–334.
- ISO Central Secretary (2009). "Health informatics – Standard communication protocol – Part 91064: Computer-assisted electrocardiography". Standard ISO 11073-91064:2009, International Organization for Standardization, Geneva.
- Bousseljot, R., Kreiseler, D., Schnabel, A. (1995). "Nutzung der EKG-Signaldatenbank CARDIODAT der PTB über das Internet". Biomedizinische Technik/Biomedical Engineering 317–318.
Access
Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.
License (for files):
Creative Commons Attribution 4.0 International Public License
Discovery
DOI (version 1.0.1):
https://doi.org/10.13026/x4td-x982
DOI (latest version):
https://doi.org/10.13026/6sec-a640
Topics:
ptb-xl
ptb
ecg
electrocardiography
Corresponding Author
Files
Total uncompressed size: 3.0 GB.
Access the files
- Download the ZIP file (1.7 GB)
- Access the files using the Google Cloud Storage Browser here. Login with a Google account is required.
- Access the data using the Google Cloud command line tools (please refer to the gsutil documentation for guidance):
  gsutil -m -u YOUR_PROJECT_ID cp -r gs://ptb-xl-1.0.1.physionet.org DESTINATION
- Download the files using your terminal:
  wget -r -N -c -np https://physionet.org/files/ptb-xl/1.0.1/
- Download the files using AWS command line tools:
  aws s3 sync --no-sign-request s3://physionet-open/ptb-xl/1.0.1/ DESTINATION