Database Open Access

PTB-XL+, a comprehensive electrocardiographic feature dataset

Nils Strodthoff, Temesgen Mehari, Tobias Schaeffter

Published: April 24, 2023. Version: 1.0.0


When using this resource, please cite:
Strodthoff, N., Mehari, T., & Schaeffter, T. (2023). PTB-XL+, a comprehensive electrocardiographic feature dataset (version 1.0.0). PhysioNet. https://doi.org/10.13026/m5qc-8m53.

Additionally, please cite the original publication:

Strodthoff, N., Mehari, T., Nagel, C. et al. PTB-XL+, a comprehensive electrocardiographic feature dataset. Sci Data 10, 279 (2023). https://doi.org/10.1038/s41597-023-02153-8

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

The PTB-XL+ dataset is a comprehensive feature dataset that supplements the PTB-XL ECG dataset. It includes ECG features extracted via two commercial and one open-source algorithm in a harmonized format as well as median beats or, where applicable, fiducial points extracted by the respective algorithms. In addition, it provides automatic diagnosis statements from one commercial ECG analysis algorithm at different processing levels that are ready to be used for training and evaluation of machine learning models.


Background

The importance of machine learning (ML), and most recently of deep learning methods in particular, in the field of automatic ECG analysis is growing steadily, supported by the release of large public datasets. However, current datasets lack important metadata such as ECG features, which have been developed over the last hundred years and still form the basis of most automated ECG analysis algorithms built into ECG devices as well as of cardiologists' decision rules. Such ECG features, although available in sophisticated commercial software, are not accessible to the general public. To address this issue, we release ECG features from two leading commercial algorithms and from an open-source implementation. In addition, we publish a set of automatic diagnostic statements from a commercial ECG analysis software at different processing levels, which are directly applicable for training and evaluation of ML models, even in direct comparison to the original human annotations. This supplementary dataset will decisively improve the usability of the PTB-XL dataset [1, 2] and help turn it into a reference dataset for machine learning methods related to ECG data.


Methods

We leverage ECG features from three sources: two commercial state-of-the-art ECG analysis packages, the University of Glasgow ECG Analysis Program (Uni-G) and GE Healthcare's Marquette™ 12SL™ (12SL), which are distributed in millions of ECG devices worldwide, and one open-source package (ECGDeli). The commercial algorithms follow a similar approach. The first step consists of the calculation of a median beat. Most features are extracted from this median beat and are then used to predict diagnostic statements; see [3] for details on the Uni-G approach and [4] for details on the 12SL™ algorithm. However, while the PTB-XL+ dataset contains the full feature sets of both commercial algorithms, it does not include automatic diagnosis statements from Uni-G, due to usage restrictions. Both feature extraction algorithms are closed source and only accessible with special equipment. Nevertheless, the decision rules followed by the 12SL™ algorithm are available from the Physician’s Guide [4].
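
As a rough illustration of the median-beat idea only (not the Uni-G or 12SL™ procedure, both of which are closed source), the following sketch takes the sample-wise median over beats that are assumed to be already detected, aligned and of equal length. The function name `median_beat` and the toy values are hypothetical.

```python
from statistics import median


def median_beat(beats):
    """Return the sample-wise median of equally long, pre-aligned beats.

    `beats` is a list of beats; each beat is a list of amplitudes in mV.
    Beat detection and alignment are assumed to have happened already.
    """
    if not beats:
        raise ValueError("at least one beat is required")
    length = len(beats[0])
    if any(len(beat) != length for beat in beats):
        raise ValueError("all beats must have the same number of samples")
    # zip(*beats) groups the k-th sample of every beat together.
    return [median(samples) for samples in zip(*beats)]


# Three toy beats; the third has an artifact in its second sample.
toy_beats = [
    [0.0, 1.0, 0.2],
    [0.1, 1.1, 0.1],
    [0.0, 5.0, 0.2],
]
toy_median = median_beat(toy_beats)
```

Because the median is taken per sample, the artifact in the third toy beat does not carry over into the result. This robustness is a common motivation for extracting features from a median beat rather than from individual beats.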

The open-source package ECGDeli [5] follows a different approach. First, fiducial points of the signal are determined, which are then further processed to obtain the features. The ECG features are computed directly for each available beat. Although the software is publicly available, its execution relies on the proprietary software MATLAB, which limits its range of potential users. In the dataset, we report only the median and the interquartile range across beats as well as the total count of beats that were considered for each respective feature. The package authors cannot guarantee the generalizability of these algorithms to a wide range of pathologies, yet the features for the full dataset have been included for completeness.
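
The per-record summary described above (median, interquartile range and beat count) can be sketched as follows. This is a minimal illustration, not ECGDeli code: the function name `summarize_beats`, the use of `None` for beats without a measurement and the quantile convention (`method="inclusive"`) are all assumptions, and a different quantile convention would give slightly different interquartile ranges.

```python
from statistics import median, quantiles


def summarize_beats(values):
    """Summarize one feature measured on every beat of a record.

    Returns (median, interquartile range, number of beats used).
    Beats where the feature could not be measured are passed as None
    and skipped.
    """
    used = [v for v in values if v is not None]
    if len(used) < 2:
        # quantiles() needs at least two data points.
        return (used[0] if used else None, None, len(used))
    q1, _, q3 = quantiles(used, n=4, method="inclusive")
    return (median(used), q3 - q1, len(used))


# Hypothetical per-beat interval measurements in ms; one beat failed.
per_beat_ms = [100.0, 102.0, None, 98.0, 104.0, 96.0]
summary = summarize_beats(per_beat_ms)
```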


Data Description

The ECG features as the core of the dataset are supplemented by further metadata such as median beats or fiducial points and automatic diagnostic statements provided by one of the most widely used commercial ECG analysis algorithms, the Marquette™ 12SL™ algorithm.

ECG Features

To allow users to use the different feature sets interchangeably, we mapped the features to a common naming scheme and converted them into compatible units, using `mV` for amplitudes and `ms` for intervals as base units. For each feature set, we provide the features in tabular form in the CSV files ./features/12sl_features.csv, ./features/unig_features.csv and ./features/ecgdeli_features.csv with a single row per ECG record, a column for the PTB-XL ECG identifier and a column for each ECG feature. The features themselves are listed and described in ./features/feature_description.csv along with mappings to standardized LOINC IDs, where available.
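
Because all three files share one layout, a single loader can serve every feature set. The sketch below uses only the Python standard library and runs on small in-memory stand-ins rather than the real files. The identifier column name `ecg_id` and the feature names in the toy data are assumptions; the actual names should be checked against the file headers and ./features/feature_description.csv.

```python
import csv
import io


def load_features(handle, id_column="ecg_id"):
    """Read a feature table into {ecg_id: {feature_name: value or None}}.

    `id_column` is an assumed name for the PTB-XL identifier column;
    check the header of the actual file. Empty cells become None.
    """
    table = {}
    for row in csv.DictReader(handle):
        ecg_id = int(row.pop(id_column))
        table[ecg_id] = {
            name: float(cell) if cell and cell.strip() else None
            for name, cell in row.items()
        }
    return table


# Tiny stand-ins for two of the feature files (hypothetical columns).
# With the real data: open("./features/12sl_features.csv", newline="")
file_a = io.StringIO("ecg_id,QRS_Dur_Global,R_Amp_V1\n1,92,0.31\n2,,0.18\n")
file_b = io.StringIO("ecg_id,QRS_Dur_Global,R_Amp_V1\n2,96,0.20\n3,88,0.25\n")

features_a = load_features(file_a)
features_b = load_features(file_b)

# Records present in both feature sets can be compared column by column.
shared_ids = sorted(features_a.keys() & features_b.keys())
```

Keying every table by the PTB-XL identifier makes it straightforward to join feature sets with each other or with the label files.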

Median Beats

The median beats obtained by the two commercial algorithms Uni-G and 12SL™ are stored in the WaveForm DataBase (WFDB) format with 16-bit precision at a resolution of 1 μV/LSB and a sampling frequency of 500 Hz in the directories ./median_beats/unig/ and ./median_beats/12sl/, where the filenames contain the corresponding PTB-XL ECG identifiers.
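
The stated resolution and sampling frequency translate raw sample values into the base units used for the features (mV and ms). The sketch below shows only this arithmetic on made-up sample values; the helper names are hypothetical, and it does not read WFDB files.

```python
SAMPLING_RATE_HZ = 500   # stated sampling frequency
MICROVOLT_PER_LSB = 1    # stated resolution


def to_millivolt(raw_samples):
    """Convert raw integer samples (1 µV per step) to mV."""
    return [value * MICROVOLT_PER_LSB / 1000 for value in raw_samples]


def sample_times_ms(n_samples):
    """Return the time of each sample in ms, starting at 0."""
    return [index * 1000 / SAMPLING_RATE_HZ for index in range(n_samples)]


raw = [0, 250, 1000, -500]             # hypothetical raw values
amplitudes_mv = to_millivolt(raw)      # [0.0, 0.25, 1.0, -0.5]
times_ms = sample_times_ms(len(raw))   # [0.0, 2.0, 4.0, 6.0]
```

At 500 Hz, consecutive samples are 2 ms apart. Assuming signed 16-bit samples, a step of 1 μV limits representable amplitudes to roughly ±32.767 mV. Libraries that read WFDB records typically apply the amplitude scaling themselves, so the manual conversion matters mainly when working with the raw integer values.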

Fiducial points

The fiducial points were obtained by ECGDeli [5] and can be found in ./fiducial_points/ecgdeli. Each file name starts with the corresponding PTB-XL ECG identifier and ends with the respective lead. The data are stored in WFDB-compatible annotation files.

Diagnostic statements

The diagnostic statements are stored in ./labels/: ./labels/12sl_statements.csv contains the diagnostic labels obtained from 12SL™ (statements: original statements; statements_ext: statements separated into primary and supportive statements; statements_ext_snomed: statements_ext after mapping to SNOMED according to ./labels/mapping/12slv23ToSNOMED.csv and up-propagation in the SNOMED label hierarchy). For completeness, we provide an analogous file for the original PTB-XL statements in ./labels/ptbxl_statements.csv (scp_codes: original annotations; scp_codes_ext: scp_codes supplemented by other metadata available as separate columns in the PTB-XL dataset such as acute/old myocardial infarction or heart axis; scp_codes_ext_snomed: scp_codes_ext after mapping to SNOMED according to ./labels/mapping/ptbxlToSNOMED.csv and up-propagation in the SNOMED label hierarchy). A description of the SNOMED statements, along with a column indicating which statements should be used for model training/evaluation, i.e., after removing overly unspecific or perfectly correlated statements, is provided in ./labels/snomed_description.csv. We stress again that this makes it possible, for the first time, to train and evaluate models on 12SL statements directly and to compare them directly to the human PTB-XL annotations. To allow users to modify and reapply the label mapping (from ./labels/mapping/12slv23ToSNOMED.csv and ./labels/mapping/ptbxlToSNOMED.csv), we provide a Python script under ./labels/mapping/apply_snomed_mapping.py.
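
Up-propagation in a label hierarchy means that every assigned statement also activates all of its ancestors, so that a specific finding counts as an instance of the broader categories above it. The sketch below illustrates the idea on a made-up hierarchy. It does not reproduce ./labels/mapping/apply_snomed_mapping.py; the codes are invented placeholders rather than SNOMED identifiers, and each code is given a single parent for brevity, whereas SNOMED CT concepts can have several.

```python
def propagate_up(labels, parent_of):
    """Add every ancestor of each label according to `parent_of`.

    `parent_of` maps a code to its parent code; root codes are absent.
    Codes are treated as having a single parent here for brevity.
    """
    expanded = set()
    for code in labels:
        # Walk towards the root; stop early at codes already visited.
        while code is not None and code not in expanded:
            expanded.add(code)
            code = parent_of.get(code)
    return expanded


# Hypothetical toy hierarchy with made-up codes, not real SNOMED IDs.
toy_parent_of = {
    "anterior_mi": "mi",
    "inferior_mi": "mi",
    "mi": "heart_disease",
    "lbbb": "conduction_disorder",
    "conduction_disorder": "heart_disease",
}

expanded = propagate_up({"anterior_mi", "lbbb"}, toy_parent_of)
```

After propagation, two label sets that use different levels of detail (for example an algorithm statement and a human annotation) can be compared on their shared ancestor codes.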


Usage Notes

For the convenience of the user, we publish code that demonstrates how to train models based on ECG features and how to use the different label sets to train, evaluate and compare machine learning models following the benchmarking criteria established in our associated publications [6, 7]. This should provide a good starting point for users' own investigations of the dataset. In our opinion, the availability of the additional features significantly increases the usability of the PTB-XL dataset, as ML models can be trained on features, and on combinations of raw data and features, for the first time on a larger scale. Furthermore, the quality of features from different feature sets can be investigated, and the strengths and weaknesses of diagnostic statements provided by modern ECG analysis software can be examined in more detail.


Release Notes

v1.0.0 Initial release of the dataset


Ethics

This is a derivative dataset of the PTB-XL project. The Institutional Ethics Committee approved the publication of the source data in an open-access database (PTB-2020-1).


Conflicts of Interest

The contributors have no conflicts of interest to declare.


References

  1. Wagner, P. et al. PTB-XL, a large publicly available electrocardiography dataset. Scientific Data 7, 154 (2020).
  2. Wagner, P., Strodthoff, N., Bousseljot, R.-D., Samek, W. & Schaeffter, T. PTB-XL, a large publicly available electrocardiography dataset (2020).
  3. Macfarlane, P., Devine, B. & Clark, E. The University of Glasgow (Uni-G) ECG analysis program. In Computers in Cardiology, 2005, 451–454 (2005).
  4. GE Healthcare, Marquette 12SL ECG Analysis Program: Physician’s Guide. General Electric Company. 2056246-002C (2019).
  5. Pilia, N., Nagel, C., Lenis, G., Becker, S., Dössel, O. & Loewe, A. ECGdeli - an open source ECG delineation toolbox for MATLAB. SoftwareX 13, 100639 (2021).
  6. Strodthoff, N., Wagner, P., Schaeffter, T. & Samek, W. Deep learning for ECG analysis: Benchmarks and insights from PTB-XL. IEEE Journal of Biomedical and Health Informatics 25, 1519–1528 (2021).
  7. Code to train and evaluate models based on PTB-XL+ ECG features. Zenodo. https://zenodo.org/record/7323537#.Y3OTJr7MKEA

Parent Projects
PTB-XL+, a comprehensive electrocardiographic feature dataset was derived from the PTB-XL dataset [1, 2]. Please cite it when using this project.
Access

Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.

License (for files):
Creative Commons Attribution 4.0 International Public License

Files

Total uncompressed size: 2.0 GB.

Files in <base>/labels/mapping:
Name Size Modified
12slv23ToSNOMED.csv 27.6 KB 2022-11-16
apply_snomed_mapping.py 10.9 KB 2022-11-16
ptbxlToSNOMED.csv 13.1 KB 2022-11-16