Choosing a project type
When submitting a project to PhysioNet you will be asked to select one of the following project types:
- Database: Research data with significant potential for reuse by the research community. This may include data that enables published studies to be reproduced, data for benchmarking algorithms, and data that supports novel investigations.
- Software: Software that has been developed for research applications.
- Challenge: Description of a challenge for the research community. Files such as datasets and software may be included as part of the challenge.
- Model: An implementation of a statistical or machine learning model with potential for reuse by the research community. Typically models will be created by a training process and may have dependencies on specific computational frameworks.
Creating the project metadata
To help the community to reuse your shared resources, we require a detailed description. The information that you provide should focus on the resource and how it might be reused. During the submission process you are asked to provide information such as a title, an abstract for distribution to search indexes, and context describing the manner in which the resource was created. Further details are outlined below:
- Title: Your title should be no longer than 200 characters. Avoid acronyms and abbreviations where possible. Also avoid leading with "The". Only letters, numbers, spaces, underscores, and hyphens are allowed.
- If your dataset is derived from MIMIC and you would like to use the MIMIC acronym, please include the letters "Ext" (for example, "MIMIC-IV-Ext-YOUR-DATASET"). "Ext" may indicate either "extracted" (e.g. a derived subset) or "extended" (e.g. annotations), depending on your use case.
- If the dataset is derived from another dataset, the title must make this clear.
- Abstract: Your abstract must be no longer than 250 words. The focus should be on the resource being shared. If the resource was generated as part of a scientific investigation, relevant information may be provided to facilitate reuse. References should not be included. The abstract should also include a high-level description of the data as well as an overview of the key aims of the project. The abstract may appear in search indexes independently of the full project metadata, so providing detailed information about the content is important.
- Background: Your background should provide the reader with an introduction to the resource. The section should offer context in which the resource was created and outline your motivations for sharing.
- Methods & Technical Implementation: The "Methods" and "Technical Implementation" sections provide details of the procedures used to create your resource including, but not limited to, how the data was collected, any measurement devices, etc. For software, the section may cover aspects such as development process, software design, and description of algorithms. For data, the section may include details such as experimental design, data acquisition, and data processing.
- Content description: Your content (data, software, model) description should describe the resource in detail, outlining how files are structured, file formats, and a description of what the files contain. We also suggest including summary statistics where appropriate (e.g. total number of distinct patients, number of files, types of signals, the time span over which the data was collected, etc.).
- Usage notes: This section should provide the reader with information relevant to reuse. Why is this data useful for the community?
- In particular we suggest discussing: (1) how the data has already been used (citing relevant papers); (2) the reuse potential of the dataset; (3) known limitations that users should be aware of when using the resource; and (4) any complementary code or datasets that might be of interest to the user community.
- Ethics: Please provide a statement on the ethics of your work. Think about the project impact and briefly highlight both benefits and risks. Please also add relevant institutional review details here, for example:
- Data collected from human subjects: Please provide a statement that the study protocol was approved by relevant Institutional Review Boards (IRBs) or ethics committees. If human participants gave written informed consent, then please state this.
- Clinical trial data: Please specify trial registration number and registry name.
- Data collected from animals: Please specify the animal care guidelines used in collecting your data. See, for example, this project and this official NIH manual.
- Acknowledgments: In this section, acknowledge the people who helped with the research, but who were not included as co-authors. In addition, provide funding information.
- Conflicts of interest: A statement on potential conflicts of interest is required. If the authors have no conflicts of interest, the section should say "The author(s) have no conflicts of interest to declare".
- Version: The version number of the resource. Semantic versioning is encouraged (major version, minor version, patch version). If unsure, put "1.0.0".
- References: Please use the Vancouver reference style. All citations should be numbered sequentially in the text in square brackets. For example, the first citation [1], the second citation [2], and the third and fourth citations [3,4]. Entries in the reference list should be in the following style: 1. Xu YZ, Geng DC, Mao HQ, Zhu XS, Yang HL (2010). "A comparison of the proximal femoral nail antirotation device and dynamic hip screw in the treatment of unstable pertrochanteric fracture". J Int Med Res. 38: 1266–1275. PMID 20925999.
- Weblinks: Please do not include URLs/weblinks/hyperlinks in the main text. All external resources (including websites, publications, datasets, and GitHub repositories) should be added to the References section and cited in the main text in the style [1], [2-4].
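The limits on the title, abstract, and version listed above can be checked before submission. The following is a minimal sketch, not an official PhysioNet tool: the function names are hypothetical, "letters" is assumed to mean ASCII letters, and the version check assumes a plain "major.minor.patch" form with no pre-release suffix.

```python
import re

TITLE_MAX_CHARS = 200
ABSTRACT_MAX_WORDS = 250
# Assumption: "letters" means ASCII letters; numbers, spaces, underscores, hyphens also allowed.
TITLE_ALLOWED = re.compile(r"[A-Za-z0-9 _-]+")
# Assumption: three dot-separated integers, e.g. "1.0.0".
VERSION_PATTERN = re.compile(r"\d+\.\d+\.\d+")


def check_title(title):
    """Return a list of problems found in a project title (empty if none)."""
    problems = []
    if len(title) > TITLE_MAX_CHARS:
        problems.append("title is longer than 200 characters")
    if not TITLE_ALLOWED.fullmatch(title):
        problems.append(
            "title must be non-empty and use only letters, numbers, "
            "spaces, underscores, and hyphens"
        )
    if title.lower().startswith("the "):
        problems.append('title leads with "The"')
    return problems


def check_abstract(abstract):
    """Return a list of problems found in an abstract (empty if none)."""
    # Words are counted by splitting on whitespace.
    word_count = len(abstract.split())
    if word_count > ABSTRACT_MAX_WORDS:
        return [f"abstract has {word_count} words (limit is 250)"]
    return []


def check_version(version):
    """Return a list of problems found in a version string (empty if none)."""
    if VERSION_PATTERN.fullmatch(version):
        return []
    return ['version is not of the form "major.minor.patch"']


if __name__ == "__main__":
    print(check_title("MIMIC-IV-Ext-YOUR-DATASET"))
    print(check_title("The ECG: database"))
    print(check_version("1.0"))
```

These checks cover only the mechanical rules. Guidance such as avoiding acronyms or making a derived dataset's origin clear in the title still needs a human read.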
Preparing your project files
PhysioNet publishes content such as data and software for reuse by the research community. We typically do not review and publish content that reports on scientific findings. Scientific findings should be published elsewhere (for example, in a journal or conference). Our goals are to ensure that the content is safe to share and that it is sufficiently well structured and described for it to be a valuable resource for the research community. When submitting a project, you will be asked to upload relevant data and software files. Please review the following guidelines when preparing your files for submission:
- All projects:
- README file: A README file should be included alongside the files. At minimum, the readme should include a title and a brief description of the package content.
- Protected Health Information (PHI): All protected health information must be removed. All dates (except year), including data collection dates, must be date-shifted or removed. The comprehensive guide for de-identifying data can be seen here.
- File naming: All files should be clearly named and must not include spaces (use underscores instead for increased readability) or special characters (e.g. "/", "\", "."). Further, filenames should generally be lowercase (exceptions are "special files", such as the RECORDS and ANNOTATORS files used for waveforms). Please keep filenames brief (e.g. 1001abp.dat is better than subject_1001_ABP_wave_100Hz.dat).
- File types: All files must be in open-source format and machine readable. Files in proprietary formats, such as MATLAB data files, Excel spreadsheets, or Microsoft Word documents, are not supported and must be converted to open-source format. For example, MATLAB data, Excel spreadsheets, or Microsoft Word documents can be converted to CSV format. Some suggested formats for data based on its usage can be seen here.
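The file naming rules in this list lend themselves to an automated check. The sketch below is illustrative only: `check_filename` is a hypothetical helper, hyphens are assumed to be acceptable, and a dot is assumed to be allowed only as the separator before a single extension. The exemption set holds just the two special files named above and can be extended.

```python
import re
from pathlib import PurePosixPath

# Special files named in the guidelines that are exempt from the lowercase rule.
SPECIAL_FILES = {"RECORDS", "ANNOTATORS"}
# Assumption: lowercase letters, digits, underscores, hyphens, plus one optional extension.
NAME_PATTERN = re.compile(r"[a-z0-9_-]+(\.[a-z0-9]+)?")


def check_filename(path):
    """Return a list of naming problems for the final component of path."""
    name = PurePosixPath(path).name
    if name in SPECIAL_FILES:
        return []
    problems = []
    if " " in name:
        problems.append("contains spaces (use underscores instead)")
    if name != name.lower():
        problems.append("contains uppercase letters")
    # Normalize case and spaces first so each problem is reported only once.
    if not NAME_PATTERN.fullmatch(name.lower().replace(" ", "_")):
        problems.append("contains special characters")
    return problems


if __name__ == "__main__":
    for candidate in ["1001abp.dat", "Subject 1001.dat", "data.v2.csv", "RECORDS"]:
        print(candidate, check_filename(candidate))
```

The check does not judge brevity or clarity, which remain a matter of judgment.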
- Data (general):
- Small datasets: Comma-Separated Value (CSV) is a good format for small datasets. CSVs should be formatted according to the RFC 4180 specification.
- Tidy data: Information needed for reuse of the data must be provided. In most cases, tabulated datasets should be structured following the principles of "tidy data". For example, each variable should be in a column and each observation (or case) in a row.
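As an illustration of the tidy data principle, the sketch below reshapes a small "wide" table (one column per measurement day) into one row per observation and writes it with Python's standard `csv` module, whose default dialect uses CRLF line endings and double-quote escaping in line with RFC 4180. The column names and values are made up for the example.

```python
import csv
import io

# Hypothetical "wide" table: one heart-rate column per day.
wide_rows = [
    {"subject_id": "1001", "hr_day1": "72", "hr_day2": "75"},
    {"subject_id": "1002", "hr_day1": "88", "hr_day2": "84"},
]


def to_tidy(rows):
    """Return one row per (subject, day) observation, one variable per column."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column.startswith("hr_day"):
                tidy.append({
                    "subject_id": row["subject_id"],
                    "day": int(column[len("hr_day"):]),
                    "heart_rate": int(value),
                })
    return tidy


def to_csv_text(rows):
    """Serialize tidy rows as CSV text using the csv module's default dialect."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["subject_id", "day", "heart_rate"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tidy_rows = to_tidy(wide_rows)
csv_text = to_csv_text(tidy_rows)

if __name__ == "__main__":
    print(csv_text)
```

When writing to a real file instead of a string buffer, open it with `newline=""` so that the `csv` module controls the line endings.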
- Data (waveforms):
- WFDB compatibility: In general, high time-resolution data such as ECG and EEG recordings should be stored in a WFDB compatible (or other open-source) format. Details such as gain and baseline should be included in the file headers. For detailed guidance on creating MIT format signal files, see this tutorial, and for EDF format, see this tutorial.
- Build an index file (RECORDS) of your waveform records: Provide a file named RECORDS at the top-level directory of your submission. The RECORDS file must list every waveform record in your contribution, one per row: each row is the name of a WFDB record (without the .hea or .dat extension) or of an EDF file (with the .edf extension). Example files can be seen here and here. (Note that for EDF files, the .edf extension must be included in the file name, as seen here.)
- Additional subject data: Information about the subjects can be included either at the bottom of the signal header files or in a separate text file. Preferred information includes: age, gender, height, weight, medications, and diagnoses. If relevant: gestational age.
- Visualize and check your waveforms using LightWAVE: If you upload the RECORDS file, you will see a link to view your project's waveforms in LightWAVE. Note that the RECORDS file must be at the top-level directory in your submission, and must be named exactly RECORDS for LightWAVE to locate the file. If your WFDB (or EDF) files are organized in a sub-directory in your project, the relative path of the file location must be specified in the RECORDS file. For example, if you have a WFDB file named "subject1_ecg.dat" under a subdirectory "ECG", then a row in the RECORDS file should read: "ECG/subject1_ecg".
- Check for valid signal types (WFDB format): One way to check WFDB formatted files for valid signal types is the wfdbcheck auxiliary function of the WFDB software package. This program attempts to find errors of kinds that have occurred previously in both signal and annotation files. It is not comprehensive, so please report any errors that you encounter, along with a description of any new signal type that you would like added to the signal type dictionary.
- Signal and waveform channel naming conventions: We use standardized signal names and units for all waveform records for consistency across databases. Please name your signals using the standardized names in the PhysioNet standard signal name list. Details of the format are at the top of wfdbcal, and also in wfdbcal(5). For signal names, upper case should be used where it improves readability (e.g. ABP_Sys is better than abp_sys).
- If your signals are not already in our standard signal name list, please specify the following information so that PhysioNet tools like LightWAVE can display signals with a reasonable plotting scale by default.
- Please define the vertical scale for each signal not already defined in wfdbcal. Specifically, for any signals of types not listed in wfdbcal, please supply additional one-line entries to be added to that calibration file, in a plain text file named "CALIBRATION". If you are not sure how to construct such a line, just let us know the typical ranges (in physical units) for each of these signal types.
- Annotations or event locations: Annotations and event locations should be provided as WFDB annotation files. You can use the MATLAB Toolbox or the Python Toolbox to create annotations. See the command wrann.
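The RECORDS index described in this list can be generated rather than typed by hand. The following sketch uses only the Python standard library and is not part of the WFDB software package: `list_records` and `write_records` are hypothetical helpers, and the sketch assumes that every WFDB record has a `.hea` header file. Multi-segment records, whose segments have their own headers, may need the resulting list pruned by hand.

```python
from pathlib import Path


def list_records(project_dir):
    """Return sorted RECORDS entries for the waveform files under project_dir."""
    root = Path(project_dir)
    entries = set()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        relative = path.relative_to(root)
        if path.suffix == ".hea":
            # WFDB records are listed without the file extension.
            entries.add(relative.with_suffix("").as_posix())
        elif path.suffix == ".edf":
            # EDF files are listed with the .edf extension.
            entries.add(relative.as_posix())
    return sorted(entries)


def write_records(project_dir):
    """Write a RECORDS file at the top level of project_dir and return its entries."""
    entries = list_records(project_dir)
    records_path = Path(project_dir) / "RECORDS"
    records_path.write_text("".join(f"{entry}\n" for entry in entries))
    return entries


if __name__ == "__main__":
    # Run from the top-level directory of the project.
    for entry in write_records("."):
        print(entry)
```

Entries are written as paths relative to the top-level directory with forward slashes, so a header at ECG/subject1_ecg.hea yields the row ECG/subject1_ecg.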
- Software:
- Instructions for installation and usage should be clearly documented.
- Dependencies should be indicated in a requirements file or similar.
- Unit tests should be used to demonstrate correct functioning of major features of the software.
- Standard style guidelines should be followed where appropriate (for example, PEP8 for Python).