Why share data?

The NIH's Data Sharing FAQ notes that there are many benefits that can be realized when research data are shared with the scientific community:

To this list, we might add these benefits that accrue to the investigator who shares his or her research data:

Data sharing offers all of these benefits and more. Since 2003, the NIH has required many recipients of its grants to share research data, and many scientific journals consider disclosure of research data to be an essential component of peer review.

Why, then, is it so common for a research project to come near to completion without having implemented a data sharing plan? Time runs short, papers have to be written, and the graduate student who was collecting the data (and who is the only one who understands how they are stored) has defended his thesis and is looking for a job. Next to his computer, there's a shoebox labelled "Data for Bob's thesis", and it's full of CDs numbered, let's see ... 1, 2, 3, 7, 8, 8A, 12, and "January (?)". The principal investigator was planning to ask Alice to figure out how Bob organized the data, but there are no funds left from the grant to pay her, and anyway she's writing the next grant application.

Don't let this happen to your project! We encourage you to think ahead about data sharing. We can help you design and implement a plan for archiving your data now that will make it easy to share them later.

Am I forfeiting an intellectual property interest in my data by publicly sharing them?

No. Data are not eligible for copyright protection in most countries, no matter how much time or effort was required to obtain them. (Compilations of data may be copyrightable in some circumstances if the selection process involves a sufficient amount of original creative expression, although the individual components of such compilations remain unprotected by copyright.)

Researchers who reuse data from PhysioNet are expected to follow community norms for scholarly communication and cite those who collected and published the data, just as they would cite a published paper. Any reuse of the data for publication is expected to cite both the data and the original publication from which they were derived. [This question and answer are adapted from the Dryad JDAP FAQ.]

Do I need to include a data sharing plan in my grant application?

Your grant application may require you to include a plan to make the data you collect in the course of your publicly funded research publicly available. For example, the NIH expects that grant applications seeking $500,000 or more in direct costs in any single year will include a plan for data sharing (or state why data sharing is not possible), and generally encourages development and implementation of data sharing plans in all research it funds.

How should I estimate a review date for my project's data?

By making protected data archives available to researchers, PhysioNet aims to make it easier for researchers to develop and share high-quality data sets, and ultimately to increase the quantity, quality, and variety of data available to the research community. It is thus contrary to our goals to devote PhysioNet resources to projects for which we do not have a reasonable expectation that a useful data set will be made available within a time commensurate with the work required to create it and to study it. We do not require researchers to commit to a specific review date for their data, but we give preference in allocating our resources to those who do so.

As a point of reference, the NIH's Data Sharing FAQ notes:

Recognizing that the value of data often depends on their timeliness, data sharing should occur in a timely fashion. NIH expects the timely release and sharing of data to be no later than the acceptance for publication of the main findings from the final data set. This time point will be influenced by the nature of the data collected. Data from small studies can be analyzed and submitted for publication relatively quickly. If data from large epidemiologic or longitudinal studies are collected over several discrete time periods or waves, data should be released in waves as data become available or main findings from waves of the data are published. NIH recognizes that the investigators who collected the data have a legitimate interest in benefiting from their investment of time and effort. NIH continues to expect that the initial investigators may benefit from the first and continuing use, but not from prolonged exclusive use. While NIH also understands that an institution's desire to exercise its intellectual property rights may justify a need to delay disclosure of research findings, a delay of 30 to 60 days is generally viewed as a reasonable period for such activity.

If I store data on PhysioNet, what will happen if PhysioNet discontinues its services?

As noted in the model data sharing plan, PhysioNet services are provided using free and open-source software that may be duplicated by the project to create a fully functional copy (a mirror) of its protected data archive on standard PC hardware. If for any reason it becomes necessary to discontinue its services to the project, PhysioNet agrees to provide at least sixty days' notice to the principal investigator to permit the project to construct such a copy.

We recommend that you create a mirror for your own protection, but you are not obligated to do so. Creating a mirror is an easy and inexpensive way to ensure that your data archive remains accessible. A mirror includes the PhysioNet infrastructure (all of the web services we provide to support your use of PhysioNet) and a copy of your protected data archive(s) as well. To create a mirror you will need a networked PC with sufficient storage for your protected data archives (and about 10 GB more for workspace and the operating system and software including the web server and the PhysioNet infrastructure). We provide all of the software needed, including the operating system (GNU/Linux, although you may choose a different version than the one we use if you prefer) and a simple script for setting up your mirror in about an hour.
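The storage requirement described above can be checked before you begin. The following is a minimal sketch, not part of the PhysioNet setup script: the function names are hypothetical, the 10 GB overhead is the approximate figure given above, and decimal gigabytes are assumed.

```python
import shutil

GB = 10**9  # decimal gigabyte (an assumption; the text does not say which unit is meant)

# "About 10 GB more" for workspace, operating system, and software, per the text above.
OVERHEAD_BYTES = 10 * GB


def required_mirror_bytes(archive_bytes: int) -> int:
    """Return the approximate space a mirror needs: archive size plus fixed overhead."""
    if archive_bytes < 0:
        raise ValueError("archive size cannot be negative")
    return archive_bytes + OVERHEAD_BYTES


def has_room_for_mirror(archive_bytes: int, path: str = ".") -> bool:
    """Report whether the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_mirror_bytes(archive_bytes)


if __name__ == "__main__":
    # Hypothetical example: a 250 GB protected data archive.
    archive = 250 * GB
    print("Space needed (GB):", required_mirror_bytes(archive) / GB)
    print("Enough room here:", has_room_for_mirror(archive))
```

Since the 10 GB figure is approximate, treat a result that only barely passes as a reason to allow more headroom.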

If you create a mirror, you may set up an automated process to keep it synchronized with PhysioNet at regular intervals, so that improvements and additions that we make in PhysioNet's web services will also be available in your mirror. Synchronization uses rsync for speed and security. Alternatively, your mirror may simply be a "snapshot" of your protected data archive, with or without the PhysioNet infrastructure.
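As an illustration of what a scheduled synchronization job might look like, here is a minimal sketch that builds and runs an rsync command. It is not the script we provide: the source address and destination path are hypothetical placeholders, and the choice of options (`--archive`, `--compress`, `--delete`) is an assumption about what a typical mirror would want.

```python
import subprocess
from typing import List


def build_sync_command(source: str, destination: str, dry_run: bool = True) -> List[str]:
    """Build an rsync command that makes `destination` match `source`."""
    command = [
        "rsync",
        "--archive",   # preserve permissions, timestamps, and directory structure
        "--compress",  # compress data in transit
        "--delete",    # remove files from the mirror that no longer exist at the source
    ]
    if dry_run:
        command.append("--dry-run")  # report what would change without changing anything
    command += [source, destination]
    return command


def sync_mirror(source: str, destination: str, dry_run: bool = True) -> int:
    """Run rsync and return its exit status (0 means success)."""
    return subprocess.run(build_sync_command(source, destination, dry_run), check=False).returncode


if __name__ == "__main__":
    # Both values below are hypothetical; substitute the ones supplied for your project.
    command = build_sync_command("rsync://mirror-source.example.org/archive/", "/srv/mirror/archive/")
    print(" ".join(command))
```

The sketch defaults to a dry run because `--delete` removes files from the mirror; review the reported changes before running it with `dry_run=False` from a scheduler such as cron.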

Can PhysioNet host my project's very large data set?

Probably. PhysioBank currently hosts over 50 data sets ranging in size from less than a megabyte to more than a terabyte. We add capacity regularly.

What fees does PhysioNet charge for data sharing?

PhysioNet offers a variety of services to support researchers who wish to share their data; see Data Sharing on PhysioNet for details. Most of these services are available without cost to your project.

For projects that require significant support from PhysioNet staff, support requested by the project and provided by PhysioNet before the review date is billed to the project at $100 per hour (1 hour minimum per incident) after the first two hours. The services that can be requested in this way are:

These support charges can be avoided in most cases. The principal investigator or another member of the project may upload data into the protected data archive using a web browser. This method is feasible even for large data sets and is the method recommended by PhysioNet. Project members can also browse the incremental backups in order to recover lost data. Consultation via email is available without cost.

There is no cost to your project for a protected data archive that can store up to 100 gigabytes, or for a public data archive of any size that we can feasibly accommodate. Projects requiring more protected storage can obtain it for $500 per increment of up to 2 terabytes, for up to three years (non-refundable, renewable once).

Once you have released your data for review prior to inclusion in PhysioBank, there is no further cost for support or storage.
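For budgeting purposes, the fees described above can be estimated with a short calculation. This is a rough sketch under stated assumptions, not an official fee calculator: it assumes that each support incident is counted as at least one hour, that the first two hours are free across the whole project, that storage increments apply only to protected storage beyond the free 100 gigabytes, and that 1 terabyte is 1000 gigabytes. The function names are hypothetical.

```python
import math

SUPPORT_RATE = 100          # dollars per hour of staff support
FREE_SUPPORT_HOURS = 2      # first two hours are not billed
MIN_HOURS_PER_INCIDENT = 1  # one hour minimum per incident
FREE_PROTECTED_GB = 100     # protected storage included at no cost
INCREMENT_GB = 2000         # one increment covers up to 2 TB (assuming 1 TB = 1000 GB)
INCREMENT_PRICE = 500       # dollars per increment, for up to three years


def support_charge(incident_hours):
    """Estimate support fees from a list of hours spent per incident."""
    # Assumption: the per-incident minimum is applied before the free hours are deducted.
    billable = sum(max(MIN_HOURS_PER_INCIDENT, hours) for hours in incident_hours)
    return max(0, billable - FREE_SUPPORT_HOURS) * SUPPORT_RATE


def storage_charge(protected_gb):
    """Estimate the fee for protected storage beyond the free allowance."""
    extra_gb = max(0, protected_gb - FREE_PROTECTED_GB)
    return math.ceil(extra_gb / INCREMENT_GB) * INCREMENT_PRICE


if __name__ == "__main__":
    # Hypothetical project: three support incidents and 3000 GB of protected data.
    print("Support:", support_charge([0.5, 1.5, 3.0]))
    print("Storage:", storage_charge(3000))
```

A project that uploads its own data and stays within 100 gigabytes of protected storage comes out at zero under both functions, which matches the no-cost case described above.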

What will it cost to share my project's data?

Perhaps the most common flaw in planning for data sharing is not budgeting for it. Data sharing doesn't have to be expensive, but it does have a cost. A data sharing plan that is well-integrated with a data backup plan (everyone needs one of those!) may have little or no incremental cost beyond that of backup, and can be expected to result in a useful resource for the research community. A data sharing plan conceived and executed as an afterthought can be expensive and is much less likely to produce a data set that can support follow-up studies and attract favorable attention to its creator.

For most projects, the only costs associated with data sharing are those incurred in de-identification of the data prior to transmitting the data to PhysioNet Works, and in preparing documentation that describes the data set in sufficient detail to permit its use by other researchers. Uploading data requires a moderate amount of your time and attention, but there is no fee imposed by PhysioNet unless you require assistance.

Why is de-identification necessary in order to share data via PhysioNet?

In the USA, the HIPAA Privacy Rule restricts sharing of data containing protected health information (PHI) of human subjects. Such data can be shared only as so-called limited data sets under a data use agreement (DUA) that forbids redistribution. On the other hand, the HIPAA Privacy Rule also defines a so-called safe harbor rule for creating de-identified data that are exempt from restrictions on sharing. The safe harbor rule requires removal of 18 specific types of information in order to create a de-identified data set from one originally containing PHI.

PhysioNet does not accept contributions of limited data sets containing PHI, since we would not be permitted to redistribute them, and because the original investigators remain responsible for the proper use of any PHI in such data in any event.

We invite contributions of de-identified data sets from researchers who need assurance that their data will be made available reliably and in a way that will encourage their discovery, exploration, and further study by other researchers.

Can you upload my data?

Yes, we can upload your data from portable media or storage devices, or from your FTP or web site if you wish. We charge $100 per hour (1 hour minimum) for any of these services, however; you can avoid this charge by uploading your data yourself, using your web browser. This is practical even for large data sets.

PhysioNet is able to provide reliable and secure protected data archives to researchers at little or no cost because these archives are self-service facilities that for most projects do not require significant human resources for PhysioNet to support. Uploading data is not technically difficult, but it does require human time and attention, which is usually supplied by the project that generates the data and not by the PhysioNet resource.

Do not send data on physical media without contacting us first!

