We encourage you to set up a mirror of PhysioNet if you wish. If you do so, please use rsync as described below to minimize the impact on other users.
If you simply wish to retrieve a (possibly large) set of files from PhysioNet without downloading them one at a time, you don't need to set up a mirror of PhysioNet; see the PhysioNet FAQ for instructions on downloading an entire PhysioBank database in one step using rsync, or use GNU wget to download any desired selection of files.
PhysioNet is organized in "volumes" that can be mirrored separately. All PhysioNet mirrors provide the "base volume", which contains all of the software, tutorials, reference manuals, publications, and many of the PhysioBank databases. Additional volumes, which are optional for the mirrors, contain very large data sets that don't fit in the base volume. Each volume has a maximum size that it will not exceed: 25 GB each for the base volume and volume 2; 100 GB each for volumes 3 and 4; 2.5 TB each for volumes 5 and 6; and 10 TB each for volumes 7 and 8. More than one volume can occupy a single sufficiently large disk partition.
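When planning disk space, it can help to add up the size limits of the volumes you intend to mirror. The following is a minimal sketch, not part of the mirror kit: the function names are hypothetical, the base volume is treated as volume 1, and 1 TB is taken as 1000 GB.

```shell
#!/bin/sh
# Maximum size in GB of a PhysioNet volume, per the limits stated above.
# Assumption: the base volume is addressed as volume 1.
volume_max_gb() {
    case "$1" in
        1|2) echo 25 ;;
        3|4) echo 100 ;;
        5|6) echo 2500 ;;    # 2.5 TB, taking 1 TB = 1000 GB
        7|8) echo 10000 ;;   # 10 TB
        *)   echo "unknown volume: $1" >&2; return 1 ;;
    esac
}

# Print the sum of the maximum sizes of the volumes given as arguments.
total_max_gb() {
    total=0
    for v in "$@"; do
        size=$(volume_max_gb "$v") || return 1
        total=$((total + size))
    done
    echo "$total"
}
```

For example, total_max_gb 1 2 3 prints 150, the worst-case space in GB needed for the base volume plus volumes 2 and 3.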
We use and recommend the following configuration for a PhysioNet web server:
If you plan to install and run the PhysioBank ATM on your mirror, we recommend a CPU with a clock speed of 1 GHz or faster, 512 MB or more RAM, and at least 10 GB of additional disk space for the ATM cache. Most PCs built since 2004 meet or exceed these recommendations.
All of the software needed, including the operating system, is freely available open-source software. The cost of suitable hardware can be under US$300; of course it is possible to spend considerably more. If you cannot or do not wish to run Apache under Linux, many other configurations are possible, but we will not be able to help you troubleshoot your setup. Other versions of Linux or Unix, including Mac OS X, should be usable without difficulty. Although we do not recommend or support MS Windows, versions of all of the necessary software are freely available for MS Windows as optional components of Cygwin. The remainder of these notes assumes that you are using Fedora 14.
Currently, 100 GB to 3 TB SATA drives are widely available and are usually least expensive per gigabyte. (4 TB drives are beginning to appear at higher cost per gigabyte.) IDE (PATA) drives can also be used in many older PCs. SATA drives, and any drives larger than 127 GB, may require a controller card in PCs made before 2006. Most current PC motherboards include integrated SATA controllers, but many no longer support IDE drives.
Please let us know if you encounter any difficulties with this procedure!
You will need to have root (administrative) privileges for some of the steps below. Once steps 1 and 2 are complete, the remaining numbered steps can be finished in 10 to 15 minutes.
rsync -a physionet.org::mirror-setup mirror-setup
to verify that you are able to communicate with the PhysioNet master server, and to download a few short files for setting up your mirror, which will go into a subdirectory called mirror-setup within the current directory (e.g., /home/pn/mirror-setup). If mirror-setup doesn't exist already, rsync will create it.
cd mirror-setup
If you wish, read through the various files you have downloaded to see how they work, and then run:
./configure
The first time you run it, configure will ask a few questions, but it will remember your answers (in mirror.conf) and will not ask them again if you rerun it.
When the directories are ready, set the variables P2, P3, ... to the names of these directories by editing mirror.conf, and then run configure again.
./install
in order to schedule daily mirror updates, and purging of the temporary/cache directory. Once you have done this, these processes will begin automatically within 24 hours.
mirror-update -v
If you do so, do not allow this process to continue while your scheduled daily update is running. It is safe to interrupt this process at any time; if you run it again, it will continue from where it was interrupted.
The -v option (which may be omitted) causes the updater to report each file transfer as it happens.
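One way to keep a manual update from overlapping with a scheduled one is a simple lock. The sketch below is an assumption rather than something the mirror kit provides: run_exclusive and the lock path are hypothetical, and the lock only helps if the scheduled job is started through the same wrapper.

```shell
#!/bin/sh
# Run a command only if no other instance holds the lock directory.
# mkdir is atomic: it fails if the directory already exists.
# LOCKDIR is a hypothetical path, not part of the mirror kit.
LOCKDIR="${LOCKDIR:-/tmp/mirror-update.lock}"

run_exclusive() {
    if ! mkdir "$LOCKDIR" 2>/dev/null; then
        echo "another update appears to be running; not starting" >&2
        return 75
    fi
    "$@"
    status=$?
    rmdir "$LOCKDIR"
    return "$status"
}
# Usage:
#   run_exclusive mirror-update -v
```

If a run is interrupted before it can remove the lock directory, the directory has to be removed by hand before the next run will start.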
Depending on your choice of optional volumes in step 8, and on the speed of your Internet connection, it may take several hours (or even days) to retrieve all of the files you have chosen to mirror the first time. Subsequent updates (see below) will be much faster; if your mirror is reasonably up-to-date, the mirror script will typically finish in 1 to 10 minutes.
Rarely, a daily update may require an unusually large download, for example when a large amount of data has been added to the PhysioNet server, or when you add previously unmirrored volumes to your mirror. By default, your daily update will stop after 2 hours, and the next daily update will continue where the previous one ended.
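How mirror-update enforces its 2-hour limit is not described here. If you want to put a time limit of your own on a manual run, one possible approach is the timeout command from GNU coreutils, as in this sketch (run_with_limit is a hypothetical helper, not part of the kit):

```shell
#!/bin/sh
# Run a command, but stop it once LIMIT seconds have passed.
# timeout (GNU coreutils) sends SIGTERM at the deadline and exits with 124.
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
}
# Usage, limiting a manual run to one hour:
#   run_with_limit 3600 mirror-update -v
```

Since an interrupted update continues from where it stopped the next time it runs, cutting a run short in this way should only delay completion.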
When the update is completed, mirror-update sends a report to physionet.org, where it is digested and incorporated into the PhysioNet Mirrors page. The report lists the URL and geographic location of your mirror, and the time at which it was most recently updated, to help PhysioNet visitors choose a suitable mirror.
In order to give you a chance to test and make any necessary adjustments to your new mirror site, it will not appear on the Mirrors page until it has been running for a few days and has been updated at least once.
If your mirror meets the recommended requirements above (1 GHz or faster CPU, 512 MB RAM, and 10 GB spare disk space), you can run the PhysioBank ATM locally. (Otherwise, visitors are redirected to the PhysioNet master server if they follow links to the ATM.) If you wish to run the ATM, first allow your mirror to complete at least two daily updates. (This will ensure that your mirror's working copies of the WFDB software needed by the ATM are up-to-date.) Then, in this directory, run this command as root:
./enable-ATM
This command installs plt from PhysioToolkit if necessary, and then creates the ATM's cache directory. The ATM will begin working locally as soon as the cache directory has been created.
Test the ATM to verify that it is working properly. If it is not, please disable it until the problem can be corrected, so that your mirror's users can have ready access to the ATM services on the master server. To turn off local ATM services for any reason, return to this directory and run this command as root:
./disable-ATM
This command removes the ATM's cache directory, disabling local ATM services.
Once your site is listed on the Mirrors page (or linked to from any other public web page), the web spiders of the major search engines, such as Google, will begin indexing it. Typically these spiders will consume a significant amount of bandwidth when they first visit your site, but this will decrease to a much lower amount once your site is fully indexed. If this traffic is a problem (for example, if your mirror shares a network connection, or if your total monthly throughput is limited or metered by your ISP), you can avoid almost all of it. Otherwise, leave the spiders alone: it is useful if users can find pages on your mirror using a search engine, after all.
You can exclude most web spider traffic by modifying /home/physionet/html/robots.txt. Before doing this, modify your /usr/bin/mirror-update script by changing the line that reads:
rsync $RSOPTS physionet.org::physionet /home/physionet
to
rsync $RSOPTS --exclude robots.txt physionet.org::physionet /home/physionet
(This change will prevent the daily updates from replacing your customized robots.txt with the original one.)
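The change to mirror-update can also be made with sed instead of a text editor. This is a minimal sketch: patch_mirror_update is a hypothetical helper, and it assumes that the rsync line in your copy of the script reads exactly as shown above.

```shell
#!/bin/sh
# Add "--exclude robots.txt" to the rsync line of a mirror-update script.
# Usage (as root): patch_mirror_update /usr/bin/mirror-update
# The unmodified script is kept as FILE.orig. Rerunning is harmless:
# once patched, the line no longer matches the pattern.
patch_mirror_update() {
    file=$1
    tmp=$(mktemp) || return 1
    sed 's|rsync \$RSOPTS physionet\.org::physionet|rsync $RSOPTS --exclude robots.txt physionet.org::physionet|' \
        "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    [ -e "$file.orig" ] || cp -p "$file" "$file.orig"
    cat "$tmp" > "$file"    # overwrite in place, keeping owner and mode
    rm -f "$tmp"
}
```

Afterwards, compare the script with its .orig copy using diff to confirm that only the rsync line changed.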
Next, edit (or replace) /home/physionet/html/robots.txt so that its contents are:
User-agent: *
Disallow: /
Make sure that the edited version has the same ownership and file access permissions as the original (owned by "pn", readable by anyone).
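The replacement robots.txt can be written from the shell as well. In this sketch, write_robots is a hypothetical helper; it puts each directive on its own line, as the robots.txt format requires, and makes the file readable by anyone.

```shell
#!/bin/sh
# Write a robots.txt that asks all spiders to stay away from the whole site.
# Usage: write_robots /home/physionet/html/robots.txt
write_robots() {
    # Redirecting into an existing file keeps its owner and mode.
    printf 'User-agent: *\nDisallow: /\n' > "$1" || return 1
    chmod a+r "$1"    # readable by anyone
}
```

If the file did not exist beforehand and you created it as root, also run chown pn on it so that its ownership matches the original.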
The robots.txt protocol is advisory, not mandatory, so making this change may not eliminate all traffic from web spiders, but it should greatly reduce that traffic at the very least.
Very little if any maintenance is required once your mirror site has been established as described above. Your web server will begin generating access logs, which will be rotated periodically and eventually discarded. If your logs are stored on a file system with little free space, you may need to clear them manually when it fills up. (The names and locations of the logs are usually specified in httpd.conf. Under Fedora Linux, rotation and disposal of old log files are handled by the logrotate utility, run periodically by cron; no special setup is required unless you have renamed the log files in httpd.conf, or if you wish to keep old log files for more than four weeks.)
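To notice a filling log file system before it becomes a problem, you can check its usage from time to time. The sketch below uses df; the function names and the default threshold of 90 percent are assumptions, and the directory to check is whichever one your httpd.conf names (commonly /var/log/httpd under Fedora).

```shell
#!/bin/sh
# Print the percentage of space in use on the file system holding DIR.
fs_used_percent() {
    # df -P prints one POSIX-format line per file system; column 5 is "NN%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn (and return 1) when the file system holding DIR is at least
# LIMIT percent full. LIMIT defaults to 90.
check_log_space() {
    dir=$1
    limit=${2:-90}
    used=$(fs_used_percent "$dir") || return 1
    if [ "$used" -ge "$limit" ]; then
        echo "warning: file system for $dir is ${used}% full" >&2
        return 1
    fi
}
# Usage:
#   check_log_space /var/log/httpd 90
```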
PhysioNet itself is growing, and you should occasionally check to see that your mirror site has room to grow with PhysioNet. Given that the cost of a gigabyte of disk storage is continually decreasing as density and speed are continually increasing, there is little reason to purchase more storage than you will need in the next six months to one year.
If you wish to begin mirroring an optional PhysioNet volume, and you have not previously mirrored any of the optional volumes, it is best to fetch a fresh copy of the PhysioNet mirror kit (see above). Edit mirror.conf (which will not have been altered by fetching a fresh kit), and run configure and install again as above. If you do this, please avoid changing your mirror's hostname, location, and maintainer as recorded in mirror.conf, so that your e-mail notifications will continue to be properly recognized by the master PhysioNet server.
If you wish to move a mirror to another host, or simply to discontinue mirroring for whatever reason, use crontab -e to edit your crontab and remove the mirror-update command line. Your mirror will be removed from the Mirrors page after a few days of inactivity.
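As a non-interactive alternative to crontab -e, the mirror-update entry can be filtered out of the crontab with grep. This is a sketch under the assumption that every crontab line mentioning mirror-update should go; drop_mirror_update is a hypothetical helper.

```shell
#!/bin/sh
# Copy a crontab from stdin to stdout, dropping lines that mention
# mirror-update. grep exits with 1 when no lines remain, which is
# not an error here; any other non-zero status is passed on.
drop_mirror_update() {
    grep -v 'mirror-update' || [ $? -eq 1 ]
}
# Usage (as the user whose crontab schedules the updates):
#   crontab -l | drop_mirror_update | crontab -
```

Run crontab -l afterwards to confirm that the remaining entries are the ones you expect.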