PhysioNet
We encourage you to set up a mirror of PhysioNet if you wish. If you do so, please use rsync as described below to minimize the impact on other users.
If you simply wish to retrieve a (possibly large) set of files from PhysioNet without downloading them one at a time, you don't need to set up a mirror of PhysioNet; see the PhysioNet FAQ for instructions on downloading an entire PhysioBank database in one step using rsync, or use GNU wget to download any desired selection of files.
PhysioNet is organized in "volumes" that can be mirrored separately. All PhysioNet mirrors provide the "base volume", which contains all of the software, tutorials, reference manuals, publications, and many of the PhysioBank databases. Additional volumes, which are optional for the mirrors, contain very large data sets that don't fit in the base volume. The base volume, and optional volumes 2, 3, and 4, will not exceed 25 GB each. Volumes 5 through 8 will be allowed to grow up to one TB each. More than one volume can occupy a single sufficiently large disk partition.
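As a rough planning aid, the size limits above can be turned into a worst-case disk budget. The sketch below is an illustration only and is not part of the mirror kit: the function names `volume_limit_gb` and `mirror_budget_gb` are hypothetical, "1" is used here as a label for the base volume, and one TB is counted as 1000 GB.

```shell
#!/bin/sh
# Worst-case size limit, in GB, for one PhysioNet volume.
# Assumptions: "1" denotes the base volume; 1 TB is counted as 1000 GB.
volume_limit_gb() {
    case "$1" in
        1|2|3|4) echo 25 ;;
        5|6|7|8) echo 1000 ;;
        *) echo "unknown volume: $1" >&2; return 1 ;;
    esac
}

# Sum the limits for the volumes given as arguments.
mirror_budget_gb() {
    total=0
    for v in "$@"; do
        limit=$(volume_limit_gb "$v") || return 1
        total=$((total + limit))
    done
    echo "$total"
}

# Base volume plus optional volumes 2 and 3.
mirror_budget_gb 1 2 3
```

A partition that is to hold several volumes should be at least as large as the sum reported for those volumes.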
We use and recommend the following configuration for a PhysioNet web server:
All of the software needed is freely available open-source software. The cost of suitable hardware can be under US$500; of course it is possible to spend considerably more. If you cannot or do not wish to run Apache under Linux, many other configurations are possible, but we will not be able to help you troubleshoot your setup. Other versions of Linux or Unix, including Mac OS X, should be usable without difficulty. Although we do not recommend or support MS Windows, versions of all of the necessary software are freely available for MS Windows as optional components of Cygwin. The remainder of these notes assumes that you are using Fedora 7.
Currently, 100 to 1000 GB SATA drives are widely available and are usually least expensive per gigabyte. IDE (PATA) drives can also be used in many older PCs. SATA drives, and any drives larger than 127 GB, may require a controller card in PCs made before 2006. Most current PC motherboards include integrated SATA controllers, but many no longer support IDE drives.
Please let us know if you encounter any difficulties with this procedure!
If you are updating a mirror created before October 2007: Please replace your copy of /etc/httpd/conf/httpd.conf with a standard version (i.e., one that has not been customized for use with a PhysioNet mirror), since the previously used customizations will interfere with those that are installed as /etc/httpd/conf.d/pnm.conf by the procedure below. Also, please remove the line containing /usr/local/bin/mirror-physionet from your crontab; it will be replaced by a line that runs mirror-update when you run ./install in step 8 of the procedure below.
You will need to have root (administrative) privileges for some of the steps below. Once step 1 is complete, the remaining numbered steps can be finished in 10 to 15 minutes.
rsync -a physionet.org::mirror-setup mirror-setup
to verify that you are able to communicate with the PhysioNet master server, and to download a few short files for setting up your mirror, which will go into a subdirectory called mirror-setup within the current directory (e.g., /home/joe/mirror-setup). If mirror-setup doesn't exist already, rsync will create it.
cd mirror-setup
If you wish, read through the various files you have downloaded to see how they work, and then run:
./configure
The first time you run it, configure will ask a few questions, but it will remember your answers (in mirror.conf) and will not ask them again.
When the directories are ready, set the variables P2, P3, ... to the names of these directories by editing mirror.conf, and then run configure again.
./install
in order to schedule daily mirror and WFDB software updates, and purging of the temporary/cache directory. Once you have done this, these processes will begin automatically within 24 hours.
mirror-update -v
The -v option (which may be omitted) causes the updater to report each file transfer as it happens.
It may take several hours (or even days, if your Internet connection is slow) to retrieve the entire PhysioNet web site the first time. If the process is interrupted at any point, simply run mirror-update again, and it will continue from the point where it was interrupted. Subsequent updates (see below) will be much faster; if your mirror is reasonably up-to-date, the mirror script will typically finish in 1 to 10 minutes.
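Because an interrupted transfer resumes where it left off, the initial download can be wrapped in a simple retry loop. This is a sketch, not part of the mirror kit: `retry` is a hypothetical helper, the attempt count and delay are arbitrary, and it assumes that mirror-update exits with a nonzero status when a transfer fails.

```shell
#!/bin/sh
# retry MAX DELAY CMD...: run CMD until it succeeds or MAX attempts are used.
# DELAY is the pause, in seconds, between attempts.
retry() {
    max=$1; delay=$2; shift 2
    attempt=1
    while :; do
        "$@" && return 0
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (not run here): up to 5 attempts, 10 minutes apart.
# retry 5 600 mirror-update -v
```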
When the update is completed, mirror-update sends a report to physionet.org, where it is processed and incorporated into the PhysioNet Mirrors page. The report lists the URL and geographic location of your mirror, and the time at which it was most recently updated, to help PhysioNet visitors choose a suitable mirror.
In order to give you a chance to test and make any necessary adjustments to your new mirror site, it will not appear on the Mirrors page until it has been running for a few days and has been updated at least once.
Once your site is listed on the Mirrors page (or linked to from any other public web page), the web spiders of the major search engines, such as Google, will begin indexing it. Typically these spiders will consume a significant amount of bandwidth when they first visit your site, but this will decrease to a much lower amount once your site is fully indexed. You can avoid almost all of this traffic if you wish (for example, if your mirror shares a network connection, or if your total monthly throughput is limited or metered by your ISP). Unless the bandwidth consumed by the spiders is a problem, however, we recommend leaving them alone: it is useful for users to be able to find pages on your mirror using a search engine.
You can exclude most web spider traffic by modifying /home/physionet/html/robots.txt. Before doing this, modify your /usr/bin/mirror-update script by changing the line that reads:
rsync $RSOPTS physionet.org::physionet /home/physionet
to
rsync $RSOPTS --exclude robots.txt physionet.org::physionet /home/physionet
(This change will prevent the daily updates from replacing your customized robots.txt with the original one.)
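The same edit can be made non-interactively. The following is a sketch that assumes the script contains the text "rsync $RSOPTS physionet.org::physionet" verbatim; the function name `exclude_robots` is hypothetical, and the `.orig` backup suffix is an arbitrary choice.

```shell
#!/bin/sh
# exclude_robots FILE: add "--exclude robots.txt" to the rsync line in FILE.
# Assumes FILE contains "rsync $RSOPTS physionet.org::physionet" verbatim.
exclude_robots() {
    script=$1
    # Do nothing if the option is already present.
    if grep -q -- '--exclude robots.txt' "$script"; then
        return 0
    fi
    # Keep a backup, then rewrite the file in place from the backup.
    cp -p "$script" "$script.orig" &&
    sed 's|rsync \$RSOPTS physionet\.org::physionet|rsync $RSOPTS --exclude robots.txt physionet.org::physionet|' \
        "$script.orig" > "$script"
}

# Example (run as root; not run here):
# exclude_robots /usr/bin/mirror-update
```

Running the function a second time leaves the script unchanged, so it is safe to repeat after reinstalling the mirror kit.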
Next, edit (or replace) /home/physionet/html/robots.txt so that its contents are:
User-agent: *
Disallow: /
Make sure that the edited version has the same ownership and file access permissions as the original (owned by "joe", readable by anyone).
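One way to meet both requirements is to overwrite the existing file in place, since shell redirection into an existing file leaves its owner and permission bits unchanged. A minimal sketch (the function name `write_robots` is hypothetical):

```shell
#!/bin/sh
# write_robots FILE: overwrite FILE with a "disallow everything" robots.txt.
# Redirecting into an existing file keeps its owner and permission bits.
write_robots() {
    printf 'User-agent: *\nDisallow: /\n' > "$1"
}

# Example (not run here):
# write_robots /home/physionet/html/robots.txt
# ls -l /home/physionet/html/robots.txt    # confirm owner and mode are unchanged
```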
The robots.txt protocol is advisory, not mandatory, so making this change may not eliminate all traffic from web spiders, but it should greatly reduce that traffic at the very least.
Very little if any maintenance is required once your mirror site has been established as described above. Your web server will begin generating access logs, which will be rotated periodically and eventually discarded. If your logs are stored on a file system with little free space, you may need to clear them manually if that file system fills up. (The names and locations of the logs are usually specified in httpd.conf. Under Fedora Linux, rotation and disposal of old log files is handled by the logrotate utility, run periodically by cron; no special setup is required unless you have renamed the log files in httpd.conf, or you wish to keep old log files for more than four weeks.)
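A simple check can warn you before the log file system fills up. The sketch below is an illustration under assumptions: `fs_use_percent` is a hypothetical helper, the 90% threshold is arbitrary, and /var/log/httpd is only an assumed log location (check httpd.conf for the one your server actually uses).

```shell
#!/bin/sh
# fs_use_percent DIR: print how full the file system holding DIR is (0-100).
fs_use_percent() {
    # df -P prints one line per file system; field 5 is the capacity as "NN%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the log file system is at least 90% full.
# /var/log/httpd is an assumed location; check httpd.conf for the real one.
LOGDIR=${LOGDIR:-/var/log/httpd}
if [ -d "$LOGDIR" ] && [ "$(fs_use_percent "$LOGDIR")" -ge 90 ]; then
    echo "warning: file system holding $LOGDIR is nearly full" >&2
fi
```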
PhysioNet itself is growing, and you should occasionally check to see that your mirror site has room to grow with PhysioNet. Given that the cost of a gigabyte of disk storage is continually decreasing as density and speed are continually increasing, there is little reason to purchase more storage than you will need in the next six months to one year.
If you wish to begin mirroring an optional PhysioNet volume, and you have not previously mirrored any of the optional volumes, it is best to fetch a fresh copy of the PhysioNet mirror kit (see above). Edit mirror.conf (which will not have been altered by fetching a fresh kit), and run configure and install again as above. If you do this, please avoid changing your mirror's hostname, location, and maintainer as recorded in mirror.conf, so that your e-mail notifications will continue to be properly recognized by the master PhysioNet server.
If you wish to move a mirror to another host, or simply to discontinue mirroring for whatever reason, use crontab -e to edit your crontab and remove the mirror-update command line. Your mirror will be removed from the Mirrors page after a few days of inactivity.
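If you prefer not to edit the crontab interactively, the entry can be filtered out instead. This is a sketch rather than part of the mirror kit: `drop_cron_lines` is a hypothetical helper, and the backup file name is arbitrary.

```shell
#!/bin/sh
# drop_cron_lines PATTERN: copy stdin to stdout, omitting every line that
# contains the fixed string PATTERN.
drop_cron_lines() {
    # grep exits with status 1 when no lines remain; treat that as success.
    grep -v -F -- "$1" || [ $? -eq 1 ]
}

# Example (not run here): keep a backup, then install the filtered crontab.
# crontab -l > crontab.bak
# drop_cron_lines mirror-update < crontab.bak | crontab -
```

Keeping the backup makes it easy to restore the entry later with `crontab crontab.bak` if you resume mirroring.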
Your comments and suggestions are welcome. We encourage you to use our feedback form to comment on this page. If you would like to receive a reply, please send your comments by email to webmaster@physionet.org, or post them to: MIT Room E25-505A, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
Updated Saturday, 29-Dec-2007 11:04:14 EST