\documentclass[twoside]{article} \usepackage{rawfonts} \makeatletter \IfFileExists{times.sty}{\usepackage{times}}{\@missingfileerror{times}{sty}} \makeatother \usepackage{fancyhdr} \oddsidemargin 0.1in \evensidemargin -0.1in \topmargin -0.5in \textheight 650pt \footskip 48pt \setlength{\textwidth}{6.375in} \pagestyle{fancy} \def\headrulewidth{0pt} \lhead{\rm{}Evaluating ECG Analyzers} \chead{\rm{}WFDB Applications Guide} \rhead{\rm{}Evaluating ECG Analyzers} \lfoot[\rm\thepage]{\rm{}WFDB VERSION} \cfoot{\rm{}LONGDATE} \rfoot[\rm{}WFDB VERSION]{\rm\thepage} \title{Evaluating ECG Analyzers} \author{George B. Moody\\ Harvard-MIT Division of Health Sciences and Technology, Cambridge, MA, USA} \date{} \begin{document} \setcounter{page}{FIRSTPAGE} \maketitle \section*{Summary} This paper describes how to evaluate an automated ECG analyzer using available annotated ECG databases and software, in compliance with standard evaluation protocols. These protocols have been adopted as parts of the {\em American National Standard for Ambulatory Electrocardiographs} (ANSI/AAMI EC38:1998, and its predecessor, ANSI/AAMI EC38:1994), and the {\em American National Standard for Testing and Reporting Performance Results of Cardiac Rhythm and ST Segment Measurement Algorithms} (ANSI/AAMI EC57:1998). They include earlier evaluation protocols developed for an AAMI Recommended Practice, {\em Testing and Reporting Performance Results of Ventricular Arrhythmia Detection Algorithms} (AAMI ECAR, 1987). This paper will be most useful to readers who plan to use the suite of evaluation software included in the WFDB Software Package ({\tt http://www.\-physio\-net.\-org/\-physio\-tools/\-wfdb.\-shtml}); this suite of software includes the reference implementations of the evaluation protocols specified in EC38 and EC57.
\section{Introduction} Continuous monitoring of the electrocardiogram in both inpatients and ambulatory subjects has become a very common procedure during the past thirty years, with diverse applications ranging from screening for cardiac arrhythmias or transient ischemia, to evaluation of the efficacy of antiarrhythmic drug therapy, to surgical and critical care monitoring. Since the first intensive care units were established in the 1960s, the need for automated data reduction and analysis of the ECG has been apparent, motivated by the very large amount of data that must be analyzed (on the order of $10^{5}$ cardiac cycles per patient per day). As clinical experience has led to the identification of more and more prognostic indicators in the ECG, clinicians have demanded and received increasingly sophisticated automated ECG analyzers. The early heart rate monitors rapidly evolved into devices that were designed first to detect ventricular fibrillation, then other ``premonitory'' ventricular arrhythmias. Many newer devices attempt to detect supraventricular arrhythmias and transient ischemic ST changes. Visual analysis of the ECG is far from simple. Accurate diagnosis of ECG abnormalities requires attention to subtle features of the signals, features that may appear only rarely, and which are often obscured by or mimicked by noise. Diagnostic criteria are complicated by inter- and intra-patient variability of both normal and abnormal ECG features. Given these considerations, it is not surprising that developers are faced with a difficult task in the design of algorithms for automated ECG analysis, and that the results of their efforts are imperfect. Certain parts of the problem --- QRS detection in the absence of noise, for example --- are well-solved by most current algorithms; others --- detection of supraventricular arrhythmias, for example --- remain exceedingly difficult. 
Just as we may find it easiest to analyze ``textbook'' examples, automated ECG analyzers may perform better while analyzing the recordings used during their development than when applied to ``real-world'' signals. Since automated ECG analyzers vary in performance, and since their performance is dependent on the characteristics of their input, quantitative evaluations of these devices are essential in order to assess the usefulness of their outputs. At one extreme, a device's outputs in the context of a particular type of signal may be so unreliable as to be worthless; unfortunately, the other extreme --- an output so reliable it can be accepted uncritically --- is not a characteristic of any existing monitor, nor can it be expected in the future. \subsection{ECG Databases} Several databases of ECG recordings are generally available for evaluating ECG analyzers. They serve several important needs: \begin{itemize} \item They contain {\em representative} signals. Wide variations in ECG characteristics among subjects severely limit the value of synthesized waveforms for testing purposes. Realistic tests of ECG analyzers require large sets of ``real-world'' signals. \item They contain {\em rarely observed but clinically significant} signals. Although it is not particularly difficult to obtain recordings of common ECG abnormalities, often those that are most significant are rarely recorded. Both developers and evaluators of ECG analyzers need examples of such recordings. \item They contain {\em standard} signals. System comparisons are meaningless unless performance is measured using the same test data in each case, since performance is so strongly data-dependent. \item They contain {\em annotated} signals. Typically, each QRS complex has been manually annotated by two or more cardiologists working independently. The {\em reference} annotations produced as a result serve as a ``gold standard'' against which a device's analysis can be compared quantitatively. 
\item They contain {\em digitized, computer-readable} signals. It is therefore possible to perform a fully automated, strictly reproducible test in the digital domain if desired, allowing one to establish with certainty the effects of algorithm modifications on performance. \end{itemize} Standards EC38 and EC57 require the use of the following ECG databases:\footnote{Sources: ECRI, 5200 Butler Pike, Plymouth Meeting, PA 19462 USA (AHA DB); PhysioNet (http://physionet.org/) (MIT, NST, CU DB; and ESC DB for non-commercial use); Alessandro Taddei, CNR Institute of Clinical Physiology, G. Pasquinucci Heart Hospital, via Aurelia Sud, 54100 Massa, Italy (ESC DB for commercial use).} \begin{itemize} \item {\bf AHA DB}: The American Heart Association Database for Evaluation of Ventricular Arrhythmia Detectors (80 records, 35 minutes each) \item {\bf MIT DB}: The Massachusetts Institute of Technology--Beth Israel Hospital Arrhythmia Database (48 records, 30 minutes each) \item {\bf ESC DB}: The European Society of Cardiology ST-T Database (90 records, two hours each) \item {\bf NST DB}: The Noise Stress Test Database (12 records, 30 minutes each) \item {\bf CU DB}: The Creighton University Sustained Ventricular Arrhythmia Database (35 records, 8 minutes each) \end{itemize} Each of these databases represents a very substantial effort by many workers; in particular, the AHA, MIT, and ESC databases each required more than five years of sustained effort by large teams of researchers and clinicians from many institutions. Nevertheless, it should be recognized that even these databases do not fully represent the variety of ``real-world'' ECGs observed in clinical practice. Although these databases permit standardized, quantitative, automated, and fully reproducible evaluations of analyzer performance, it is risky to extrapolate from the results of such evaluations to expectations of real-world performance. 
Such extrapolations can be particularly error-prone if the evaluation data were also used for development of the analysis algorithm, since the algorithm may have been (perhaps unintentionally) ``tuned'' to its training set. It should also be noted that the first four of the databases listed above were obtained from Holter ECG recordings; although the frequency response of the Holter recording technique is not usually a limiting factor in the performance of an ECG analyzer, it may tend to favor devices that are designed to analyze Holter recordings over devices that have been designed to analyze higher-fidelity input signals. \subsection{Evaluation Protocols} Between 1984 and 1987, the Association for the Advancement of Medical Instrumentation (AAMI) sponsored the development of a protocol for the use of the first two of these databases, which was published as an AAMI Recommended Practice.\footnote{{\it Testing and Reporting Performance Results of Ventricular Arrhythmia Detection Algorithms}. Publication AAMI ECAR (1987); succeeded by ANSI/AAMI EC57:1998, available from AAMI, 1110 N Glebe Road, Suite 220, Arlington, VA 22201 USA.} Between 1990 and 1998, the ambulatory ECG subcommittee of the AAMI ECG committee developed and revised a standard for ambulatory ECG monitors, significant portions of which address the issue of the accuracy of automated analysis performed by some of these devices.\footnote{{\it American National Standard for Ambulatory Electrocardiographs}. Publication ANSI/AAMI EC38:1998; available from AAMI (address above).} The ambulatory ECG standard EC38:1998, and the ``testing and reporting performance results'' standard EC57:1998, build on the evaluation protocol adopted for the earlier Recommended Practice (ECAR), incorporating provisions for the use of all five of the databases listed above, with extensions for assessing detection of supraventricular arrhythmias and transient ischemic ST changes. 
The standard breaks new ground in establishing specific reporting requirements for the performance of automated ECG analyzers on standard tests using the databases listed above. A significant constraint imposed on evaluators by the EC38 standard is that they must obtain annotation files containing the analysis results of the device under test. Although the device itself need not produce these files, EC38 specifically requires that they be produced by an automated procedure, which must be fully disclosed. The intent of this requirement is to permit reproducible independent evaluations in which neither the proprietary data of the developers (the analysis algorithms) nor that of the evaluators (the test signals and reference annotations) need necessarily to be disclosed. By defining the interface between the developer and the evaluator to be the annotation file, the responsibilities of each party are clearly defined: the developer must make certain that the device's outputs are recorded in the annotation file in the manner intended by the developer, but in the language of the standard; the evaluator must make certain that the algorithms used to compare the device's annotation files with the reference annotation files conform to the specification of the standard. The format and content of these annotation files is specified in detail below. For many existing devices, it may be difficult or impossible to obtain such annotation files without the cooperation of the developers. Newly-designed devices should incorporate the necessary ``hooks'' for producing annotation files. \subsection{Software to Support Evaluations} This paper describes a suite of programs that support evaluations of automated ECG analyzers in accordance with the methods described in the EC38 and EC57 standards (as well as those in the earlier ECAR Recommended Practice). These methods are sufficiently complex that the development of such a suite of programs is not an afternoon's work. 
By making generally available reference implementations of the evaluation algorithms, much needless duplication of effort may be avoided. By circulating them in source form to other users, we may hope to find and correct any bugs, with the eventual result that evaluators of devices should not have to bear the burden of evaluating the evaluation technique itself. By using them for evaluations, any ambiguities in the English specification of the evaluation algorithms are resolved in a consistent manner for each device tested. These programs are written in C and run under MS-DOS or UNIX. They have been made available as part of the WFDB Software Package. In this paper, the names of these programs are printed {\tt like this}. \section{Evaluating an ECG Analyzer} The major task facing an evaluator is that of presenting the reference signals to the device under test, and collecting annotation files from the device. The details of this task will vary for each device, but a few general hints are given below. A second task, that of obtaining reference heart rate measurements, should be a much simpler job. Once all of this information has been gathered, the remaining work required --- that of comparing the device's analysis against the ``gold standard'' --- can be performed automatically. \subsection{Presenting Signals to the Analyzer} Two distinctly different types of tests are possible. If the device can accept digital inputs, the reference signals can be supplied in that form (perhaps after resampling with {\tt xform} to convert the digitized samples to the expected sampling frequency and numerical range, and possibly with additional digital signal processing to simulate the signal conditioning normally performed by the device's front-end data acquisition hardware). The primary advantage of testing in the digital domain is that the test is (or should be) strictly reproducible, since no noise or additional quantization error can be introduced in this way. 
This method usually avoids the issue of synchronization of the test annotations with the reference signals discussed below. Testing in the analog domain requires that analog signals be recreated from the digital signals. (It should be noted that even the analog versions of the MIT and AHA databases that have been available in the past were recreated from the digitized signals by the database developers.) The advantage of this approach is that it exercises the entire system, including the front-end data acquisition hardware. It is often difficult, however, to establish synchronization between the signal source and the analyzer, needed in order to permit comparisons of annotations. One way of dealing with this problem is to arrange for the analyzer's sampling clock to trigger the digital-to-analog converter used to recreate the analog signals, or to arrange for an external clock to trigger both D/A conversion in the playback system and A/D conversion in the analyzer. Another method is to begin and end the signal generation process by delivering signals from the analyzer to the playback device, and recording the analyzer's clock time at the times of the signals; assuming that both the analyzer and the playback device have stable clocks, event times in the analyzer's frame of reference can be converted to database sample numbers by linear interpolation. The WFDB software package includes a program ({\tt sample}) that uses a Microstar DAP 2400-series analog interface board\footnote{ Source: Microstar Laboratories, {\tt http://www.mstarlabs.com/}. External analog anti-aliasing filters (to reduce ``staircasing'') and attenuators (to obtain patient-level signals) may also be required, depending on the system to be evaluated. DAP boards can also be used with {\tt sample} to create new database records.} and an MS-DOS PC to recreate analog signals from digital database records on CD-ROMs or magnetic disk files. 
\subsection{Obtaining Test Annotation Files} For any ambulatory ECG monitor that incorporates automated analysis functions, the EC38 standard requires the manufacturer to implement and disclose a method for producing test annotation files. Independent evaluators should seek assistance from the manufacturer in any case, since the manufacturer's interpretation of the device's outputs in the language of EC38 is definitive (in effect, the annotation file generation technique becomes part of the system under test). Note that generation of annotation files need not be synchronous with data acquisition; a device might conceivably store all of the necessary data until the end of the test, and only then write the file. Neither does the standard require that an annotation be determined within any fixed amount of time, as would be expected of devices designed to trigger pacing, for example. Furthermore, EC38 specifically allows for the possibility that the device under test might not produce the annotation file directly. If any external hardware or software is required to do so, however, it must be made generally available or specified in sufficient detail by the manufacturer to permit an independent evaluator to obtain test annotation files. Annotation files contain a label (an annotation) for each beat and for certain other features of the signals, such as rhythm and ST changes. Annotations are stored in time order in annotation files. The ``time'' of an annotation is that of the sample in the signal file with which the annotation is associated.\footnote{Times in annotation and signal files are usually expressed as {\em sample numbers} (the number of samples in the signal file that precede the sample in question).} The WFDB library (included in the WFDB software package) includes C-callable functions ({\tt getann} and {\tt putann}) for reading and writing annotations.
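In outline, each annotation handled by these functions pairs a sample number with a type code, a subtype, and optional text, and annotations must be presented to {\tt putann} in time order. The stand-in structure below is deliberately {\em not} the real WFDB declaration (its field sizes and names are illustrative only); it exists so that the time-ordering rule can be shown in a self-contained way:

```c
/* Illustrative stand-in for an annotation record: a sample number, a
 * type code, a subtype, and optional aux text.  This is NOT the real
 * WFDB structure. */
typedef struct {
    long time;       /* sample number the annotation refers to */
    int  anntyp;     /* annotation type code */
    int  subtyp;     /* annotation subtype */
    char aux[64];    /* optional text ("" if unused) */
} toy_ann;

/* A writer can verify the time-ordering invariant before emitting an
 * annotation stream: each annotation's time must be no earlier than
 * its predecessor's. */
int in_time_order(const toy_ann *anns, int n)
{
    int i;
    for (i = 1; i < n; i++)
        if (anns[i].time < anns[i - 1].time)
            return 0;
    return 1;
}
```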
In a C program, annotations appear as data structures containing a 32-bit {\tt time} field together with a pair of 8-bit fields that encode the annotation type and sub-type ({\tt anntyp} and {\tt subtyp} [sic], respectively), and a variable-length {\tt aux} field usually used to store text. In annotation files, these annotation structures are usually stored in a variable-length bit-packed format averaging slightly more than 16 bits per annotation.\footnote{Test annotations that include heart rate or ST measurements require substantially more storage. {\tt getann} and {\tt putann} can also use the original AHA DB format (containing fixed-length annotations, 16 bytes each), but this format should not be used for evaluations of devices that incorporate ST analysis functions, since the space available for the {\tt aux} data is too small to store ST measurements.} Test annotation files may include the following: \begin{itemize} \item {\em Beat annotations}. These need not coincide precisely with the reference beat annotations, since the evaluation protocol allows a time difference of up to 150 ms between each pair of matching beat annotations. All beat annotations are mapped during the evaluation process into the set \{ N, V, F, S, Q \} (corresponding to normal, ventricular ectopic, ventricular fusion, supraventricular ectopic, and unclassifiable or paced beats respectively); devices need not be capable of producing all of these annotations, but any beat annotations that they do produce will be translated into one of these types. The standard specifies the mapping used for the {\tt anntyp} values defined in {\tt <wfdb/ecgcodes.h>}. (This file is included in the WFDB Software Package.) Any beat annotations that appear in the first five minutes of a record (the ``learning period'') are ignored in the evaluation process. The remainder of the record (the ``test period'') must be fully annotated.
Note in particular that the last beat of some records may be very close to the last sample; since the analyzer may reach the end of the input signals before producing an annotation for the last beat, it may be necessary to ``pad'' the input data for a few seconds at the end of the record to permit the analyzer to emit its final beat annotation. \item {\em Shutdown annotations}. If the device suspends its analysis because of poor signal quality or for any other reason, it should mark the periods during which analysis is suspended. The evaluation software tallies beats missed during such periods separately from beats missed at other times. The beginning of each period of shutdown may be marked using a {\tt NOISE} annotation with ${\tt subtyp} = -1$, and the end of each period of shutdown may be marked using a {\tt NOISE} annotation with ${\tt subtyp} = 0$ (see the source for {\tt bxb} for notes on other acceptable methods of marking shutdown). \item {\em Ventricular fibrillation annotations}. The beginning and end of each detected episode of ventricular fibrillation should be marked using {\tt VFON} and {\tt VFOFF} annotations. \item {\em Other rhythm annotations}. These should include {\tt RHYTHM} annotations marking the beginning and end of each detected episode of atrial fibrillation. The beginning of each episode should be marked with an ``{\tt (AFIB}'' rhythm annotation, i.e., an annotation with {\tt anntyp} = {\tt RHYTHM} and {\tt aux} = \verb|"\05(AFIB"|, where ``\verb|\05|'' is C notation for a byte with the value 5 (ASCII control-E). Non-empty {\tt aux} fields always begin with a byte that specifies the number of data bytes that follow; in this case, the five characters ({\tt ( A F I B}) of the string. The end of each episode should be marked with any other rhythm annotation (for example, \verb|"\02(N"|). \item {\em Heart rate measurements}. 
Each type of heart rate measurement (including any heart rate or RR interval variability measurements) made by the device under test should be assigned a measurement number, $m$, between 0 and 127. A {\tt MEASURE} annotation should be recorded for each heart rate measurement, with ${\tt subtyp} = m$ and with the measurement in the {\tt aux} field, as an ASCII-coded decimal number. \item {\em ST deviation measurements}. If available, these should be provided in the {\tt aux} fields of beat annotations, as ASCII-coded decimal numbers indicating the deviations in microvolts from reference levels established for each signal from the first 30 seconds of each record. For example, ``{\tt 25 -104}'' indicates a 25 $\mu$V elevation in signal 0 and a 104 $\mu$V depression in signal 1. If ST measurements are omitted from any beat annotation, the evaluation software assumes they are unchanged from their previous values. \item {\em Ischemic ST change annotations}. These {\tt STCH} annotations should mark the beginning and end of each detected episode of ischemic ST change. ST change annotations have additional information in the {\tt aux} field as for rhythm annotations: the beginning of each episode is marked by an ``{\tt (ST}{\it ns}'' annotation, and the end of each episode by a ``{\tt ST{\it ns})}'' annotation, where {\it n} indicates the signal affected (``{\tt 0}'' or ``{\tt 1}''), and {\it s} indicates ST elevation (``{\tt +}'') or depression (``{\tt -}''). {\it n} may be omitted if the episode detection criteria depend on features of both signals. The extremum of each episode may optionally be marked with an ``{\tt AST}{\it nsm}'' annotation, where {\it n} and {\it s} are defined as above, and {\it m} is the ST deviation in microvolts, relative to a reference level established as above. \item {\em Comment annotations}. Annotations with {\tt anntyp = NOTE} and any desired string data in {\tt aux} may be included anywhere in an annotation file. 
{\tt NOTE} annotations are ignored by the standard evaluation software; they may be used, for example, to record the values of internal algorithm variables for debugging purposes. \end{itemize} Note that only beat annotations are absolutely required in test annotation files. ST deviation measurements within beat annotations, and the other types of annotations listed above, only need to be recorded for devices that are claimed by their manufacturers to provide optional features for detection of ventricular or atrial fibrillation, measurement of ST deviations, or detection of ischemic ST changes. If the time units in the test annotation files are not the same as those in the reference annotation files (for example, because {\tt xform} was used to change the sampling frequency of the signal files in a digital-domain test), the time units must be rescaled before proceeding with the comparison. This may be done by using {\tt xform} to rewrite the test annotation files with the original sampling frequency.\footnote{ The obvious alternative, using {\tt xform} to rewrite the reference annotation files at the time the signal files are resampled, should not be used in a formal evaluation. Because of the possibility that resampling the reference annotation files might result in moving reference annotations into or out of the test period, or changing the lengths of episodes, doing so might produce results that could not be directly compared with those obtained in a standard evaluation.} Details of the ST deviation measurement and episode detection criteria used in producing the reference annotation files for the ESC database may be found in several sources.\footnote{ See, for example, the {\it European ST-T Database Directory}, pp. 
vi-vii, supplied with the ESC DB; or Taddei, A., et al., ``The European ST-T database: development, distribution, and use'', {\it Computers in Cardiology} {\bf 17}:177-180 (1990).} Note, however, that many techniques for measuring ST deviation and for detecting transient ischemic ST changes are possible, and that to date the best evaluation results have been obtained for analyzers using criteria that do not attempt to mimic those used by the human experts who annotated the database. \subsection{Obtaining Reference Heart Rate Data} The final step of preparation for the evaluation is to process the reference annotation files to obtain reference heart rate annotation files. These files must contain heart rate measurement annotations with the same measurement numbers assigned as for the test heart rate annotations; they need not necessarily contain beat or other annotations from the reference annotation files. Quoting from EC38, \begin{quote} To evaluate the accuracy of heart rate measurement, the evaluator shall implement and disclose a method for obtaining heart rate measurements using the reference annotation files (the `reference heart rate'). This method need not be identical to the method used by the device under test, but in general it will be advantageous if it matches that method as closely as possible. \end{quote} It will generally be in the manufacturer's interest to provide a program for generating reference heart rate annotation files, to avoid the need for an independent evaluator to do so, with a likely result of less than optimal agreement with the test heart rate measurements. The WFDB software package includes a sample implementation of such a program ({\tt examples/refhr.c}); note that it will need to be customized for each device to be tested. Note that measurement errors are normalized by the mean value of the reference measurements in each record. 
Be certain that this mean value cannot be zero!\footnote{ For certain types of HRV or RRV measurements (though not for heart rate measurements), this is a potential problem. One solution is to add a small positive offset to any measurement with an expected zero mean. It is within the letter, though not the spirit, of the standard protocol, to add a very large number in such a case, so as to make the error percentage arbitrarily small. The mean value of the reference measurements must be reported; this should serve as a disincentive to this sort of creative abuse of the standard. An honest approach might be to add an offset on the order of the expected standard deviation of the individual measurements.} \section{Comparing Annotation Files} Once the test annotation files and the reference heart rate annotation files have been obtained, the remainder of the evaluation procedure is straightforward. All of the information needed to characterize the analysis performed by the device under test is encoded in the test annotation files; similarly, all of the information needed to characterize the actual contents of the test signals is encoded in the reference annotation and reference heart rate annotation files. The evaluation procedure thus entails comparison of the test and reference annotation files for each record. Four programs are provided in the WFDB Software Package for this purpose: \begin{itemize} \item {\tt bxb} compares annotation files beat by beat; its output includes QRS, VEB, and (optionally) SVEB sensitivity and positive predictivity, as well as RR interval error and shutdown statistics. \item {\tt rxr} compares annotation files run by run; its output includes ventricular (and, optionally, supraventricular) ectopic couplet, short run (3--5 beats), and long run (6 or more beats) sensitivity and positive predictivity. 
\item {\tt epicmp} compares annotation files episode by episode; its output includes ventricular fibrillation, atrial fibrillation, and ischemic ST detection statistics as well as comparisons of ST deviation measurements. \item {\tt mxm} compares measurements from a test annotation file and a reference heart rate annotation file; its output indicates measurement error.\footnote{ {\tt mxm} is not restricted to comparison of heart rate measurements; if other types of measurements are available, they may be compared in the same manner as heart rates by {\tt mxm}.} \end{itemize} The WFDB Software Package also includes three related programs: \begin{itemize} \item {\tt sumstats} reads certain output files generated by {\tt bxb}, {\tt rxr}, {\tt epicmp}, and {\tt mxm}, and calculates aggregate statistics for a set of records. \item {\tt plotstm} generates scatter plots of ST deviation measurements collected by {\tt epicmp}. \item {\tt ecgeval} automates the entire comparison procedure by running {\tt bxb}, {\tt rxr}, {\tt epicmp}, and {\tt mxm} for each record, collecting their output, then running {\tt sumstats} (and optionally {\tt plotstm}), and finally printing the results. \end{itemize} To obtain a concise summary of how to use any of these programs, including a list of any command-line options, simply run the program without any command-line arguments. Refer to the {\it WFDB Applications Guide}, which accompanies the WFDB Software Package, for details. In most cases, it will be easiest to collect all of the annotation files before beginning the comparison, and then to perform the comparison by typing: \begin{verbatim} ecgeval \end{verbatim} The program asks for the test annotator name, the names of the databases used for testing, and what optional detector outputs should be evaluated. Only the statistics required by EC38 and EC57 are reported by {\tt ecgeval}. 
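As a concrete illustration of what the beat-by-beat statistics reported by {\tt bxb} reduce to, the sketch below pairs reference and test beat times using the 150 ms matching window described earlier, then derives QRS sensitivity and positive predictivity from the counts. The greedy pairing and the function names are ours, and the real EC57 matching rules handle several additional cases; this is only a minimal model:

```c
/* Pair reference and test beat annotation times (both in ascending
 * order, in samples).  Beats match when they differ by no more than
 * win samples (150 ms at the record's sampling frequency).  Returns
 * the number of matched pairs (true positives); unmatched reference
 * beats are false negatives, unmatched test beats false positives. */
int match_beats(const long *ref, int nref,
                const long *tst, int ntst, long win)
{
    int i = 0, j = 0, tp = 0;
    while (i < nref && j < ntst) {
        long d = tst[j] - ref[i];
        if (d < -win)      j++;          /* extra test beat */
        else if (d > win)  i++;          /* missed reference beat */
        else { tp++; i++; j++; }         /* match within the window */
    }
    return tp;
}

/* QRS sensitivity = TP/(TP+FN); positive predictivity = TP/(TP+FP). */
double qrs_sensitivity(int tp, int nref) { return (double)tp / nref; }
double qrs_pos_pred(int tp, int ntst)    { return (double)tp / ntst; }
```

With a 360 Hz record, the 150 ms window corresponds to 54 samples.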
If more detailed evaluation data are needed, it will be necessary to run {\tt bxb}, {\tt rxr}, etc., separately. If file space is extremely limited, it may be necessary to delete each test annotation file after it has been compared against the reference file, before the next test annotation file can be created; in such cases, it may also be necessary to prompt the user to change media containing signal or reference annotation files, or to reset the device under test before beginning each record. Optionally, {\tt ecgeval} can generate a script (batch) file of commands, which can be edited to accommodate special requirements such as these. For example, suppose we have obtained a set of test annotation files with the annotator name ``{\tt yow}'', which we wish to compare against the reference annotation files (annotator name ``{\tt atr}'')\footnote{ Annotation files for any given record are distinguished by annotator names, which correspond to the ``extension'' of the file name. The reference annotation files supplied with the databases have the annotator name ``{\tt atr}'' (originally ``{\tt atruth}'' because ``{\tt a}'' was intended to indicate the file type, and ``{\tt truth}'' because \ldots well, because the annotations are supposed to be The Truth).} and reference heart rate annotation files (annotator name ``{\tt htr}''). The portion of the evaluation script generated by {\tt ecgeval} for MIT DB record 100 is: \begin{verbatim} bxb -r 100 -a atr yow -L bxb.out sd.out rxr -r 100 -a atr yow -L vruns.out sruns.out mxm -r 100 -a htr yow -L hr0.out -m 0 epicmp -r 100 -a atr yow -L -A af.out -V vf.out -S st.out stm.out \end{verbatim} (The last two lines shown above form a single command. 
The {\tt mxm} command gathers statistics on measurement number 0; if other heart rate measurements are defined, {\tt mxm} should be run once for each such measurement, substituting the appropriate measurement numbers for {\tt 0} in the output file name, {\tt hr0.out}, and the final argument.) Statistics for the remainder of the MIT DB are obtained by repeating these commands, substituting in each the appropriate record names for {\tt 100}. Once these commands have been run for all of the records, the record-by-record statistics will be found in nine files ({\tt bxb.out}, {\tt sd.out}, {\tt vruns.out}, {\tt sruns.out}, {\tt hr0.out}, {\tt af.out}, {\tt vf.out}, {\tt st.out}, and {\tt stm.out}). The first eight of these files contain one line for each record.\footnote{ {\tt stm.out} contains one line for each ST deviation measurement that was compared; in this example, {\tt stm.out} would be empty since the reference annotation files of the MIT DB do not contain ST deviation measurements.} {\tt sumstats} reads any of these files and calculates aggregate performance statistics; to use it, type ``{\tt sumstats} {\it file}'', where {\it file} is the name of one of these files. The output of {\tt sumstats} contains a copy of its input, with aggregate statistics appended to the end. Typically this output might be saved in a file to be printed later, e.g., \begin{verbatim} sumstats bxb.out >>report.out \end{verbatim} A scatter plot of the ST measurement comparisons performed by {\tt epicmp} can be produced using {\tt plotstm}, the output of which can be printed directly on any PostScript printer. For example, to make a plot file for {\tt stm.out}, type: \begin{verbatim} plotstm stm.out >stm.ps \end{verbatim} \section{Studying Discrepancies} Once an evaluation has been conducted as described above, a common question is ``what were the errors?'' {\tt bxb} and {\tt rxr} can help answer such questions.
{\tt bxb} can generate an output annotation file (with annotator name ``{\tt bxb}'') in which all matching beat annotations are copied from the test annotation file, and each mismatch is indicated by a {\tt NOTE} annotation, with the {\tt aux} field indicating the element of the confusion matrix in which the mismatch is tallied (e.g., ``{\tt Vn}'' represents a beat called a VEB by the reference annotator and a normal beat by the test annotator). Programs such as {\tt wave}\footnote{ {\tt wave} (for FreeBSD, Linux, Mac OS X, Solaris, SunOS, and Windows) is included in the WFDB Software Package.} can be used to search for and display the waveforms associated with the mismatches. To generate an output annotation file, add the {\tt -o} option to the {\tt bxb} command line, as in: \begin{verbatim} bxb -r 100 -a atr yow -L bxb.out sd.out -o \end{verbatim} A particularly useful way to document an evaluation is to print a full disclosure report with {\tt bxb} output annotations, using the program {\tt psfd} (also included in the WFDB Software Package). This may be accomplished by preparing a file containing a list of the names of the records to be printed (call it {\tt list}), and then using the command: \begin{verbatim} psfd -a bxb list >output.ps \end{verbatim} The file {\tt output.ps} can be printed on any PostScript printer. Run {\tt psfd} without any arguments for a summary of its (numerous) options; try a short test before making a large set of printouts, which can take a long time. Both {\tt bxb} and {\tt rxr} accept a {\tt -v} option to run in ``verbose'' mode, in which each discrepancy is reported on the standard error output. When running {\tt rxr}, this feature is useful for finding missed and falsely detected ectopic couplets and runs.
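Because verbose reports go to the standard error output, they can be captured with an ordinary shell redirection for later review. In this sketch the log file name is arbitrary, and the record and annotator names are those of the earlier examples:
\begin{verbatim}
# Save each missed or falsely detected couplet and run reported
# by rxr for record 100 in a log file for later study.
rxr -r 100 -a atr yow -L vruns.out sruns.out -v 2>rxr-verbose.log
\end{verbatim}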
\section{Acknowledgements} Having been involved in the production of most of the databases as well as in the design of the evaluation protocols, I have had the privilege of benefiting from the sustained contributions of many colleagues who have supported these projects with their dedicated efforts. I would like especially to thank Paul Albrecht, Jim Bailey, Ted Baker, Rich Bowser, Don Brodnick, Jerry Cox, Phil Devlin, Charlie Feldman, Scott Greenwald, Russ Hermes, David Israel, Franc Jager, Carlo Marchesi, Roger Mark, Joe Mietus, Warren Muldrow, Diane Perry, Scott Peterson, Ken Ripley, Paul Schluter, Alessandro Taddei, Roy Wallen, and Cees Zeelenberg. \appendix \section{Using the AHA Database} Since the AHA DB is not available in the standard PhysioBank format used by all of the other databases, the WFDB Software Package includes a pair of programs that convert files read from AHA DB distribution tapes or floppy disks into files in PhysioBank format. {\tt a2m} converts AHA annotation files, and {\tt ad2m} converts AHA signal files and also generates header ({\tt *.hea}) files. (Run these programs without command-line arguments to obtain instructions on their use.) Using {\tt a2m} and {\tt ad2m}, all 80 AHA DB records can be stored in roughly 130 MB of disk space (assuming use of the standard 35-minute records). These programs can also reformat old (pre-1989) MIT DB tapes written in the AHA DB distribution format. It is also possible to read and write AHA tape-format files directly using the WFDB library; refer to the {\it WFDB Programmer's Guide} for details. \section{Noise stress testing} For many of the tasks performed by an ECG analyzer, dealing with noise is the major problem facing system designers. Although measurements such as ST deviation may be obtained reliably in clean signals, the presence of noise may render them inaccurate.
In some instances, it is sufficient to recognize the presence of noise and either to mark measurements as unreliable or to avoid making measurements altogether. In other cases, excluding noisy data is inappropriate (for example, given the multiple correlations among physical activity, noise, and transient ischemia, excluding noisy signals is likely to introduce sampling bias in an ischemia detector). It is difficult to measure the effects of noise on an ECG analyzer using ordinary recordings. Even if existing databases include an adequate variety of both ECG signals and noise, the sample size is certainly too small to include all combinations of noise and ECG signals that may be encountered in clinical use. In ordinary recordings, it is difficult or impossible to separate the effects of noise from the intrinsic problems of analyzing clean signals of the same type. The noise stress test circumvents these problems. By adding noise in calibrated amounts to clean signals, any combination of noise and signal types is possible. Since both the noise-corrupted signal and the clean signal can be analyzed (in separate experiments) by the same analyzer, the effects of noise on the analysis are readily separable from any other problems that may arise while analyzing the clean signals. Finally, since the test can be repeated using different amounts of noise, it is possible to characterize analyzer performance as a function of signal-to-noise ratio. The major criticisms of the noise stress test are that not all noise is additive, and that the characteristics of the added noise may not perfectly match those of noise observed in clinical practice. These points, though formally irrefutable, do not negate the value of the test. In practice, most of the troublesome noise is additive; thus (given appropriate inputs) the noise stress test can simulate most of the noisy signals of interest. 
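Signal-to-noise ratios in such tests are conventionally expressed in decibels. As a reminder, the standard definition is
\begin{displaymath}
\mathrm{SNR} = 10 \log_{10} \frac{S}{N}
\end{displaymath}
where $S$ is the power of the signal and $N$ is the power of the noise. (How the signal and noise powers are estimated from the recordings is a separate question, and is not specified by this formula.) An SNR of 0~dB thus corresponds to equal signal and noise power, and each increase of 10~dB corresponds to a tenfold increase in signal power relative to the noise.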
The NST DB includes noise recordings made using standard ambulatory ECG electrodes and recorders, but with electrodes placed on the limbs of active volunteers in configurations in which the subject's ECG is not apparent in the recorded signals. Given the recording technique used, it is not surprising that the characteristics of the recorded noise closely match those of noise in standard ambulatory ECG recordings. Although it may be argued that the particular muscles responsible for the recorded noise might produce different signals than those that generate the EMG present in noisy ECGs, no such differences are apparent from comparisons of either the signals or their power spectra. The NST DB includes a small set of ECG records with calibrated amounts of added noise. EC38 specifies that performance on these records must be reported, although no specific performance levels are required. Program {\tt nst} can be used to generate additional records for noise stress testing. To do so, choose an ECG record and a noise record (the latter may be {\tt bw}, {\tt em}, or {\tt ma} from the NST DB, or any other available noise recording). Run {\tt nst} and answer its questions to generate a noisy ECG record that may then be used in the same way as any other WFDB record. By default, {\tt nst} adds no noise during the first five minutes of the record, then adds noise for the next two minutes, none for the following two minutes, and repeats this pattern of two minutes of noise followed by two minutes of clean signals for the remainder of the record. The scale factors for the noise, if determined by {\tt nst}, are adjusted such that the signal-to-noise ratios are equal for each signal. The durations of the noisy periods, and the scale factors for each signal, are recorded in a {\em protocol annotation file}, which is generated by {\tt nst} unless an existing protocol annotation file is supplied as input. 
To change these parameters, simply edit the protocol annotation file (using, for example, {\tt rdann} to convert it to text form, any text editor to make the modifications, and {\tt wrann} to convert it back to annotation file format), then rerun {\tt nst} using the protocol file to generate a new record. \end{document}