With every clinical encounter, surgeons make and interpret measurements. When they assess a patient prior to surgery, they inquire about the patient's age, height, and weight and make an assessment of the patient's pain, range of motion, and physical deformities. Subsequently, they monitor the heart rate, blood pressure, and urine output. All of these measurements are associated with some degree of error. Subconsciously, surgeons are aware of this error and, for every measurement that is taken, they decide how much error they are willing to accept. For example, surgeons would be content knowing the height of a patient within a margin of several centimeters, but a measurement of fracture displacement would need to be much more precise, on the order of millimeters.
The expected range of measurements is the main factor that determines the amount of measurement error that is acceptable. The real worth of a measurement is in how effectively it can be compared to one or more other measurements, either between patients or from the same patient at different times. If the error of a measurement is as large as the expected difference between measurements, the instrument will be useless. In the measurement of height, the expected range of measurements might be 50 to 60 cm, so even with an error of 4 to 5 cm it is still possible to differentiate patients according to the categories of short, average, or tall height. When considering the extent of fracture displacement, the difference between anatomic alignment and severe malalignment may only be 10 to 20 mm, so the measurement error must be much smaller than that.
Reliability refers to the relationship between measurement error and the expected distribution of measurements over time and across observers and situations1,2. Reliability is not the same as agreement. The fundamental difference is that reliability is measured relative to the distribution of measurements. A new test that always yields a result of 100.00 regardless of the rater, patient, or any other circumstances would have perfect agreement but would provide no useful information to the clinician, because it could not distinguish one patient from another. This important distinction makes reliability a much more powerful estimate of the usefulness of an instrument than simple measures of agreement are.
Statistically, reliability is the ratio of between-subject variability (in other words, the "true" differences between subjects) to the total variability (the "true" differences plus measurement error), and ranges from 0 (indicating that all of the variation in the sample is due to error) to 1 (indicating perfect reliability; i.e., all variation is due to "true" differences between subjects) (Fig. 1).
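Written as a formula, this definition takes the following form. The symbols are our own shorthand for illustration: \sigma_{s}^{2} denotes the between-subject ("true") variance and \sigma_{e}^{2} the variance due to measurement error.

```latex
R = \frac{\sigma_{s}^{2}}{\sigma_{s}^{2} + \sigma_{e}^{2}}, \qquad 0 \le R \le 1
```

When \sigma_{e}^{2} = 0 the ratio equals 1, and when \sigma_{s}^{2} = 0 it equals 0, matching the two extremes described above.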
An instrument must be reliable in order to be useful in measuring differences between patients. Once investigators have established that an instrument is reliable, they must determine if it measures what it is intended to measure (validity), whether the time and expense are practical (feasibility), and whether clinicians will actually use it in practice (acceptability).
In this paper, we identify common aspects of reliability studies and suggest features that improve readers' confidence in the results. One concept serves as the foundation for all further consideration: in order for a reliability study to be relevant, the patients, raters, and test administration in the study must be similar to the clinical or research context in which the instrument will be used.
Few guidelines exist to assist readers in appraising reliability studies or to assist researchers in designing them3. In this section, we suggest seven questions surgeons can ask themselves as they read a report of a reliability study or write a protocol to conduct one (Table I).
Was the Research Question Appropriate?
Investigators undertaking a reliability study must precisely define which instrument(s) are being tested and how the instrument(s) will be used in clinical or research practice. Furthermore, investigators must determine what type(s) of reliability they will measure in the study. The most common measures are the internal consistency, intraobserver, test-retest, and interobserver reliabilities.
Internal consistency reflects the correlation between an individual's responses within an instrument and suggests whether or not the items seem to be measuring the same thing. For example, Leggin et al. measured the internal consistency of the Penn Shoulder Score, which includes three items regarding pain: pain at rest, pain with normal activities, and pain with strenuous activities4. It might be expected that individuals with a high level of pain at rest would also have substantial pain with normal and strenuous activities and, conversely, that patients with no pain at rest might have no or low levels of pain with normal and strenuous activities. In this study, the internal consistency (measured with use of the Cronbach alpha test for inter-item correlation5) was 0.93, which indicates a very high correlation between the items.
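As a rough illustration of how an inter-item statistic of this kind is computed, the sketch below implements the standard Cronbach alpha formula in Python. The scores are hypothetical values invented for this example; they are not data from the Penn Shoulder Score study, and the function name is our own.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of items, each a list of per-respondent scores."""
    k = len(item_scores)
    n = len(item_scores[0])
    if k < 2 or any(len(item) != n for item in item_scores):
        raise ValueError("need at least two items with equal numbers of respondents")
    # Total score per respondent (sum across items).
    totals = [sum(item[i] for item in item_scores) for i in range(n)]
    # Sum of the sample variances of the individual items.
    sum_item_var = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - sum_item_var / variance(totals))


# Hypothetical pain scores (0 to 10) from six respondents on three items.
pain_at_rest = [0, 1, 2, 4, 6, 8]
pain_normal = [1, 2, 3, 5, 7, 9]
pain_strenuous = [2, 4, 4, 7, 8, 10]

alpha = cronbach_alpha([pain_at_rest, pain_normal, pain_strenuous])
print(f"Cronbach alpha = {alpha:.2f}")
```

Because the three hypothetical items rise and fall together across respondents, alpha is close to 1; items that varied independently of one another would pull it toward 0.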
Perhaps because calculation of internal consistency requires only a single administration of an instrument, internal consistency appears commonly in the literature and authors may refer to the measurement as "reliability." There are, however, many potential sources of measurement error that this calculation does not incorporate, such as differences between times, observers, and settings. Therefore, internal consistency represents the weakest form of reliability, and readers should interpret the results with caution2.
The other three types of reliability share an important characteristic: they measure the agreement between two or more test administrations.
Test-retest reliability measures the extent to which one observer who is rating a subject on multiple occasions achieves similar results. Since time elapses between ratings, the characteristics being rated may also change. For example, the range of motion of a knee may change substantially over the course of a two-week period (a common interval used for test-retest reliability measurements).
Intraobserver reliability is similar to test-retest reliability, except that the characteristics being rated are fixed. Of course, this type of measurement is only possible in certain circumstances, such as during the rating of radiographs or videos. Since time is the only factor that varies between administrations, this form of study design will typically yield a higher reliability estimate than that obtained with test-retest or interobserver reliability studies.
Interobserver reliability measures the extent to which two or more observers obtain similar scores when rating the same subject. Interobserver reliability is the broadest and—when error related to observers is highly relevant—the most clinically useful measure of reliability. Since the intraobserver reliability will usually be higher than or equal to the interobserver reliability, if researchers document an acceptable level of interobserver reliability in the appropriate context, no further reliability testing is necessary. However, if the interobserver reliability is poor, knowledge of the test-retest or intraobserver reliability might assist researchers in identifying sources of error and in making appropriate modifications. Furthermore, measuring interobserver reliability is inappropriate if only one individual will apply the test (e.g., self-reported quality-of-life questionnaires); in this situation, the test-retest reliability is more appropriate.
Were the Raters Representative of the Individuals Who Will Apply the Instrument in Practice?
The individuals who make the ratings are an obvious potential source of variation. For instruments that are self-administered (such as quality-of-life questionnaires) the rater is also the subject; we will discuss the principles of selecting these individuals in the next section. Here we outline some important points to consider for situations in which one or more raters apply an instrument to multiple subjects (such as a fracture classification system). Two factors may contribute to the variability between raters: the expertise level of each rater, and the raters' practice settings.
With respect to the level of expertise, a reliable rating is more likely to be assigned by a rater with more training and experience than by a rater with minimal or no training and experience. If raters with varying levels of expertise use the tool in practice (as is usually the case), then including raters with all potential levels of expertise will provide more informative results.
The same principle applies to the raters' practice settings. Of course, it is usually not practical to conduct a study that incorporates every level of expertise and practice setting to which surgeons may wish to extrapolate the results; however, researchers should include as diverse and representative a group of raters as possible. When reporting the results of a reliability study, researchers should state who the raters were and provide information regarding the expertise of the raters in the particular rating process.
Were the Patients or Subjects Representative of the Population That Will Be Rated in Practice?
The principles of selecting the patients or subjects for the study are very similar to those discussed for the raters. The patients in the study should represent the actual population that the clinicians will evaluate in clinical practice. For example, in a study assessing knee laxity, investigators measured the intra-rater and inter-rater reliability in a group of twenty healthy volunteers6. Unfortunately, this study is of little relevance to clinicians, who are not interested in measuring knee laxity in healthy individuals. This study would have been strengthened if the investigators included patients with very stable knees, very unstable knees, and knees for which stability fell somewhere between the two extremes.
Including patients who represent a broad range of pathology, disability, or whatever else the study is measuring also provides a statistical advantage. Intuitively, it might appear that clinicians would be more likely to agree on knee stability in a group of healthy volunteers. This highlights the difference between agreement and reliability: although the raw agreement may be higher in a homogeneous (similar) sample, the reliability will be lower.
Figures 2-A, 2-B, and 2-C depict this principle graphically. The panels in Figures 2-A and 2-B represent two studies with homogeneous groups of subjects at both extremes of a scale. In each of these cases, the true between-subject variability (the solid lines) is small relative to the error, or between-rater variability (the dashed lines). The panel in Figure 2-C represents the results from a study involving subjects from each of the extremes, plus a group in the middle. Here the between-subject variability is much larger relative to the error, so the reliability that is measured in this study will be substantially higher than that measured in the other two studies.
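The same principle can be shown numerically. The sketch below holds the measurement error constant and changes only the spread of the subjects; all scores and the error variance are hypothetical values chosen for illustration, and the function name is our own.

```python
from statistics import pvariance


def reliability(true_scores, error_variance):
    """Between-subject variance divided by total (between-subject + error) variance."""
    between = pvariance(true_scores)
    return between / (between + error_variance)


ERROR_VARIANCE = 4.0  # assumed to be identical in every sample

# Hypothetical "true" scores on a 0-100 scale.
low_group = [8, 10, 12, 9, 11]        # homogeneous, low extreme
high_group = [88, 90, 92, 89, 91]     # homogeneous, high extreme
full_spectrum = low_group + [48, 50, 52, 49, 51] + high_group

samples = [("low only", low_group),
           ("high only", high_group),
           ("full spectrum", full_spectrum)]
for name, sample in samples:
    print(f"{name:14s} reliability = {reliability(sample, ERROR_VARIANCE):.2f}")
```

With identical error in all three samples, the two homogeneous groups yield a low reliability while the full-spectrum sample yields a value near 1, simply because the between-subject variance is so much larger.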
Thus, reliability can be manipulated. If you wish to make your instrument appear highly reliable, include normal subjects and those with extreme pathology or dysfunction. If you wish to make a competitor's instrument appear unreliable, choose a homogeneous population. Researchers should resist these temptations and recruit patients who are representative of the spectrum of disease that clinicians will see in practice.
Did Raters Assign the Ratings in a Clinically Relevant Manner?
The administration of the ratings will vary depending on the nature of the raters and the subjects. Nevertheless, the objective of the rating sessions should be the same: to mimic, as closely as possible, the clinical practice environment. For example, when assessing the reliability of a fracture classification system that involves the estimation of lengths and angles, raters should only use tools such as rulers or protractors if they would use them in real-life clinical practice. Furthermore, researchers must consider what additional information they will make available to raters, such as patient history or other physical findings. The most pragmatic approach is to provide raters with as much clinical information as they would normally have access to in clinical practice7. If, however, researchers wish to determine the impact of different instruments in isolation, they should only provide subject information that is directly relevant to the instrument being tested. Returning to the example of knee laxity, knowledge of the clinical history of a participant could easily influence a rater: clinicians would expect healthy volunteers to have stable knees, while injured patients would be much more likely to demonstrate instability.
Irrespective of the context, all raters should independently complete the evaluations in similar test settings. For example, in a study of classification systems for fractures of the distal part of the radius8, each rater might view digital radiographs on a personal computer or hard copies from a light box. Either method would be acceptable (the ideal method would be whichever was used most commonly in clinical practice), but it would not be appropriate for some individuals to view the images on a computer and others to view hard copies unless this variability represented the regular practice of the raters.
A web-based approach is an innovative method of administering reliability studies for radiographic images, such as fracture classification systems. Current web-based technology allows researchers in North America to send images to Asia faster than they can walk into an adjoining office. Researchers in a variety of medical fields have reported that web-based technology has improved efficiency and collaboration in clinical research and practice9-11. The Collaboration for Outcome Assessment in Surgical Trials (COAST) has developed a web-based methodology for conducting reliability studies of radiographic images12.
Were the Data Analyzed with Use of Appropriate Reliability Statistics?
Statisticians have described a wide variety of techniques to measure agreement or reliability (Table II). Given the broad analytical options, investigators should consider calculating and reporting more than one statistical estimate13. We will briefly discuss some common forms of reliability analyses for categorical and continuous data; readers interested in learning more about reliability analyses should refer to a statistical text or focused review2,14,15.
Categorical Data
The simplest measure of agreement, the proportion or percentage agreement, fails to address the agreement that one would expect due to chance. Consider, for example, the data in Table III, summarized from an intraobserver study of resistance testing in subjects with shoulder pain16. Adding the "agreement" cells and dividing by the total yields the proportion agreement; 93% (26 of 28) in this case, which seems extremely good. Table IV displays the data that would result if raters guessed at random, but with the same overall distribution of "strong" to "weak." Here the raw agreement is 79% (22 of 28), which is also quite good. Clearly, the value of 93% does not accurately reflect the reliability of this measure, because it does not account for the agreement that may be due to chance alone. Fortunately, there are several statistical approaches that do address chance agreement.
The kappa coefficient, the most commonly reported statistic in orthopaedic fracture reliability studies1, accounts for chance agreement in categorical responses by comparing the observed agreement with the possible agreement beyond chance17. This statistic yields a maximum value of 1.0 (indicating perfect agreement), with 0.0 indicating no agreement beyond chance, and negative values indicating agreement worse than chance. Examining the shoulder stability data once more (Table III), the kappa is 0.63, substantially lower than the raw agreement of 0.93.
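The calculation can be sketched in a few lines of Python. The cell counts below are an assumption: they are one 2 x 2 table consistent with the figures quoted in the text (26 of 28 ratings in agreement), not a reproduction of Table III, and the function names are our own.

```python
def raw_agreement(table):
    """Proportion of subjects on the diagonal of a square agreement table."""
    total = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / total


def cohen_kappa(table):
    """Cohen's kappa: (observed - chance) / (1 - chance) for a square table."""
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    observed = raw_agreement(table)
    # Agreement expected by chance, given each rating's marginal distribution.
    chance = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - chance) / (1 - chance)


# Hypothetical counts: rows = first rating, columns = second rating,
# categories in the order ("strong", "weak").
table = [[24, 1],
         [1, 2]]

print(f"raw agreement = {raw_agreement(table):.2f}")
print(f"kappa         = {cohen_kappa(table):.2f}")
```

Because almost all ratings fall in a single category, most of the observed agreement is expected by chance, which is why kappa falls well below the raw agreement.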
Researchers can use kappa to calculate agreement for two or more observers, and with two or more categories of response. In the latter context, if some responses are closer than others (most commonly with ordered responses, such as a severity score of 1 to 4) they can employ a "weighted" kappa that incorporates partial agreement18. One disadvantage of kappa occurs when the distribution of responses is very skewed: in this case there is little room for agreement above chance, so kappa may be deceptively small19.
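A minimal sketch of a weighted kappa for two ratings follows. It assumes disagreement weights that grow with the distance between ordered categories (linear or quadratic), which is a common choice but not necessarily the scheme used in any study cited here; the 4 x 4 table is hypothetical and the function name is our own.

```python
def weighted_kappa(table, power=2):
    """Weighted kappa with disagreement weights |i - j| ** power.

    power=1 gives linear weights, power=2 quadratic weights.
    table[i][j] counts subjects placed in category i by the first
    rating and category j by the second.
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) ** power
            observed += weight * table[i][j] / total
            expected += weight * row_totals[i] * col_totals[j] / total ** 2
    return 1 - observed / expected


# Hypothetical severity scores (1 to 4) assigned on two occasions.
severity = [[10, 2, 0, 0],
            [1, 8, 2, 0],
            [0, 2, 7, 1],
            [0, 0, 1, 6]]

print(f"linear weights:    {weighted_kappa(severity, power=1):.2f}")
print(f"quadratic weights: {weighted_kappa(severity, power=2):.2f}")
```

With only two categories every disagreement carries the same weight, so this function returns the same value as the unweighted kappa.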
The phi statistic is a measure of "chance-independent" agreement20. The biggest advantage of phi is its resistance to skewed distributions. The phi statistic from the shoulder stability data is 0.75, a reliability estimate between the values calculated with kappa and the raw agreement. Since the distribution of responses from this study is skewed, phi is probably the best measure of the reliability. Despite this attractive feature, phi is uncommonly reported in medical statistics.
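For readers who wish to see the arithmetic, the sketch below uses the odds-ratio form of chance-independent phi as we understand it from the literature on this statistic; the text does not state the formula, so treat it as an assumption. The cell counts are likewise hypothetical: one 2 x 2 table consistent with 26 of 28 ratings in agreement, not a reproduction of Table III.

```python
import math


def chance_independent_phi(a, b, c, d):
    """phi = (sqrt(OR) - 1) / (sqrt(OR) + 1), with OR = (a * d) / (b * c).

    a and d are the two agreement cells of a 2 x 2 table;
    b and c are the two disagreement cells.
    """
    if b == 0 or c == 0:
        # The odds ratio is undefined; adding 0.5 to every cell is one
        # commonly used workaround, left to the caller here.
        raise ValueError("odds ratio undefined when a disagreement cell is zero")
    root_or = math.sqrt((a * d) / (b * c))
    return (root_or - 1) / (root_or + 1)


# Hypothetical counts: 24 strong/strong, 2 weak/weak, 1 disagreement each way.
print(f"phi = {chance_independent_phi(24, 1, 1, 2):.2f}")
```

Because the statistic depends only on the odds ratio, it is unaffected by how unevenly the responses are split between the two categories.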
Continuous Data
The Pearson correlation represents a familiar approach to continuous data, but it is limited in that two sets of measurements may be perfectly correlated but have poor agreement. Figure 3 demonstrates this point with data from two hypothetical reliability studies: in both studies, the ratings from reviewer 1 are perfectly correlated with the ratings from reviewer 2. In one study, however, the agreement between reviewers is perfect (red line), while in the other study the reviewers do not actually agree on any measurements (green line). Therefore, the Pearson correlation insufficiently describes the relationship between two variables for the purposes of a reliability study.
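A small numerical example makes the same point. The measurements below are hypothetical: the second reviewer either matches the first exactly or reads a constant 15 units higher.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def exact_matches(x, y):
    """Number of subjects on whom the two reviewers gave identical values."""
    return sum(a == b for a, b in zip(x, y))


reviewer_1 = [10, 20, 30, 40, 50]
reviewer_2_same = [10, 20, 30, 40, 50]
reviewer_2_offset = [value + 15 for value in reviewer_1]  # systematic bias

for label, reviewer_2 in [("identical", reviewer_2_same),
                          ("offset", reviewer_2_offset)]:
    r = pearson_r(reviewer_1, reviewer_2)
    matches = exact_matches(reviewer_1, reviewer_2)
    print(f"{label:9s} r = {r:.2f}, exact matches = {matches} of {len(reviewer_1)}")
```

Both comparisons give a correlation of 1, yet in the second the reviewers never record the same value for any subject.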
Intraclass correlation coefficients are a set of related measures of reliability, derived from a repeated measures analysis of variance21, that yield a value that is closest to the formal definition of reliability. One intraclass correlation coefficient measures the proportion of total variability that is due to true between-subject variability22. Although analysts most commonly calculate an intraclass correlation coefficient for continuous outcomes, when applied to categorical data it is equivalent to the weighted kappa with quadratic weighting. Several variations of the intraclass correlation coefficient facilitate its use in addressing a variety of reliability issues2.
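As an illustration, the sketch below computes one member of this family, the two-way random-effects, absolute-agreement, single-rating coefficient often labeled ICC(2,1), from the mean squares of a subjects-by-raters analysis of variance. The choice of this variant is our assumption; the text does not single one out, and the ratings are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rating.

    ratings[i][j] is the score given to subject i by rater j.
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Partition the total sum of squares into subjects, raters, and error.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical data: two raters, five subjects.
agree = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
offset = [[1, 6], [2, 7], [3, 8], [4, 9], [5, 10]]  # rater 2 reads 5 units higher

print(f"identical ratings: ICC = {icc_2_1(agree):.2f}")
print(f"offset ratings:    ICC = {icc_2_1(offset):.2f}")
```

Unlike the Pearson correlation, this coefficient penalizes a systematic difference between raters: the offset ratings are perfectly correlated but yield a low value.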
In summary, data analysts have many statistical options available to estimate the reliability of two or more sets of measurements. The following are the most commonly reported statistics: kappa for dichotomous responses, weighted kappa for polytomous (more than two categories) responses, and intraclass correlation coefficient for continuous data. Investigators who encounter more complex analytical situations, such as a comparison of two or more reliability estimates23,24 or separation of the error into multiple sources (such as observers, times, and locations) in a single analysis23,25, should involve a statistician familiar with these techniques.
How Was the Sample Size (of Raters and Subjects) Determined?
Researchers control the size of two samples in reliability studies: the number of raters and the number of subjects. Although increasing the number in either group will yield a more precise reliability estimate (a narrower confidence interval), the number of subjects has a much greater impact on the precision than the number of raters does (especially when there are more than four or five raters)2. Therefore, we recommend determining the number of raters based on generalizability and feasibility, then estimating the number of subjects required to achieve the desired precision.
The number of raters that are needed to satisfy the generalizability requirement depends on the characteristics of the raters. The feasibility of performing multiple ratings also depends on the nature of the subjects: radiographs can easily be rated several times by different individuals, but living patients are unlikely to be as accommodating. Thus, the ultimate decision about the number of raters to include involves balancing the theoretical benefits of increased generalizability with feasibility considerations.
When the number of raters has been determined, investigators can perform a sample-size calculation to estimate the required number of subjects. As with any sample-size estimation, the calculation is dependent on the analysis plan. We will describe the approach to sample-size estimation for studies that use an intraclass correlation coefficient; interested readers can find estimates for other reliability statistics in the cited material15.
Researchers may use two approaches to estimate the appropriate number of subjects. In the first method, investigators choose the minimum acceptable reliability and estimate the sample size needed to demonstrate that the actual reliability is higher26. In most reliability studies, the minimum acceptable reliability is not intuitively obvious. The second approach is based on the desired precision of the reliability estimate. The calculation incorporates the number of raters, the expected intraclass correlation coefficient (estimated from past studies or simply a "best guess"), the confidence level (usually 95%), and the desired width of the confidence interval. Table V displays sample-size estimates for selected parameters; readers may find a description of the full calculation elsewhere27.
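To give a sense of how the precision-based approach behaves, the sketch below uses a closed-form approximation generally attributed to Bonett. This is an assumption on our part: it is not necessarily the calculation behind Table V, and its results may differ from the tabulated values. The function and parameter names are our own.

```python
import math
from statistics import NormalDist


def subjects_for_icc_precision(expected_icc, raters, ci_width, confidence=0.95):
    """Approximate number of subjects needed for a confidence interval of the
    requested total width around an intraclass correlation coefficient.

    Approximation used (an assumption, see the accompanying text):
        n = 8 z^2 (1 - rho)^2 (1 + (k - 1) rho)^2 / (k (k - 1) w^2) + 1
    """
    if raters < 2:
        raise ValueError("at least two raters are required")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    numerator = (8 * z ** 2 * (1 - expected_icc) ** 2
                 * (1 + (raters - 1) * expected_icc) ** 2)
    denominator = raters * (raters - 1) * ci_width ** 2
    return math.ceil(numerator / denominator + 1)


# Hypothetical planning scenarios: expected ICC of 0.7, 95% confidence.
for raters in (2, 4):
    for width in (0.2, 0.4):
        n = subjects_for_icc_precision(0.7, raters, width)
        print(f"raters = {raters}, CI width = {width}: about {n} subjects")
```

Under this approximation, halving the width of the interval roughly quadruples the number of subjects required, which is why the desired precision should be chosen with feasibility in mind.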
How Can the Results Be Interpreted?
Fortunately, most of the statistics that we have discussed yield values on the same scale: 0.0 indicates that all of the variability is due to error, and 1.0 indicates that all of the variability is due to true between-subject differences. Unfortunately, reliability studies rarely yield estimates close to either of these values; as reported elsewhere1, actual results are more likely to be somewhere between 0.3 and 0.7. So what is an "acceptable" level of reliability?
Researchers have proposed guidelines to assist readers in interpreting reliability estimates28-32. All of these are variations on the same, unsurprising theme: values close to 0 (or negative) represent poor reliability, values close to 1.0 represent excellent reliability, and values around 0.5 represent moderate reliability. Ultimately, whether or not a given level of reliability is acceptable will depend on the context of the measurement and the other instruments available. If the instrument being studied is the only tool available to measure an important quality, then it will have to suffice until investigators develop a more reliable tool.
Because interpretation of a reliability study is context-specific, readers must determine if the raters, subjects, and instrument administration in the study reflect their clinical or research setting. If the contexts are similar, readers may comfortably expect similar reliability in their setting. However, if the settings are sufficiently different, readers must apply the results cautiously or repeat the reliability testing in more applicable circumstances.