Abstract
Current elbow-scoring systems are based on the observer-derived assessment of a variety of clinical and functional criteria, which are scored separately and then aggregated. The aggregate score then is assigned a categorical ranking that ranges from excellent to poor. The developers of different elbow-scoring systems have chosen different outcome criteria, assigned different weights to each criterion, and accorded different ranges of values to each categorical ranking. Five different elbow-scoring systems (the Mayo elbow-performance index and the systems of Broberg and Morrey, Ewald et al., The Hospital for Special Surgery, and Pritchard) were used to evaluate the same group of patients. The validity of the scoring systems was determined with use of visual-analog scales for the assessment of pain and function, patient and physician-derived ratings of the severity of impairment of the elbow, and two functional questionnaires completed by the patient (the Disabilities of the Arm, Shoulder and Hand questionnaire and the Modified American Shoulder and Elbow Surgeons patient self-evaluation form). The study sample consisted of sixty-nine patients who had sought treatment at one of two tertiary referral clinics because of problems related to the elbow. Pearson product-moment correlation coefficients were used to compare the raw aggregate scores, and kappa statistics were used to determine the level of agreement among the categorical rankings (excellent, good, fair, and poor).
Examination of the five scoring systems revealed a remarkable lack of concordance with regard to the aspects of elbow function that were assessed. Good correlation was observed when the systems were compared on the basis of raw scores (Pearson product-moment correlation coefficients, 0.79 to 0.90), but only slight-to-moderate correlation was noted when the systems were compared on the basis of categorical rankings (quadratic weighted kappa coefficients, 0.18 to 0.49). Validity testing showed the system of Ewald et al. and the Mayo elbow-performance index to be the most discriminating, the system of Pritchard to be the least discriminating, and the system of The Hospital for Special Surgery and the system of Broberg and Morrey to be intermediate. The scores determined with the elbow-scoring systems demonstrated only moderate correlation with the score for function on the visual analog scale (Pearson product-moment correlation coefficients, 0.44 to 0.66), whereas those derived from the functional questionnaires completed by the patient demonstrated moderate-to-good correlation with the score for function (Pearson product-moment correlation coefficients, 0.72 and 0.80).
CLINICAL RELEVANCE: We observed a remarkable lack of agreement when five different elbow-scoring systems were used to determine categorical rankings for the same cohort of patients. The correlations between the raw aggregate scores were better. On the basis of these findings, we believe that outcomes should be expressed as raw scores rather than as categorical rankings. We also found that scores derived from patient-completed functional questionnaires correlated more closely with perceived functional loss than did those determined with aggregate elbow-scoring systems. It must be recognized that comparisons between studies that are based on different scoring systems are not valid and that the categorical rankings of different systems are not interchangeable. The outcome of therapies designed for the treatment of the elbow should be determined on the basis of a patient-derived assessment of function, a clinical examination, and an assessment of pain.
Scoring systems that are designed to assess the severity of impairment of the elbow should accurately and reproducibly describe function, measure outcome, and facilitate communication between clinicians. However, the use of different scoring systems has led to disagreement as to when an outcome should be considered excellent or good11,21. In a previous review of the literature, two of us (D. E. B. and R. R. R.) and colleagues found that at least fourteen different definitions for these terms have been used to describe the outcome after operations on the elbow4. Clinicians may mistakenly assume that the categorical rankings (excellent, good, fair, and poor) that have been used in independently designed scoring systems describe similar levels of impairment. The variable definition of terms hinders accurate communication between investigators and is an impediment to the objective comparison of the results of different studies14.
Current elbow-scoring systems are based on the assessment of domains such as range of motion, pain, and ability to perform daily activities, which are scored separately; the scores then are aggregated and assigned categorical rankings that range from excellent to poor. Each scoring system is based on a variable admixture of clinical and functional criteria. The developers of different elbow-scoring systems have assigned different weights to each domain and different ranges of values to each categorical ranking. As a result, the same categorical ranking (for example, excellent) is defined differently in each scoring system. Consequently, the same patient may have different raw scores and different (or similar) categorical rankings depending on which scoring system is used.
When an observer questions a patient with regard to function and then records the responses, the possibility of observer bias is introduced. Patient-completed functional questionnaires have been developed and tested with psychometric and clinimetric methods2,3,15. Raw scores, rather than categorical rankings, are derived from such questionnaires. Region-specific functional questionnaires have been shown to be more responsive to change in the function of the upper extremity than general health-status questionnaires3.
The primary purpose of the present study was to compare five different elbow-scoring systems in order to determine their ability to reproducibly describe the severity of impairment of the elbow in the same group of patients. The secondary purpose was to compare the validity of elbow-scoring systems (from which categorical ratings are derived) with that of patient-completed functional questionnaires (from which raw scores are derived).
*No benefits in any form have been received or will be received from a commercial party related directly or indirectly to the subject of this article. No funds were received in support of this study.
†205-4040 Finch Avenue East, Scarborough, Ontario M1S 4V5, Canada.
‡Institute for Work and Health, 250 Bloor Street East, Suite 702, Toronto, Ontario M4W 1E6, Canada. E-mail address for Ms. Beaton: dbeaton@iwh.on.ca.
§Upper Extremity Reconstructive Service, St. Michael's Hospital, 55 Queen Street East, Suite 800, Toronto, Ontario M5C 1R6, Canada. E-mail address for Dr. Richards: richardsr@smh.toronto.on.ca.
Materials and Methods
Clinical Usefulness of the Scoring Systems
The key features that determine the value of an assessment tool are its clinical usefulness and its construct validity5,17,20. The clinical usefulness of five elbow-scoring systems (the Mayo elbow-performance index22 and the systems of Broberg and Morrey, Ewald et al., The Hospital for Special Surgery11, and Pritchard) was assessed in terms of a number of specific criteria, including the feasibility of use in an outpatient orthopaedic clinic, the ease of administration, the availability of standardized instructions, the associated costs, the time needed for completion, the burden on the respondent, the face validity, and the content validity5,8,20. The theoretical aspects of the scoring systems were assessed before the patients were interviewed, and the practical aspects were assessed after the patients had been evaluated but before the data were analyzed. The clinical usefulness of each system was determined by consensus, without the use of a standardized form.
Sample
The study sample was drawn from a population of unselected patients who were seen at one of two tertiary referral upper-extremity clinics because of problems related to the elbow. The patients either were being managed non-operatively or had had an operative procedure and were being reassessed postoperatively. A diverse sample was used so that a wide spectrum of severity of impairment could be analyzed. The assessment took place during a regular clinic visit.
The size of the sample was based on two criteria. The size that was needed for the correlation analysis of the five scoring systems was determined first. We anticipated moderate-to-good correlations (r = 0.70) and based the sample size on that figure. According to the methodology of Lachin (which sets the probability of a type-I error at 0.05 and the probability of a type-II error at 0.20), eleven subjects were needed. Adjustment for multiple correlations (that is, the five scoring systems) indicated that at least fifty-five subjects were needed.
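The sample-size reasoning described above can be illustrated with a short calculation. The sketch below uses the Fisher z approximation for a single correlation coefficient, which is an assumption made here for illustration: the exact formula of Lachin that was applied is not reproduced in the text, so the result lies in the same range as the reported figure of eleven subjects but need not equal it. The function name and its defaults are hypothetical. The final lines show only that the reported adjustment (eleven to fifty-five) is consistent with multiplying by the number of scoring systems.

```python
import math
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, beta=0.20, two_sided=True):
    """Approximate sample size needed to detect a correlation r.

    Uses the Fisher z approximation (an assumption, not necessarily
    the exact variant applied in the study).
    """
    normal = NormalDist()
    if two_sided:
        z_alpha = normal.inv_cdf(1 - alpha / 2)
    else:
        z_alpha = normal.inv_cdf(1 - alpha)
    z_beta = normal.inv_cdf(1 - beta)
    fisher_z = 0.5 * math.log((1 + r) / (1 - r))
    return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)


# Anticipated correlation and error rates as stated in the text.
approximate_n = n_for_correlation(0.70, alpha=0.05, beta=0.20)

# Figures reported in the text: eleven subjects per correlation,
# adjusted for the five scoring systems.
reported_n_per_correlation = 11
number_of_systems = 5
adjusted_total = reported_n_per_correlation * number_of_systems

print(approximate_n, adjusted_total)
```

Whether a one-sided or two-sided test is assumed changes the approximate figure, which is one reason the sketch should not be read as a reconstruction of the original calculation.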
The size of the sample that was needed for the analysis of the test-retest reliability of the patient-completed functional questionnaires was determined next. The formula suggested by Donner and Eliasziw was used for the detection of a reliability coefficient of 0.80 while assuming that the reliability would be at least fair to moderate. According to these parameters and the criteria described by Lachin, it was necessary for at least twenty to twenty-five patients to complete the questionnaire package twice.
Consent
The study was approved by the Research Ethics Board of our institution. Written consent was obtained from each patient. The consent form included information on the study, ensured confidentiality, and emphasized that participation was voluntary and would not affect the care that the patient received. A copy of the consent form was given to each patient to take home.
Collection of Data
Data were obtained by means of a physical examination of the elbow and the completion, by the patient, of a questionnaire package (Fig. 1). One of us (D. C. T.) examined all of the elbows and recorded information on range of motion, strength, and stability as required by the five scoring systems (Table I). Isometric strength of the flexors of the elbow was measured three times with use of a Chatillon strength dynamometer (model CSD 200; Chatillon Medical Products, Greensboro, North Carolina), and the mean value was recorded. Similarly, grip strength was measured three times with use of a Jamar dynamometer (Asimow Engineering, Los Angeles, California) on the second setting, and the mean was recorded. The information that was needed to determine the five elbow scores was derived with use of a standardized protocol that took less than twenty minutes to complete; the data were extracted later in order to minimize patient fatigue.
The questionnaire package included items on demographic characteristics (age, gender, hand dominance, duration of pain, and so on) as well as on a number of variables that were used for validity testing, as will be discussed. Additional questions from the elbow-scoring systems also were included in the questionnaire package.
Management and Analysis of Data
Copies of the standardized assessments and their matched questionnaires were identified with use of a study number that was recorded only on the forms. The data were entered and checked for accuracy with Epi Info software (USD, Stone Mountain, Georgia) and then were analyzed with SAS software (SAS Institute, Cary, North Carolina).
Distribution of Responses
The data that were obtained with the five elbow-scoring systems and the two patient-completed questionnaires were described in terms of the frequency distribution, the response pattern, the central tendencies (the mean and the median), and the spread (the standard deviation and the range of responses). The distribution of the responses was analyzed for skew toward either end of the response spectrum in order to determine the presence of a ceiling or floor effect. The raw scores for the individual items of each scoring system were added, and the aggregate score then was assigned a categorical ranking of excellent, good, fair, or poor on the basis of a 100-point scale. The raw scores are interpreted differently according to each scoring system (Table II).
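The conversion of an aggregate score into a categorical ranking can be sketched as a simple threshold lookup. The cut-off values below are hypothetical and are chosen only to show how the same raw score can receive different rankings; the actual definitions used by the five systems are those given in Table II and are not reproduced here. The system labels and the function name are likewise illustrative.

```python
# Hypothetical cut-offs (minimum score, ranking), highest first.
# These are NOT the thresholds of any published elbow-scoring system.
HYPOTHETICAL_CUTOFFS = {
    "System A": [(90, "excellent"), (75, "good"), (60, "fair")],
    "System B": [(95, "excellent"), (80, "good"), (50, "fair")],
}


def categorical_ranking(score, cutoffs, lowest="poor"):
    """Return the first ranking whose minimum score is met."""
    for minimum, label in cutoffs:
        if score >= minimum:
            return label
    return lowest


raw_score = 78
for system, cutoffs in HYPOTHETICAL_CUTOFFS.items():
    print(system, categorical_ranking(raw_score, cutoffs))
```

With these illustrative thresholds, a raw score of 78 points is ranked good by one system and fair by the other, even though the underlying score is identical.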
Agreement among the Scoring Systems
Raw scores: The raw scores of the five systems were compared with use of Pearson product-moment correlation coefficients. A value of more than 0.75 indicated good correlation; 0.30 to 0.75, moderate correlation; and less than 0.30, chance or trivial correlation.
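The comparison of raw scores can be sketched as follows. The function names are hypothetical, and the use of the absolute value of the coefficient when applying the thresholds is an assumption; the thresholds themselves are those stated above. The analysis in the study was performed with SAS software, so this is an illustration of the calculation rather than the original procedure.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def interpret_r(r):
    """Apply the thresholds stated in the text (absolute value assumed)."""
    magnitude = abs(r)
    if magnitude > 0.75:
        return "good"
    if magnitude >= 0.30:
        return "moderate"
    return "chance or trivial"


# Coefficients reported in the study for illustration.
print(interpret_r(0.79), interpret_r(0.44))
```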
Categorical rankings: Quadratic weighted kappa coefficients were used to determine the level of agreement between each possible pair of categorical rankings. An overall kappa coefficient was used to calculate agreement, beyond chance alone, among all five scoring systems. The kappa coefficients were interpreted according to the criteria described by Landis and Koch19,28. A value of 0 to 0.20 indicated slight agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, substantial agreement; and more than 0.80, almost perfect agreement. A kappa coefficient that is less than 0 reflects agreement that is less than would be expected by chance alone.
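A minimal sketch of a quadratic weighted kappa coefficient for one pair of systems is given below, together with the interpretation bands described above. The function names are hypothetical, and the sketch assumes that both systems use the same ordered set of categories, which was not the case for every system in the study (one system had five rankings and one had three). It does not reproduce the overall kappa coefficient across all five systems.

```python
def quadratic_weighted_kappa(ratings_a, ratings_b, categories):
    """Quadratic weighted kappa for two lists of ordered categories."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    index = {label: i for i, label in enumerate(categories)}
    n = len(ratings_a)

    # Cross-tabulate the paired rankings.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted observed and chance-expected disagreement.
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings are identical")
    return 1.0 - weighted_observed / weighted_expected


def interpret_kappa(kappa):
    """Interpretation bands of Landis and Koch as listed in the text."""
    if kappa < 0:
        return "less than chance"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


print(interpret_kappa(0.18), interpret_kappa(0.49))
```

The quadratic weights penalize a disagreement of two or three categories much more heavily than a disagreement of one category, which suits ordered rankings such as excellent, good, fair, and poor.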
Construct Validity
Construct validity indicates the degree to which a scoring system is associated with other measures that are hypothesized to have a specific relationship with the system. Testing of construct validity builds confidence in a scoring system. The measures that were tested in the present study included the patient and physician-derived assessments of the severity of impairment, the level of pain, the ability to perform normal activities of daily living, and the responses on contemporary patient-completed functional questionnaires.
The patient was asked to rate the overall severity of the impairment (on a 5-point Likert-type scale) as very mild, mild, moderate, severe, or very severe. The examiner (D. C. T.) subjectively rated the severity of the impairment on the same scale; this rating was based on the examiner's clinical judgment and was rendered after the completion of the physical examination and the standardized assessment protocol but before the administration of the patient-completed functional questionnaires. The aggregate scores from the five elbow-scoring systems then were compared across the patient and physician-rated levels of severity with use of one-way random effects analysis of variance. If the mean scores were found to be significantly different from each other (p < 0.05), Duncan post hoc testing was performed in order to ascertain which of the individual levels of severity differed from one another.
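The comparison of aggregate scores across levels of severity can be sketched as the calculation of a one-way analysis-of-variance F statistic. The function name and the example values are hypothetical. The sketch stops at the F statistic and its degrees of freedom: the p value requires the F distribution, which is available in statistical packages but not in the Python standard library, and the Duncan post hoc procedure is not implemented here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and n greater than k")
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual scores around their own group mean.
    ss_within = sum(
        (value - m) ** 2 for g, m in zip(groups, group_means) for value in g
    )
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical aggregate scores grouped by rated severity.
scores_by_severity = [[90, 94, 88], [78, 82, 75], [60, 66, 58]]
print(one_way_anova_f(scores_by_severity))
```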
The level of pain and the ability to perform the normal activities of daily living were assessed with two visual-analog scales. Each scale consisted of a ten-centimeter-long horizontal line, with zero centimeters indicating no pain or difficulty and ten centimeters indicating severe pain or difficulty16.
Disability was evaluated with use of the Modified American Shoulder and Elbow Surgeons (M-ASES) patient self-evaluation form3,4,24 and the Disabilities of the Arm, Shoulder and Hand (DASH) questionnaire15. The former focuses exclusively on functional limitations, whereas the latter incorporates questions related to functional limitations, symptoms, and psychosocial problems. We assumed that more severely disabled individuals (those who had a high score on the DASH questionnaire and a low score on the M-ASES questionnaire) would have lower aggregate scores.
Pearson product-moment correlation coefficients were used to determine the association between the values on the visual-analog scales, the responses on the patient-completed functional questionnaires, and the raw aggregate scores on the elbow-scoring systems.
Test-Retest Reliability of the Functional Questionnaires
The reliability of the patient-completed functional questionnaires had not been determined previously in the population of patients with problems related to the elbow. In order to provide data for the determination of test-retest reliability, twenty-eight patients completed another set of questionnaires two or three days after they had completed the first. The patients who were selected for this portion of the study had a stable condition, had not had an operation in the preceding three months, and had received no interventional treatment in the forty-eight to seventy-two hours before the questionnaires were completed for the second time. Test-retest reliability was analyzed with use of the techniques described by Shrout and Fleiss as well as by Bartko. Two-way analysis of variance was performed in order to determine the intraclass correlation coefficient. Pearson correlation coefficients also were calculated as another means of describing reliability (trend).
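The derivation of an intraclass correlation coefficient from a two-way analysis of variance can be sketched as follows. The text does not state which of the forms described by Shrout and Fleiss was used; the sketch assumes the two-way random-effects, single-measurement form, commonly written ICC(2,1). The function name and the example scores are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1) from a table of scores: one row per patient, one column
    per administration of the questionnaire (an assumed ICC form)."""
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a rectangular table, at least 2 by 2")
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way analysis of variance without replication.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((v - grand_mean) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical test and retest scores for five patients.
test_retest = [[12, 14], [35, 33], [50, 52], [20, 19], [64, 60]]
print(round(icc_2_1(test_retest), 2))
```

Unlike the Pearson coefficient, this form of the intraclass correlation coefficient is lowered by a systematic shift between the first and second administrations, which is why the two statistics are reported side by side.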
We did not evaluate the reliability of the maneuvers that were performed during the physical examination or that of the information that was obtained with use of the standardized protocol because one examiner evaluated all of the patients under ideal conditions in a controlled environment6,12,13,25,29.
Results
Clinical Usefulness of the Scoring Systems
The system of The Hospital for Special Surgery had a simple format. The associated cost was low because the only necessary equipment was a five-pound (2.27-kilogram) weight, a two-pound (0.91-kilogram) weight, and a goniometer. The examiner required little training, and use of the system in a clinic was practical. Stability of the elbow was not included in the content of the scale. The scaling was ordinal and seemed appropriate. The scoring was done easily, and the weighting of the domains seemed sensible.
The system of Ewald et al. had a clear format. The associated costs were low because only a goniometer was needed. Little training was required. The system could be used in a clinic, although it was difficult to assess varus-valgus deformity in patients who had a flexion contracture of the elbow. Neither strength nor stability was included in the content of the scale, and motion was assessed only in terms of flexion and extension. The scaling was ordinal and seemed appropriate. Pain was weighted heavily, and motion was weighted lightly.
The system of Pritchard was more confusing, particularly with regard to the scoring of range of motion. The cost was somewhat higher because several weights were required. Little training was needed, and the system was suitable for use in a clinic. The content of the scale was poor in that function, deformity, and stability of the elbow were not assessed. The scaling was ordinal and seemed appropriate. Pain was weighted heavily.
The system of Broberg and Morrey had a clear format, and the cost of the necessary equipment was low. Little training was required, and the system was suitable for use in a clinic. Deformity was not included in the content of the scale. The scaling was ordinal and seemed appropriate. The scoring was done easily, and the weighting of the domains was reasonable.
The Mayo elbow-performance index had a clear format. The associated costs were low because only a goniometer was necessary. Little training was needed, and the system was suitable for use in a clinic. Neither strength nor deformity was included in the content of the scale, and motion was assessed only in terms of flexion and extension. The scaling was ordinal and seemed appropriate. Function and motion were weighted less heavily than pain.
Demographic Data
Sixty-nine patients participated in the study. Forty patients were male, and twenty-nine were female. The mean age of the patients was forty-six years (range, seventeen to eighty-nine years). The dominant elbow was affected in thirty-three of the sixty-five patients who expressed hand dominance. Fifty-two patients had stopped work or leisure activities, or both, in the six months before they sought treatment because of pain in the elbow. The pain was associated with a variety of diagnoses, including lateral epicondylitis (twenty-one patients), a fracture of the radial head (nine patients), an intra-articular fracture of the distal part of the humerus (seven patients), primary osteoarthrosis (seven patients), medial epicondylitis (five patients), rheumatoid arthritis (five patients), a fracture of the olecranon (five patients), a dislocation of the elbow (four patients), a Monteggia fracture-dislocation (two patients), instability due to dysfunction of the medial collateral ligament (one patient), bursitis of the olecranon (one patient), an undiagnosed contracture of the elbow (one patient), and an infection at the site of a total elbow arthroplasty (one patient).
Distribution of Responses
The mean raw aggregate scores (and standard deviations) ranged from 69 ± 17 points to 80 ± 16 points (Table III). The scores ranged from 24 to 96 points with the system of The Hospital for Special Surgery, from 32 to 100 points with the system of Ewald et al., from 12 to 100 points with the system of Pritchard, from 4 to 100 points with the system of Broberg and Morrey, and from 15 to 100 points with the Mayo elbow-performance index (Fig. 2). A ceiling effect (observed when one or more patients achieved the maximum score of 100 points) was noted in association with all of the systems except that of The Hospital for Special Surgery.
The proportion of the sample that was assigned each of the categorical rankings varied widely among the systems (Table IV). Most of the rankings were either fair or poor with use of the system of Ewald et al. and the system of Broberg and Morrey, whereas most were either excellent or good with the system of Pritchard and the Mayo elbow-performance index. One-quarter of the rankings derived with the system of The Hospital for Special Surgery were designated as failed.
Agreement among the Scoring Systems (Table V)
Raw scores: Good correlation was observed when the systems were compared on the basis of raw aggregate scores: the Pearson product-moment correlation coefficients ranged from 0.79 (for the comparison between the system of The Hospital for Special Surgery and the system of Pritchard) to 0.90 (for the comparison between the system of Pritchard and the system of Broberg and Morrey). All correlations were significant at p < 0.05.
Categorical rankings: Substantially lower agreement was observed when the systems were compared on the basis of categorical rankings. The overall kappa coefficient across all five scoring systems was 0.27 (95 per cent confidence interval, 0.23 to 0.31), which indicated only fair agreement. The quadratic weighted kappa coefficients ranged from 0.18 to 0.49, which indicated only slight-to-moderate agreement.
Construct Validity
Patient-rated severity (Table VI): Duncan post hoc analysis demonstrated that the system of Ewald et al. had good discriminant construct validity with respect to patient-rated severity. The spectrum of scores was wide (45.0 points; range, 50.5 to 95.5 points), and the scores fell into four groups in a logical order. The Mayo index also was highly discriminating, with the scores falling into four groups. The system of Broberg and Morrey and that of The Hospital for Special Surgery were less discriminating, with the scores falling into only three groups. The system of Pritchard had the worst discriminant construct validity: the spectrum of scores was narrow (24.9 points; range, 71.9 to 96.8 points), and the scores fell into only two groups. Analysis of variance demonstrated significant differences across the patient-rated levels of severity (p < 0.05).
Physician-rated severity (Table VI): As the physician (D. C. T.) rated only two patients as having very severe impairment, the categories of severe and very severe were combined for the analysis. The Mayo elbow-performance index and the systems of Pritchard, Ewald et al., and Broberg and Morrey demonstrated excellent discriminant construct validity with respect to physician-rated severity: the scores fell into four groups, which corresponded to the four levels of physician-rated severity. The system of The Hospital for Special Surgery was less discriminating: the spectrum of scores was narrow (38.1 points; range, 46.1 to 84.2 points), and the scores fell into only two groups. Analysis of variance demonstrated significant differences across the physician-rated levels of severity (p < 0.05).
Other measures (Table VII): The scores for function derived from the visual-analog scale were only moderately correlated with the aggregate scores according to the elbow-scoring systems (Pearson product-moment correlation coefficients, 0.44 to 0.66), and the correlation between the pain scores derived from the visual-analog scale and the aggregate scores according to the elbow-scoring systems was slightly lower (Pearson product-moment correlation coefficients, 0.35 to 0.61). The scores derived from both of the patient-completed questionnaires were correlated more closely with the visual-analog-scale scores than were the scores according to the elbow-scoring systems. The scores derived from the elbow-scoring systems, except for the system of The Hospital for Special Surgery, were correlated more closely with those according to the patient-completed functional questionnaires (Pearson product-moment correlation coefficients, 0.55 to 0.75) than they were with the function and pain scores derived from the visual-analog scales (Pearson product-moment correlation coefficients, 0.35 to 0.66). The system of Ewald et al. consistently ranked first or second in the testing of construct validity.
Test-Retest Reliability of the Functional Questionnaires
The intraclass correlation coefficient for the DASH questionnaire was 0.92, indicating excellent reliability. The coefficient for the M-ASES questionnaire was slightly lower (0.79) but was still within what was considered to be an acceptable level of reliability. The Pearson product-moment correlation coefficients were 0.92 and 0.78 for the DASH and M-ASES questionnaires, respectively.
Discussion
A valid, reliable, and widely accepted method of reporting results following operations on the elbow is needed. The lack of a widely accepted outcome measure can lead to confusion when one attempts to determine the severity of impairment. Systems that are based on categorical rankings compartmentalize results into several categories (or domains) that clinicians use in decision-making. Accordingly, such scoring systems may be more appealing to clinicians than other types of outcome measures, such as patient-completed functional questionnaires. However, the admixture of clinical and functional criteria can create a confusing array of variables, and the questioning of a patient by an observer introduces the possibility of observer bias during the collection and interpretation of data.
When the content of a scoring system is reduced to a categorical ranking, the domains that have been assessed are no longer apparent. Some would argue that the use of categorical rankings simplifies the reporting of results. Unfortunately, several systems are used widely even though they differ with regard to the domains that they assess and the weightings employed to derive the categorical rankings. Moreover, the categorical rankings in current elbow-scoring systems were derived in an arbitrary fashion without input from patients. To our knowledge, the present study is the first objective analysis comparing different elbow-scoring systems in the same cohort of patients.
The five scoring systems were remarkably different with regard to the aspects of elbow function that they were designed to assess. All five systems were designed to assess pain and motion. Four systems were designed to assess the ability to perform specific tasks. Only three systems were designed to assess strength, even though it generally is recognized that pain substantially influences strength, as demonstrated by the reflex inhibition of muscular activity and the negative motivational influence of pain during strength-testing. Only two systems were designed to assess stability of the elbow. Another two systems were designed to assess deformity but not stability. This wide variability limits the interchangeability of the scoring systems.
The systems also were markedly different with regard to the relative weights that were assigned to the different domains. Pain received 30 to 50 points; motion, 10 to 28 points; strength, 0 to 25 points; stability, 0 to 10 points; function, 0 to 30 points; and deformity, 0 to 12 points.
Similar variation was seen with regard to the number and definition of the categorical rankings that were used in the various systems. One system had five rankings, three had four, and one had three. Such differences occasionally led to remarkably different categorical rankings for the same patient. For instance, one patient—a forty-two-year-old right-hand-dominant man who had the maximum value of ten centimeters on the visual-analog scale for pain, a range of flexion and extension of 20 to 95 degrees, a ten-kilogram grip strength on the involved side (compared with a thirty-two-kilogram grip strength on the contralateral side), good stability of the elbow, slight disability (a score of 2.5 points on the DASH questionnaire), and very severe impairment (according to the physician)—received a rating of good with the system of Pritchard and the Mayo elbow-performance index, a rating of fair with the system of Broberg and Morrey, a rating of poor with the system of Ewald et al., and a rating of failed with the system of The Hospital for Special Surgery.
There were marked differences among the five systems with regard to the distribution of the categorical ratings, even though the same group of patients was examined at the same time. Specifically, 11 to 45 per cent of the sample received a rating of excellent; 13 to 41 per cent, a rating of good; 21 to 39 per cent, a rating of fair; and 14 to 42 per cent, a rating of poor.
Despite the many differences among the systems, there was a surprising trend toward agreement when the systems were compared on the basis of the raw aggregate scores. The correlations between the raw scores were good. However, only slight-to-moderate agreement was noted when the systems were compared on the basis of categorical rankings19,28. For instance, the correlation between the raw aggregate scores derived from the system of Pritchard and the system of The Hospital for Special Surgery was good (Pearson product-moment correlation coefficient, 0.79), but the agreement was only slight (quadratic weighted kappa coefficient, 0.18) when the categorical rankings derived from those systems were compared. This finding is understandable given the variation in the criteria for the different categories. The differences among the scoring systems are so pervasive that the categorical rankings cannot be relied on to provide meaningful comparisons either within the same cohort of patients or between cohorts. Therefore, the results of studies that are based on the categorical rankings of different scoring systems cannot be compared or combined. However, the findings of the present study suggest that existing scoring systems could be used for comparative studies if investigators reported results as raw scores rather than as categorical rankings. This approach would require additional evaluation of the validity and interpretability of the raw scores.
Duncan post hoc analysis of patient-rated severity revealed the system of Pritchard to be the least discriminating: the scores fell into only two groups, with a narrow spectrum of values. The system of The Hospital for Special Surgery and the system of Broberg and Morrey were more discriminating, with the scores falling into three groups. The system of Ewald et al. and the Mayo index were the most discriminating: the scores fell into four groups, with a wide spectrum of values. The variable admixture of clinical and functional criteria probably contributed to the inconsistent discriminatory abilities of the five scoring systems. No score derived from the elbow-scoring systems was correlated as closely with the visual-analog-scale functional score as the scores from the DASH and M-ASES questionnaires were. The score from the M-ASES questionnaire was moderately correlated with the pain score even though that instrument focuses exclusively on functional limitations. This correlation is consistent with a pervasive influence of pain on function.
The scores that had been derived from both of the patient-completed functional questionnaires were correlated more closely with patient-rated function than any of the elbow scores were, and the scores from the DASH questionnaire were correlated more closely with patient-rated pain than any of the elbow scores were. Both questionnaires had acceptable test-retest reliability. Such questionnaires can be completed by telephone or mail, do not require a physical examination, and are used to derive raw scores rather than categorical rankings. Patient-completed functional questionnaires such as the DASH questionnaire have been developed with careful attention to psychometric principles of instrument design. Such outcome measures can be reliable, can discriminate among levels of severity of functional impact, can be sensitive to change over time, and can be statistically valid (2,3). The results of the present study suggest that such questionnaires should be used to assess functional outcome following the treatment of disorders involving the elbow.
There is evidence that patient-completed questionnaires should be used in combination with generic health-outcome measures for assessing the outcome of procedures involving the shoulder (5). We recognize that the age of the patient, the diagnosis, the use of medication, and comorbidity can influence the scores obtained with patient-completed functional questionnaires. As we performed a cross-sectional comparative study, however, we did not analyze the effect of these variables on the functional scores. In the present study, both the DASH and M-ASES questionnaires, which are designed to assess the function of the entire limb, performed as well as or better than the elbow-scoring systems in assessing the pain and functional loss perceived by the patients. The inclusion of such patient-completed functional questionnaires in future investigations would allow and possibly encourage the comparison of results between studies. Furthermore, the use of outcome measures that are designed to assess the function of the entire limb may help to determine the relative impact of disorders affecting various anatomical sites in the upper extremity.
We observed a remarkable lack of agreement when the categorical rankings of five different elbow-scoring systems were applied to the same cohort of patients. The correlations between the raw aggregate scores were better. On the basis of these findings, we believe that outcomes should be expressed as raw scores rather than as categorical rankings. Aggregate scoring systems will continue to appeal to clinicians and may be valid indicators of the severity of impairment of the elbow. However, it must be recognized that comparisons between studies that are based on different scoring systems are not valid and that the categorical rankings of different systems are not interchangeable. An ideal tool for the assessment of the elbow would measure pain, function, and disability simultaneously and accurately. At the present time, observer-based aggregate scoring systems seem to be reliable for the assessment of the clinical aspects of impairment of the elbow. Unfortunately, the variable admixture of clinical and functional criteria and the use of categorical rankings impair their validity. Patient-completed functional questionnaires can be valid and reliable instruments for the assessment of the elbow and are not limited by observer bias. Because the clinical result is important to the clinician, the outcome of therapies designed for the treatment of the elbow should be described on the basis of a patient-derived assessment of function, a clinical examination, and an assessment of pain.
NOTE: The authors thank Dr. M. D. McKee for providing patients who were included in the study. Ms. Beaton was supported by a Ph.D. Fellowship from the Medical Research Council of Canada during the course of the study.
Bartko, J. J.: The intraclass correlation coefficient as a measure of reliability. Psychol. Rep., 19: 3-11, 1966.
Beaton, D. E., and Richards, R. R.: Measuring function of the shoulder. A cross-sectional comparison of five questionnaires. J. Bone and Joint Surg., 78-A: 882-890, June 1996.
Beaton, D. E., and Richards, R. R.: Evaluating the reliability and responsiveness of five shoulder questionnaires. Unpublished data.
Beaton, D. E.; Dumont, A.; Mackay, M. B.; and Richards, R. R.: Steindler and pectoralis major flexorplasty: a comparative analysis. J. Hand Surg., 20A: 747-756, 1995.
Bergner, M., and Rothman, M. L.: Health status measures: an overview and guide for selection. Ann. Rev. Pub. Health, 8: 191-210, 1987.
Bohannon, R. W.: Test-retest reliability of hand-held dynamometry during a single session of strength assessment. Phys. Ther., 66: 206-209, 1986.
Broberg, M. A., and Morrey, B. F.: Results of delayed excision of the radial head after fracture. J. Bone and Joint Surg., 68-A: 669-674, June 1986.
Deyo, R. A.; Andersson, G.; Bombardier, C.; Cherkin, D. C.; Keller, R. B.; Lee, C. K.; Liang, M. H.; Lipscomb, B.; Shekelle, P.; Spratt, K. F.; and Weinstein, J. N.: Outcome measures for studying patients with low back pain. Spine, 19(18S): 2032-S2036, 1994.
Donner, A., and Eliasziw, M.: Sample size requirements for reliability studies. Statist. Med., 6: 441-448, 1987.
Ewald, F. C.; Scheinberg, R. D.; Poss, R.; Thomas, W. H.; Scott, R. D.; and Sledge, C. B.: Capitellocondylar total elbow arthroplasty. Two to five-year follow-up in rheumatoid arthritis. J. Bone and Joint Surg., 62-A: 1259-1263, Dec. 1980.
Figgie, M. P.; Inglis, A. E.; Mow, C. S.; and Figgie, H. E., III: Total elbow arthroplasty for complete ankylosis of the elbow. J. Bone and Joint Surg., 71-A: 513-520, April 1989.
Fleiss, J. L.: The Design and Analysis of Clinical Experiments. New York, John Wiley and Sons, 1986.
Gajdosik, R. L., and Bohannon, R. W.: Clinical measurement of range of motion. Review of goniometry emphasizing reliability and validity. Phys. Ther., 67: 1867-1872, 1987.
Gerber, C.: Integrated scoring systems for the functional assessment of the shoulder. In The Shoulder: A Balance of Mobility and Stability, p. 531. Edited by F. A. Matsen, III, F. H. Fu, and R. J. Hawkins. Rosemont, Illinois, The American Academy of Orthopaedic Surgeons, 1993.
Hudak, P. L.; Amadio, P. C.; and Bombardier, C.: Development of an upper extremity outcome measure: the DASH (Disabilities of the Arm, Shoulder and Hand). The Upper Extremity Collaborative Group (UECG). Am. J. Indust. Med., 29: 602-608, 1996.
Huskisson, E. C.; Jones, J.; and Scott, P. J.: Application of visual-analogue scales to the measurement of functional capacity. Rheumatol. and Rehab., 15: 185-187, 1976.
Kirshner, B., and Guyatt, G.: A methodological framework for assessing health indices. J. Chron. Dis., 38: 27-36, 1985.
Lachin, J. M.: Introduction to sample size determination and power analysis for clinical trials. Controlled Clin. Trials, 2: 93-113, 1981.
Landis, J. R., and Koch, G. G.: The measurement of observer agreement for categorical data. Biometrics, 33: 159-174, 1977.
Law, M.: Measurement in occupational therapy: scientific criteria for evaluation. Canadian J. Occup. Ther., 54: 133-138, 1987.
Modabber, M. R., and Jupiter, J. B.: Current concepts review. Reconstruction for post-traumatic conditions of the elbow joint. J. Bone and Joint Surg., 77-A: 1431-1446, Sept. 1995.
Morrey, B. F.; An, K. N.; and Chao, E. Y. S.: Functional evaluation of the elbow. In The Elbow and Its Disorders, edited by B. F. Morrey. Ed. 2, pp. 86-97. Philadelphia, W. B. Saunders, 1993.
Pritchard, R. W.: Total elbow arthroplasty. In Joint Replacement in the Upper Limb, p. 67. London, Mechanical Engineering Publications, 1977.
Richards, R. R.; An, K.-N.; Bigliani, L. U.; Friedman, R. J.; Gartsman, G. M.; Gristina, A. G.; Iannotti, J. P.; Mow, V. C.; Sidles, J.; and Zuckerman, J. D.: A standardized method for the assessment of shoulder function. J. Shoulder and Elbow Surg., 3: 347-352, 1994.
Riddle, D. L.; Rothstein, J. M.; and Lamb, R. L.: Goniometric reliability in a clinical setting. Shoulder measurements. Phys. Ther., 67: 668-673, 1987.
Scott, J., and Huskisson, E. C.: Graphic representation of pain. Pain, 2: 175-184, 1976.
Shrout, P. E., and Fleiss, J. L.: Intraclass correlations: uses in assessing rater reliability. Psychol. Bull., 86: 420-428, 1979.
Soeken, K. L., and Prescott, P. A.: Issues in the use of kappa to estimate reliability. Med. Care, 24: 733-741, 1986.
Stratford, P.; Agostino, V.; Brazeau, C.; and Gowitzke, B. A.: Reliability of joint angle measurement: a discussion of methodology issues. Physiother. Canada, 36: 5-9, 1984.