Preoperative pain and function are important predictors of outcome
after total hip and knee arthroplasty1.
Because there are no recognized levels of pain or functional deterioration
that are used as precise indications for total knee arthroplasty,
patients undergo the operation at varying levels of disease severity2-6. Failure to adjust for
preoperative status when evaluating the outcome of total knee arthroplasty
may lead to overestimation or underestimation of the effects of
the operation7.
Some study designs, such as cross-sectional and retrospective designs,
do not include the collection of preoperative data and often rely
on patients’ recall of their preoperative status8. Mancuso and Charlson7 analyzed the accuracy of patient
recall of preoperative status after total hip arthroplasty. The
patients in that study were recruited preoperatively and then were
evaluated at a mean of 2.5 years later, at which time they were
asked to recall their preoperative status. The patients were found
to have poor recall of pain, function, and impact on health but
moderate recall of walking ability. They tended to recall more pain
and better function, but the direction and magnitude of recollection
error varied for major subgroups of patients.
In the present study, we aimed to analyze patient recall of preoperative
status by surveying patients three months after total knee arthroplasty,
with no other interventions performed during the time between the
two assessments. We also evaluated whether there was a systematic
bias that could be used to adjust recall data in order to accurately
derive preoperative pain and functional status. We hypothesized
that the patients’ recall of preoperative pain and functional
status at three months after the procedure would have weak agreement
with the pain and functional status that they had reported preoperatively.
Design
Data for this analysis were obtained as part of the Kinemax Outcomes
study, a prospective observational study of the outcome of total
knee arthroplasty conducted at four centers in the United States,
six centers in the United Kingdom, and two centers in Australia.
The appropriate institutional review board or ethical committee
approved the study at each of the participating centers. Independent
research assistants at the participating sites recruited patients
from September 1997 to December 1998.
Patients
The patients who were included in the study were scheduled to
undergo a primary unilateral total knee arthroplasty for the treatment
of osteoarthritis. Patients were excluded if they had had bilateral
total knee arthroplasty within the previous twelve months, if they
were unable to complete the questionnaire, if they had a history
of knee joint infection, or if they had undergone prior knee reconstructive
surgery.
Data Collection
Independent research assistants recruited eligible patients and used
a uniform documentation system to collect data on the clinical history
and the results of a physical examination. Patient questionnaires
were administered preoperatively (within six weeks before the operation)
and three months postoperatively. The lead author (E.A.L.) trained
all research assistants to standardize data collection. Data were
entered into a single database at the coordinating center.
Data Elements
The clinical history was specifically reviewed with regard to previous
orthopaedic surgical procedures on the lower limb, pain, and functional
ability. The patient questionnaire included specific questions on
function, including walking distance and stair-climbing ability.
From these data, the Knee Society clinical rating system was used
to derive a pain score and a function score9,10.
To derive the pain score, the evaluator rates the patient’s knee
pain on a single seven-category scale that corresponds
to a score between 0 points (severe pain) and 50 points (no pain). The
function score allots 50 points for walking distance and 50 points
for stair-climbing ability. A score of 100 points indicates that
the subject is able to climb stairs normally and able to walk an
unlimited distance without an aid. Points are deducted if the patient
uses a walking aid.
The questionnaire also included questions about demographic characteristics,
such as age, gender, and race, and socioeconomic variables, such
as income and education. Two health-status instruments were used:
the Western Ontario and McMaster University Osteoarthritis Index
(WOMAC)11, which is a disease-specific
health-status instrument that was designed for patients with osteoarthritis
of the hip and knee, and the Medical Outcomes Study Short Form-36
Health Survey (SF-36)12-14,
which is a general health-status instrument that assesses both the
mental and the physical domain of health in several contrasting
ways. WOMAC scores were transformed to a scale of 0 to 100 points
for each domain (best score, 100 points). Different official versions
of the WOMAC were used for patients in Australia, the United States,
and the United Kingdom15. The
standardized method of calculating the SF-36 domains was
used so that each of the eight subscales had a score of 0 to 100
points (best score, 100 points). The SF-36 was first developed
from responses of a patient cohort in the United States and has
been subsequently validated for use in populations of patients in
Australia and the United Kingdom16,17.
Three months after the knee arthroplasty, the WOMAC and SF-36
were administered again and the patients were also asked to recall
their preoperative status on selected items from the WOMAC pain
scale and the SF-36 physical function scale. The two items
from the WOMAC scale asked the patients to recall how much pain
they had had while (1) walking on a flat surface and (2) going up
or down stairs. Patients rated their pain with use of a Likert-type
scale with five responses: none, mild, moderate, severe, or extreme.
The six items from the SF-36 scale asked patients to recall how
much limitation they had had during the following activities: (1) vigorous
activities, (2) moderate activities, (3) climbing one flight of
stairs, (4) walking >1 mi (1.6 km), (5) walking 100 yd
(91.4 m), and (6) bathing and dressing. Patients rated their function
with use of a Likert-type scale with three responses: not limited
at all, limited a little, or limited a lot.
Analysis of the Data
Statistical analyses were performed with use of the SAS statistical
package (SAS Institute, Cary, North Carolina)18.
The kappa statistic was used to measure agreement between individual
items. This test evaluates whether the amount of agreement is greater
than that expected by chance alone. A kappa coefficient of 1 indicates
perfect agreement, and a coefficient of 0 indicates that the responses
are completely independent (equal to the agreement expected by chance
alone)19. Because disagreements
in patients’ recall varied in magnitude and the sample size was
sufficient, we calculated a weighted kappa, which gives partial credit
for discrepant ratings that differ only slightly20. A weighted
kappa of <0.4 indicates poor agreement; 0.4 to 0.75, moderate-to-good
agreement; and >0.75, excellent agreement19.
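As an illustration of how a weighted kappa can be computed from a cross-tabulation of prospective and recalled ratings, the following is a minimal sketch. It is not the SAS procedure used in the study. The weighting scheme (linear by default, quadratic optional), the function name, and the example counts are assumptions made for illustration only.

```python
from typing import Sequence


def weighted_kappa(table: Sequence[Sequence[float]], quadratic: bool = False) -> float:
    """Weighted kappa for a square table of counts with ordered categories.

    table[i][j] = number of patients with prospective rating i and
    recalled rating j. The weighting scheme is an assumption; the
    article does not state which weights were used.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(row[j] for row in table) for j in range(k)]

    def weight(i: int, j: int) -> float:
        # 1 on the diagonal, shrinking toward 0 as the two ratings move apart.
        d = abs(i - j) / (k - 1)
        return 1.0 - (d * d if quadratic else d)

    cells = [(i, j) for i in range(k) for j in range(k)]
    # Observed weighted agreement.
    p_obs = sum(weight(i, j) * table[i][j] for i, j in cells) / n
    # Weighted agreement expected by chance, from the marginal totals.
    p_exp = sum(weight(i, j) * row_tot[i] * col_tot[j] for i, j in cells) / (n * n)
    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical 3-category counts (not data from this study).
example = [
    [20, 10, 2],
    [8, 30, 12],
    [1, 9, 40],
]
print(round(weighted_kappa(example), 3))
```

With these weights, a rating that is off by one category still earns partial credit, whereas an unweighted kappa would count it as a full disagreement.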
The McNemar test of symmetry was used to measure whether disagreement
in one direction was equal to disagreement in the other direction;
that is, whether patients whose postoperative rating disagreed with
their preoperative rating tended to recall more or less pain and/or
more or less limitation in functional activities three months after
the operation than they had reported preoperatively21.
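For tables with more than two categories, a test of symmetry of this kind is commonly computed as Bowker's generalization of the McNemar test. The sketch below assumes that formulation and returns only the chi-square statistic and its degrees of freedom. The function names, the choice to count only non-empty off-diagonal pairs toward the degrees of freedom, and the example counts are assumptions for illustration.

```python
from typing import Sequence, Tuple


def bowker_symmetry(table: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for a test of symmetry.

    table[i][j] = count with prospective rating i and recalled rating j.
    For a 2 x 2 table this reduces to the McNemar statistic
    (without continuity correction).
    """
    k = len(table)
    stat = 0.0
    df = 0
    for i in range(k):
        for j in range(i + 1, k):
            total = table[i][j] + table[j][i]
            if total > 0:
                # Pairs with no discordant responses carry no information
                # and are skipped (an implementation choice, assumed here).
                stat += (table[i][j] - table[j][i]) ** 2 / total
                df += 1
    return stat, df


def discordant_counts(table: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Counts above and below the diagonal (the two directions of disagreement)."""
    k = len(table)
    above = sum(table[i][j] for i in range(k) for j in range(i + 1, k))
    below = sum(table[i][j] for i in range(k) for j in range(i))
    return above, below


# Hypothetical counts (not data from this study).
example = [
    [30, 12, 3],
    [5, 40, 15],
    [1, 6, 20],
]
print(bowker_symmetry(example), discordant_counts(example))
```

The statistic is referred to a chi-square distribution with the returned degrees of freedom to obtain a p value.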
In the clinical research literature, WOMAC and SF-36
scores are most commonly reported in the form of summary scales. We
therefore sought to determine the extent to which patient recall
influences these summary scores. Recalled pain and function
summary scores were defined by calculating the means of
the recall responses for pain and function items and then transforming
them to a 0 to 100-point scale (best score, 100 points). Prospective
pain and function summary scores were defined by calculating
the means of the preoperative ratings of pain and function for the
same items and then transforming them with use of the same algorithm.
Summary scores were calculated only if the patient had answered
all of the pain items and at least four of the function items. The recalled
and prospective summary scores were then correlated and expressed
with use of the nonparametric Spearman coefficient. All p values
were two-tailed.
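The summary-score rule and the rank correlation described above can be sketched as follows. The item coding (0 for the worst response up to the highest code for the best response), the function names, and the example values are assumptions for illustration. The article specifies only that item means were transformed to a 0 to 100-point scale and that a minimum number of items had to be answered.

```python
import math
from typing import List, Optional, Sequence


def summary_score(items: Sequence[Optional[int]], max_code: int,
                  min_answered: int) -> Optional[float]:
    """Mean of answered items rescaled to 0-100 (best score, 100).

    Assumed coding: 0 = worst response, max_code = best response.
    Returns None when fewer than min_answered items were answered.
    """
    answered = [x for x in items if x is not None]
    if len(answered) < min_answered:
        return None
    return 100.0 * (sum(answered) / len(answered)) / max_code


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            out[order[pos]] = avg
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman coefficient: Pearson correlation of the ranks.

    Assumes neither variable is constant (otherwise the denominator is 0).
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx)
    sy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(sx * sy)


# Two pain items (five responses, codes 0-4): both must be answered.
pain = summary_score([1, 2], max_code=4, min_answered=2)
# Six function items (three responses, codes 0-2): at least four answered.
function = summary_score([0, 1, None, 0, 2, 1], max_code=2, min_answered=4)
print(pain, function)
```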
We further analyzed subgroups of patients by age, gender, country,
educational level (dichotomized as less than high school or as high
school or more), SF-36 mental health scores, and whether
patients were better or worse in terms of the WOMAC function score
three months after the operation. The WOMAC function score has been
shown to be a highly responsive measure for detecting change in
functional status after total knee arthroplasty, and the relative
efficiency of that measure is greater than that of several traditional
measures of surgical outcome1,11.
In these subgroups, the strengths of the correlations between prospective
and recalled scores were compared with use of the Fisher test of
equality of two correlations for different samples.
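A common way to compare two correlations from independent samples is the Fisher z transformation. The sketch below assumes that approach. The function name and the example values are hypothetical, and applying the transformation to Spearman coefficients should be regarded as an approximation.

```python
import math
from typing import Tuple


def compare_correlations(r1: float, n1: int, r2: float, n2: int) -> Tuple[float, float]:
    """Fisher z test for equality of two correlations from independent samples.

    Returns (z statistic, two-tailed p value).
    """
    # Fisher transformation: atanh(r) is approximately normal with
    # variance 1 / (n - 3).
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-tailed p value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical subgroup correlations and sizes (not data from this study).
z, p = compare_correlations(0.60, 200, 0.40, 200)
print(round(z, 2), round(p, 4))
```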
The proportion of scores that varied by more than 10 points (10% of
the total range), equivalent to half of one standard deviation,
was calculated for all subgroups. Effect size is calculated as the
change (or difference) in a score divided by the standard deviation
of the score. A difference of half of one standard deviation corresponds
to an effect size of 0.5 and is considered to be a large difference22. For the outcome measures used here,
the standard deviation was approximately 20 points (possible score
range, 0 to 100 points), and a difference of 10 points between the
prospective and recalled scores was approximately equivalent to
half of one standard deviation. That is, if we were examining the
effect of an independent variable, we would expect a difference
of 10 points between groups to indicate a significant effect
of that variable. Additionally, it has been shown that changes
of 9 to 12 points on WOMAC scales are perceptible to patients with
knee osteoarthritis23. Therefore,
scores that differ by this range or more represent a clinically
important difference. The prospective and recalled summary scores
also were compared in terms of the strength of their correlations
with other data that were collected preoperatively, including the
Knee Society pain and function scores and the patients’ reports
of how far they were able to walk with and without support.
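The 10-point criterion can be made concrete with a short sketch that computes an effect size and the proportion of patients whose recalled score differs from the prospective score by more than a threshold. The function names and example scores are hypothetical. The strict "more than 10 points" comparison follows the wording above.

```python
from typing import Sequence


def effect_size(difference: float, sd: float) -> float:
    """Change (or difference) in a score divided by its standard deviation."""
    return difference / sd


def proportion_exceeding(prospective: Sequence[float], recalled: Sequence[float],
                         threshold: float = 10.0) -> float:
    """Share of patients whose two scores differ by more than the threshold."""
    diffs = [abs(r - p) for p, r in zip(prospective, recalled)]
    return sum(d > threshold for d in diffs) / len(diffs)


# With a standard deviation of about 20 points, a 10-point difference
# corresponds to an effect size of 0.5.
print(effect_size(10, 20))

# Hypothetical summary scores on the 0 to 100-point scale (not study data).
prospective = [50, 60, 70, 80]
recalled = [50, 75, 65, 95]
print(proportion_exceeding(prospective, recalled))
```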
A total of 862 patients were recruited, and recall data regarding
preoperative pain and function were collected from 770 patients
(89%) three months after the operation. Of the ninety-two
patients who did not complete the questionnaire, twelve had died,
five were unable to continue in the study due to other illness,
two had had a revision of the knee, eighteen had withdrawn from
the study, one had moved away, two were lost to follow-up,
ten were unable to attend the three-month assessment, and
forty-two attended the assessment and had a clinical examination
but did not complete the patient recall questions. The mean age
of those who completed the study was seventy years (range, thirty-eight
to ninety years), and 59% of the patients were female.
Half of the patients were from the United Kingdom, 30% were
from the United States, and 20% were from Australia. Fifty-five
percent of the patients had less than a high-school education. Only
13% of the patients were still working, and most (70%)
were retired. Most (61%) of the patients were married.
Table I represents
the cross-tabulation of the patients’ prospective and
recalled ratings of pain while walking on a flat surface. The highlighted
diagonal region indicates the proportion of patients whose preoperative
and postoperative responses agreed (51.9%), the area above
the diagonal indicates the proportion of patients who recalled more
pain (31.3%), and the area below the diagonal indicates
the proportion of patients who recalled less pain (16.8%).
Table II represents
the cross-tabulation of the patients’ prospective and
recalled ratings of function while walking for a distance of >1
mi. The highlighted diagonal region again represents the proportion
of patient responses that agreed (75.2%). It is important
to note that because the responses are clustered in the "limited
a lot" category, the data in the table are highly unbalanced.
Overall, there was poor agreement between the prospective and
recalled ratings for most items (weighted kappa, 0.20 to 0.39; 95% confidence
interval, 0.10 to 0.44); only one item, functional limitation during
bathing and dressing, demonstrated fair agreement (weighted kappa,
0.41; 95% confidence interval, 0.35 to 0.46) (Table III). The weighted
kappa values for the pain-related items were 0.37 and 0.39, although
only 6.2% and 6.0% of the responses varied by
more than one category when recalled ratings were compared with
prospective ratings. The weighted kappa values for the functional
items also indicated poor-to-fair agreement (range, 0.20 to 0.41),
and only 3.5% to 5.3% of the responses varied
by more than one category when recalled ratings were compared with
prospective ratings. For two of the functional items, vigorous activities
and walking >1 mi, the percentage agreement was high (86.5% and
75.2%, respectively), even though the weighted kappa value
was low (0.20 and 0.33, respectively). This paradox is due to the
fact that the data in these tables are highly unbalanced, and therefore
the kappa statistic is not the most informative method of analysis.
Patients whose postoperative rating disagreed with their prospective
rating tended to recall more pain than they had reported preoperatively
(Fig. 1).
The McNemar test indicated that patients recalled significantly
more pain for walking on a flat surface (p < 0.001). The recall
errors for the functional items were more random, with patients
tending to recall less limitation for vigorous and moderate activities
and walking >1 mi but more limitation for climbing one
flight of stairs and walking 100 yd (Fig. 2). The McNemar test indicated that
patients recalled significantly less limitation for walking >1
mi (p < 0.001) but significantly more limitation for walking
100 yd (p = 0.009).
We explored various reasons for the paradoxical finding that patients
recalled less limitation for walking >1 mi and more limitation
for walking 100 yd. The preoperative data on walking distance with
and without support were correlated much more strongly with the
prospective function summary scores than with the recalled function
summary scores (Table IV). Preoperatively, only about 15% of
the patients could walk >1 mi but 65% could walk >100
yd. Three months postoperatively, 45% of the patients were
able to walk >1 mi and 85% were able to walk >100
yd. We also looked at whether their walking distance improved, stayed
the same, or got worse and whether this change correlated with changes
in their perception of how limited they were preoperatively. This
analysis did not enable us to draw conclusions as to why patients’ perceptions
of walking limitation were recalled in such a varied fashion.
These pain-related and function-related items are rarely reported individually
and are more commonly reported as summary scales. Therefore, summary
scales were derived for pain and function, and then the prospective
summary scores were correlated with the recalled summary scores.
There was only moderate correlation between these scores, as indicated
by a Spearman correlation coefficient of 0.53 for pain and 0.48
for function (Table V).
In addition, 61% of the pain scores and 50% of
the function scores varied by >10 points (10% of
total range) when the prospective and recalled scores were compared.
Subgroup analysis demonstrated that patients whose WOMAC function
scores had deteriorated at three months after the operation (eighty-nine
patients; 10% of those for whom such data were available)
had significantly poorer recall of function (p = 0.02).
In this subgroup, 78% of the recalled pain scores and 60% of
the recalled function scores varied by >10 points from
the prospective scores. Patients who were seventy-five years of
age or older (285 patients; 33% of those for whom such
data were available) had significantly poorer recall of both pain
(p = 0.04) and function (p = 0.038). Patients
whose educational level was defined as high school or more (375
patients; 44% of those for whom such data were available)
also had significantly poorer recall of pain (p = 0.03)
(Table V).
Patients were grouped as having either high or low mental health
on the basis of their three-month SF-36 mental health score.
A score of ≤60 points was used to indicate poor mental health;
this score is equivalent to the seventy-fifth percentile of
a population-based group of patients with a diagnosis of clinical
depression24. The mean SF-36
mental health score (and standard deviation) at three months after
the operation was 75.6 ± 17.4 points, and
16.7% of the patients had a score of <60 points.
The high-mental-health and low-mental-health groups had similar
recall of pain, but the low-mental-health group had significantly
worse recall of function (p = 0.04). There were no significant
differences in the strength of correlations due to gender or country
(Table V).
The pain and function summary scores were also correlated with
pain and function data that had been collected preoperatively (Table IV). The Knee
Society pain score (the preoperative pain rating assigned by the
research assistants) had a significantly stronger correlation with
the prospective pain score than with the recalled pain score (p < 0.001).
Functional items that were assessed preoperatively, such as walking
distance and the Knee Society function score, had significantly
stronger correlations with the prospective function score than with
the recalled function score (p < 0.001 for all comparisons).
Changes in functional status (calculated as the three-month score
minus the preoperative score) were examined with use of either the
prospective or the recalled preoperative status. A higher proportion
of patients made greater improvements in both pain and function
when the recalled preoperative scores were used (Fig. 3).
The present study demonstrated poor-to-fair weighted kappa values
(≤0.41) for all of the individual pain and
function items. There was a significant trend for patients to recall
more pain, but there was random recollection error for the function
items. This result is similar to the findings of Mancuso and Charlson7. The usefulness of the weighted kappa
statistic was limited because the data on most of the items that
were analyzed in this study were highly unbalanced, leading to a
paradox between a high percentage agreement but a low kappa value. This
was especially evident in the items concerning vigorous activities
and walking >1 mi, for which the percentage agreement was
86.5% and 75.2%, respectively, and the weighted kappa
value was 0.20 and 0.33, respectively.
The advantage of using a kappa statistic to evaluate agreement between
prospective and recall data rather than just reporting the percentage
agreement (the proportion of values that are an exact match) is
that the kappa coefficient adjusts for the amount of agreement expected
to occur by chance alone. Patients’ ratings of their preoperative
status will always tend to be clustered at the more severe end of
the pain scale and the more limited end of the function scale. Therefore,
cross-tabulation of prospective and recalled responses
of preoperative status will have marginal totals (that is, row and
column totals) that are much larger at the more severe end of the
table. Consequently, these data will always be unbalanced, making interpretation
of the kappa score difficult because the marginal totals have such
a strong influence on how this statistic is calculated25. Analysis of the summary scales
is more useful for researchers who rely on the use of recall data
to derive preoperative status.
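The influence of the marginal totals on kappa can be seen in a small numerical sketch. The two 2 x 2 tables below are hypothetical and use an unweighted kappa for simplicity. Both have the same percentage agreement, but the table whose responses are clustered in one category yields a much lower kappa.

```python
from typing import Sequence


def percent_agreement(table: Sequence[Sequence[float]]) -> float:
    """Proportion of responses on the diagonal (exact matches)."""
    n = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / n


def cohen_kappa(table: Sequence[Sequence[float]]) -> float:
    """Unweighted kappa: agreement corrected for chance agreement."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(row[j] for row in table) for j in range(k)]
    p_obs = percent_agreement(table)
    # Chance agreement is driven entirely by the marginal totals.
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical tables (not data from this study).
balanced = [[45, 5], [5, 45]]    # responses spread evenly over categories
unbalanced = [[85, 5], [5, 5]]   # responses clustered in one category

print(round(percent_agreement(balanced), 2), round(cohen_kappa(balanced), 2))      # 0.9 0.8
print(round(percent_agreement(unbalanced), 2), round(cohen_kappa(unbalanced), 2))  # 0.9 0.44
```

In the unbalanced table the agreement expected by chance is already 0.82, so the observed agreement of 0.90 leaves little room for kappa to rise.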
Correlations between prospective and recalled summary scores
were only moderate, and a large proportion of scores differed by >10
points (equivalent to more than half of one standard deviation).
Additionally, patients who had comparatively worse WOMAC function
scores three months after the operation had significantly poorer
recall of function. Older patients (seventy-five years
of age or older) and patients with low mental health (an SF-36
mental health score of <60 points) also had poorer recall
of function. Patients who had completed high school or had more
education and older patients had significantly poorer recall of
pain.
When the prospective and recalled preoperative scores were compared
with other scores that had been collected preoperatively by the
research assistants and from the patients’ self-reports,
we found that the prospective scores consistently demonstrated stronger
correlation than the recalled scores did. Thus, using retrospective
recall of preoperative status to calculate a patient’s
change in symptoms or health status is not as accurate as using
the differences recorded in a prospective study and is at best only
a surrogate for the actual measurement of symptoms before and after
treatment.
Fortin et al.1 found that preoperative
pain and function as measured with the WOMAC and SF-36
were strong predictors of outcome after total knee arthroplasty
when these instruments were administered again six months after
surgery. The respective preoperative scores alone explained 25% of
the variation in WOMAC pain scores at six months and 21% of
the variation in SF-36 physical function scores at six
months. Because of the strong influence of preoperative status on
outcome, it is essential to use caution when interpreting the results
of studies that rely on recall data to derive preoperative status.
Some allowances must be made for the high level of variation to ensure
that the results reported are not overestimating or underestimating
the effectiveness of total knee arthroplasty.
Psychometric research has consistently shown that the retrospective
recall of a change in symptoms or health status is not as accurate
as the difference recorded in a prospective study8.
The subject’s memory of his or her health status is required, and
this memory is flawed by experiences that have occurred in the interval.
However, retrospective recall is a very useful measure when the
goal is to assess what the subject believes about the effects of
treatment. Patients rate their level of pain and functional disability
according to their internal standard. A response shift may occur
over time due to a variety of experiences that the patient has had
during this interval, and the internal standard that the patient
has used previously may be recalibrated26,27.
Depending on the research question being studied, recall information
can be the most appropriate data on which to base conclusions about
treatments.
Our study had several strengths. The recruitment of a large study
cohort enabled us to exclude patients who had had bilateral surgery
within the previous twelve months, ensuring that patients’ attention
would be focused on the index knee. The large study cohort also
enabled us to analyze the different subgroups of patients without
loss of statistical power. Patients were reviewed only three months
after total knee arthroplasty, which markedly decreased the intervening
health issues that might have influenced their responses if they
had been examined a year or more later.
Our study also had some limitations. We asked only limited recall
questions at three months after the procedure as we did not wish
to overburden patients. Therefore, we lost some of the scaling properties
for our summary scores. Also, we did not ask patients how accurately
they believed that they could recall their preoperative symptoms.
It has been shown that when recall is used it is essential to ascertain
how accurately the patients believe that they have recalled their
preoperative status8.
Researchers who use recall data to derive preoperative status must
take into account the fact that this is not a direct substitute
for prospectively collected data. Because of the variability in
recall data, there is the possibility that the effectiveness of
total knee arthroplasty may be overestimated or underestimated.
In our study, use of the kappa statistic to evaluate the level of
agreement between prospectively collected data and recalled preoperative
status was of little value because the data were highly unbalanced.
To draw valid conclusions from their analysis, researchers using
the kappa statistic to report on agreement must ensure that the
data are balanced.
Note: The Kinemax Outcomes Group includes William Gillespie,
Colin Howie, Ian Annan, Judith Lane (Princess Margaret Rose Hospital,
Edinburgh, Scotland); Ian Pinder, David Weir, Karen Bettinson (Freeman
Hospital, Newcastle upon Tyne, England); Maurice Needhoff, Roz Jackson (King’s
Mill Clinic, Mansfield, England); Tim Wilton, Peter Howard (Derbyshire
Royal Infirmary, Derby, England); Ian Forster, Paul Szyprt, Chris
Moran, David Whitaker, Mike Bullock, Zena Hinchcliffe (Queen’s
Medical Centre, Nottingham, England); Ian Learmonth, John Newman,
Chris Ackroyd, George Langkamer, Robert Spencer, Mark Shannon, Evert Smith,
John Dixon, Sarah Whitehouse (Avon Orthopedic Centre, Bristol, England);
Clement Sledge, Frederick Ewald, Robert Poss, John Wright, Scott
Martin, John Kwon, Yvette Valderrama (Brigham and Women’s
Hospital, Boston, MA); Steven Harwin, Michael Lichardi (Beth Israel
Medical Center, New York, NY); Mark Mehlhoff, Linda Weiler, Tom Cahalan
(Iowa Medical Clinic, Cedar Rapids, IA); Richard Cronk, Allyson
Sandago (Neuromuscular and Joint Center, Corvallis, OR); Stephen
Rackemann, Emma McLaughlin (The Knee Centre, Gold Coast, Australia);
and Peter Lewis, Robert Bauze, Jane Clasohm (Queen Elizabeth Hospital,
Adelaide, Australia).