Abstract
Background: In clinical trials, use of patient recall data would be beneficial when the collection of baseline data is impossible, such as in trauma situations. We investigated the ability of older patients to accurately recall their preoperative quality of life, function, and general health status at six weeks following total hip arthroplasty.
Methods: We randomized consecutive patients who were fifty-five years of age or older into two groups. At each assessment, patients completed self-report questionnaires (at four weeks preoperatively, on the day of surgery, and at six weeks and three months postoperatively for Group 1 and at six weeks and three months postoperatively for Group 2). At six weeks postoperatively, all patients completed the questionnaires on the basis of their recollection of their preoperative health status. We evaluated the validity and reliability of recall ratings, the degree of error in recall ratings, and the effects of the use of recall data on power and sample size requirements.
Results: A total of 174 patients (mean age, seventy-one years) who were undergoing either primary or revision total hip arthroplasty were randomized and included in the analysis (118 patients were in Group 1 and fifty-six were in Group 2). Agreement between actual and recalled data was excellent for disease-specific questionnaires (intraclass correlation coefficient, 0.86, 0.87, and 0.88) and moderate for generic health measures (intraclass correlation coefficient, 0.48, 0.58, and 0.60). Increased error associated with recalled ratings compared with actual ratings necessitates minimal increases in sample size or results in small decreases in power.
Conclusions: Patients undergoing total hip arthroplasty can accurately recall their preoperative health status at six weeks postoperatively.
Level of Evidence: Therapeutic Level I. See Instructions to Authors for a complete description of levels of evidence.
Patient self-ratings of quality of life, general health, and functional status are often considered among the preferred methods of evaluating the effects of orthopaedic surgical interventions. In conducting clinical trials, researchers often measure the baseline health status of patients to demonstrate similarities between two groups prior to surgery and to adjust for any differences at baseline in the analysis of outcome data, thereby increasing the power to demonstrate important between-group differences.
The process of baseline data collection can be difficult (in some cases, it means an additional visit for patients or coordination with the preadmission staff), costly, and time-consuming. Frequently, a substantial proportion of patients who appear to meet the study eligibility criteria prior to surgery prove to be ineligible following surgical examination. For example, when a unicompartmental knee arthroplasty is planned, up to 25% of patients could be excluded from receiving the procedure because of unanticipated involvement of the lateral or patellofemoral compartments. There may also be situations in which collection of baseline data is impossible. For example, following an acute injury, the patient is not seen prior to surgery and therefore preoperative baseline data cannot be obtained.
If patients can accurately recall their preoperative quality of life following surgery, it seems reasonable to substitute recalled ratings of baseline health status for prospectively collected baseline ratings. This would result in more efficient use of research staff resources and would greatly decrease the patient burden, as only the patients found eligible for participation in the study would complete baseline assessments.
A study by Bryant et al.1 found that patients undergoing knee arthroscopy with or without anterior cruciate ligament reconstruction were able to accurately recall their preoperative health status two weeks after surgery. The mean age of the patients was forty years. It is not clear, however, whether these results can be generalized to an older group. Furthermore, recall assessments done at six weeks following surgery (the usual time when patients return to their surgeon after a hip arthroplasty for the first postoperative visit) may be less accurate than recall assessments collected two weeks postoperatively.
The purpose of the present study was to determine whether patients who are fifty-five years of age or older and undergoing primary or revision total hip arthroplasty can accurately recall their preoperative quality of life, general health, and functional status at six weeks postoperatively.
Study Design
This was a prospective, randomized clinical trial with five orthopaedic surgeons participating in patient recruitment. Ethics approval was obtained from the Health Sciences Research Ethics Board at the University of Western Ontario. Patients who were scheduled for primary or revision total hip arthroplasty were contacted at least four weeks prior to surgery to determine their willingness to participate in the study. Consenting patients were randomly allocated into one of two groups. At each assessment, patients were asked to complete several self-report questionnaires, including disease-specific quality-of-life, general health, and functional status instruments. Group 1 underwent assessment at four weeks preoperatively, on the day of surgery, and at six weeks and three months postoperatively. Participants allocated to Group 2 underwent assessment at six weeks and at three months postoperatively (Fig. 1).
At six weeks after surgery, patients in both groups were provided with two sets of questionnaires. Patients were unaware that they would be receiving two sets of questionnaires and that they would be asked to recall their preoperative health status. For the first set of questionnaires, the patient was asked to recall his or her quality of life, general health, and function during the period four weeks prior to surgery and to complete the questionnaires according to that recall. After completion of the recalled version, the patient was then asked to assess the current quality of life, general health, and functional status over the past four weeks. The six-week time point was selected because it is the usual time when a patient returns for the first postoperative visit, and it therefore represents a time that would be the least burdensome for patients and research staff to complete baseline assessments should recall be shown to be sufficiently accurate. Finally, at three months after surgery, each patient completed the questionnaires to assess current quality of life and health status during the previous two weeks (Fig. 2).
In anticipation of the possibility that patients who completed the questionnaires on previous visits would actually remember their previous responses (producing agreement statistics between actual and recalled ratings that were falsely inflated), we randomly assigned patients into one of two groups: one group that would complete the questionnaires before being asked to recall and one group that would not. Since patients were randomly assigned to groups, participants in Group 1 and Group 2 were assumed to be similar and therefore should have similar recall ratings. If experience with the questionnaires influences the patient's ability to recall, two possible outcomes may be observed: (1) a systematic difference between groups for the recall ratings only (if the majority of patients with prior experience rate themselves as having better or worse health), or (2) a greater variability between patients within Group 1 for recalled ratings, which may be evidenced by heterogeneous variances between groups (if having prior experience causes some patients to overestimate prior health while others underestimate prior health). By randomizing patients into two groups, we were able to examine the effect of previous exposure to the questionnaires on recall.
Eligibility Criteria
Patients included in the study were fifty-five years of age or older and were undergoing either primary or revision hip replacement for the treatment of osteoarthritis. Patients who were currently enrolled in clinical trials that used similar questionnaires were excluded to decrease any learning effect. We also excluded patients undergoing minor procedures, those with rare diseases or conditions who would not normally be invited to participate in research studies, and patients with no fixed address or who would not be able to complete the questionnaires because of major psychiatric illness, cognitive impairments, or an inability to speak or understand English.
Randomization was stratified by surgeon (five surgeons were involved in recruitment) and the type of surgery being performed (primary or revision) to balance potential prognostic characteristics between groups. The randomization sequence was constructed with use of a computer algorithm with permuted block sizes of three and six. To ensure adequate concealment of allocation, the study coordinator established patient eligibility and obtained verbal consent prior to randomization.
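As an illustration of the allocation scheme described above, the following Python sketch generates a stratified permuted-block sequence. It is not the algorithm used in the study: the function names, the seed, the stratum labels, and the number of patients per stratum are hypothetical. Only the block sizes of three and six, the stratification factors, and the 2:1 allocation ratio (stated in the Sample Size section) are taken from the text.

```python
import random

def permuted_block(block_size, rng):
    """Return one shuffled block with a 2:1 ratio of Group 1 to Group 2."""
    if block_size % 3 != 0:
        raise ValueError("block size must be a multiple of 3 for 2:1 allocation")
    block = [1] * (2 * block_size // 3) + [2] * (block_size // 3)
    rng.shuffle(block)
    return block

def allocation_sequence(n_patients, rng, block_sizes=(3, 6)):
    """Concatenate randomly sized permuted blocks until n_patients are covered."""
    sequence = []
    while len(sequence) < n_patients:
        sequence.extend(permuted_block(rng.choice(block_sizes), rng))
    return sequence[:n_patients]

def stratified_sequences(strata, n_per_stratum, seed=0):
    """Build a separate allocation sequence for every stratum."""
    rng = random.Random(seed)
    return {s: allocation_sequence(n_per_stratum, rng) for s in strata}

# Hypothetical strata: five surgeons crossed with primary or revision surgery.
strata = [(surgeon, kind) for surgeon in "ABCDE" for kind in ("primary", "revision")]
sequences = stratified_sequences(strata, n_per_stratum=12)
```

Varying the block size at random makes the next assignment harder to guess than it would be with a single fixed block size, while every completed block still preserves the 2:1 ratio within its stratum.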
Outcome Measures
Questionnaires included the Lower Extremity Functional Scale, the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), the Oxford hip score, the Short Form-12 health survey (SF-12), and the feeling thermometer.
The Lower Extremity Functional Scale2,3 is a twenty-item, region-specific, quality-of-life questionnaire for patients with lower limb disorders. A change in score of at least 9 points is considered a clinically important difference. The Lower Extremity Functional Scale has face validity and has demonstrated construct validity, reliability, and responsiveness2,3.
The WOMAC4-6 is a twenty-four-item, disease-specific questionnaire divided into three domains: pain, stiffness, and difficulty with physical function. The WOMAC is extensively used and has been shown to be a valid, reliable instrument that is sensitive to change4-6. A change in score of 9 to 12 points has been shown to be a clinically important difference among patients with osteoarthritis6.
The Oxford hip score7-9 is a twelve-item, disease-specific, quality-of-life questionnaire designed specifically for patients undergoing total hip replacement to capture joint arthroplasty outcomes. This instrument has face validity, has demonstrated construct validity and reliability, and is sensitive to change7-9. Work is still in progress to develop the minimal clinically important difference for the Oxford hip score; however, it is expected to be between 3 and 5 points10.
The SF-12 health survey11 is a twelve-item generic health instrument that evaluates eight domains, including restrictions or limitations on physical and social activities, normal activities and responsibilities of daily living, pain, mental health and well-being, and perceptions of health. The SF-12 has been used extensively and has been shown to be valid, reliable, and responsive in a wide variety of populations and contexts, including patients with orthopaedic conditions11. It is generally accepted that the minimal clinically important difference for the SF-12 ranges from 3 to 5 points12.
The feeling thermometer13-15 is a visual analogue scale presented in the form of a thermometer with 100 intervals, ranging from the best state, which is full health (a score of 100), to the worst state, which is death (a score of 0). This instrument has face validity, it has demonstrated construct validity and reliability, and it has also shown responsiveness to change. The minimal clinically important difference on the feeling thermometer is estimated to be between 5 and 8 points13-15.
All questionnaires were transformed to a 100-point scale, with a higher score indicating better functioning or higher quality of life, for ease in comparison across scores.
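The text does not give the transformation itself. A linear rescaling is one plausible way to do it, sketched below in Python. The function name and the raw score ranges in the example are hypothetical and are not taken from the instruments' scoring manuals.

```python
def to_percent_scale(raw, worst, best):
    """Linearly map a raw score onto 0-100, where 100 is the best possible state.

    `worst` and `best` are the raw scores for the worst and best health states,
    so instruments on which a lower raw score is better are reversed automatically.
    """
    if worst == best:
        raise ValueError("worst and best raw scores must differ")
    return 100.0 * (raw - worst) / (best - worst)

# Hypothetical raw ranges, for illustration only.
higher_is_better = to_percent_scale(40, worst=0, best=80)
lower_is_better = to_percent_scale(24, worst=60, best=12)
```

Passing the worst and best raw scores explicitly, rather than a minimum and maximum, means one function handles both scoring directions.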
At each assessment, the patients were asked to consider their quality of life, health, and functional status during the past four weeks. For the recall assessment, we asked patients to consider their health status during the four weeks prior to surgery and to respond to the questionnaires on the basis of that recall. Therefore, we considered the ratings provided on the day of surgery to be the gold, or criterion, standard of the preoperative health of the patient; thus, recalled ratings collected six weeks postoperatively should accurately predict ratings provided on the day of surgery. It was also assumed that both time points measure the same construct and would therefore have high agreement or reliability.
Sample Size
To provide estimates of agreement between the recall and actual data, the appropriate sample size calculation is one that allows us to estimate a parameter (0.85) with a prespecified level of precision (a 95% confidence interval no wider than 0.10). Using sample size calculations for estimating a parameter16, we needed 111 participants in Group 1. To address our second objective, a between-group comparison of the recalled and current ratings of Group 1 and Group 2, with adequate power (80%) and a type-I error rate of 0.05, we required approximately fifty patients per group. Thus, we randomized patients using a 2:1 randomization schedule (Group 1:Group 2).
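The precision-based logic can be sketched in Python as follows. This is an assumption-laden illustration, not the calculation from reference 16: it applies the Fisher z approximation for a Pearson correlation as a stand-in for the agreement parameter, and the function names are hypothetical. Because the method differs, it should not be expected to reproduce the figure of 111 exactly.

```python
import math
from statistics import NormalDist

def ci_width(r, n, confidence=0.95):
    """Width of the Fisher z-based confidence interval for a correlation r."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    half = z_crit / math.sqrt(n - 3)
    centre = math.atanh(r)
    return math.tanh(centre + half) - math.tanh(centre - half)

def n_for_width(r, max_width, confidence=0.95):
    """Smallest n whose interval is no wider than max_width."""
    n = 4  # the approximation needs n > 3
    while ci_width(r, n, confidence) > max_width:
        n += 1
    return n

# Expected parameter of 0.85 and a maximum interval width of 0.10, as in the text.
n_required = n_for_width(r=0.85, max_width=0.10)
```

The search simply increases the sample size until the interval is narrow enough, which makes the trade-off between precision and recruitment explicit.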
Statistical Methods
Our first objective was to determine the reliability and the validity of recalled ratings. We assumed that, if valid, recalled ratings collected six weeks postoperatively would accurately predict ratings provided on the day of surgery. Second, we assumed that both the day-of-surgery and six-week recalled assessments were measuring the same construct and would therefore have high agreement or reliability. Finally, we wished to determine the amount of error in the recall ratings, including both the total variance (between-subject and within-subject variability and random error) as well as individual measurement error.
To determine the validity of the recalled ratings, we performed a linear regression to determine the ability of the patients' recalled data to predict the day-of-surgery ratings for each of the questionnaires. We then constructed scatterplots of the data with 95% prediction lines to explore the variability (between-subject and within-subject) and agreement between the two ratings at the group and individual levels.
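The distinction between group-level and individual-level prediction can be illustrated with a minimal Python sketch. The data are hypothetical, the function names are ours, and the intervals use a normal quantile of 1.96 in place of the exact t quantile, which is a simplifying assumption that is reasonable only for large samples.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares fit of y on x; returns the pieces needed for intervals."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(rss / (n - 2))
    return {"n": n, "x_bar": x_bar, "sxx": sxx, "slope": slope,
            "intercept": intercept, "resid_sd": resid_sd}

def interval(fit, x0, individual, z=1.96):
    """Approximate 95% interval at x0 for the group mean or for a single patient."""
    leverage = 1 / fit["n"] + (x0 - fit["x_bar"]) ** 2 / fit["sxx"]
    se = fit["resid_sd"] * math.sqrt(leverage + (1 if individual else 0))
    centre = fit["intercept"] + fit["slope"] * x0
    return centre - z * se, centre + z * se

# Hypothetical recalled (x) and day-of-surgery (y) scores on the 100-point scale.
recalled = [20, 35, 40, 55, 60, 75]
actual = [25, 30, 45, 50, 65, 70]
fit = fit_line(recalled, actual)
group_band = interval(fit, 50, individual=False)
patient_band = interval(fit, 50, individual=True)
```

The individual interval adds the residual variance of a single observation to the uncertainty of the fitted mean, so it is always wider than the group interval at the same recalled score.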
To determine the reliability, we conducted a repeated-measures analysis of variance to determine whether there were any significant systematic differences between the day-of-surgery ratings and the recall ratings. To estimate the magnitude of the association between recalled and actual preoperative data, an intraclass correlation coefficient and its 95% confidence interval were calculated for each instrument. Intraclass correlation coefficient values of ≥0.75 are considered excellent, values between 0.4 and 0.75 are considered moderate, and values of ≤0.4 represent poor agreement17.
The intraclass correlation coefficient provides information about the total variance (between-subject and within-subject variability and random error), whereas the standard error of measurement expresses individual measurement error only, without the influence of variance among patients18. Therefore, we also calculated the standard error of measurement and its 95% confidence intervals.
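The two quantities can be computed from the same two-way analysis of variance, as in the Python sketch below. The text does not state which form of the intraclass correlation coefficient or which definition of the standard error of measurement was used, so both choices here are assumptions: a two-way, absolute-agreement, single-measure coefficient, and the square root of the residual mean square. The data and function name are hypothetical.

```python
import math

def icc_and_sem(actual, recalled):
    """Two-way absolute-agreement ICC (single measure) and SEM for paired ratings."""
    n, k = len(actual), 2
    rows = list(zip(actual, recalled))
    grand = sum(a + b for a, b in rows) / (n * k)
    row_means = [(a + b) / k for a, b in rows]
    col_means = [sum(actual) / n, sum(recalled) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between occasions
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ss_err = max(0.0, ss_total - ss_rows - ss_cols)          # guard against rounding
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    icc = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    sem = math.sqrt(mse)  # assumed definition: residual standard deviation
    return icc, sem

# Hypothetical day-of-surgery and recalled scores for four patients.
icc, sem = icc_and_sem([10, 20, 30, 40], [12, 18, 33, 39])
```

Because the coefficient is a ratio involving between-patient variance, a sample with a narrow range of scores can yield a modest coefficient even when the standard error of measurement is small.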
Our second objective was to determine whether there was a significant difference between Group 1 and Group 2 with regard to the mean scores and variances of data collected six weeks postoperatively. We compared the recalled ratings between participants in Group 1 and Group 2 using t tests to determine whether the two prior exposures to the questionnaires of participants in Group 1 (four-week preoperative and day-of-surgery ratings) had a systematic influence on their recalled ratings. We assumed that because patients were randomized into groups, participants in both groups would be similar with respect to their baseline health status and that, if accurate recall was possible, the scores for the recalled data would also be similar between groups. Further, to determine if there were any significant differences between the variances of ratings between groups, we used the Levene test for equality of variances19,20, where a significant test (p < 0.05) indicated unequal variances.
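For readers unfamiliar with the Levene test, the Python sketch below computes its test statistic together with a simple ratio of sample variances. It is illustrative only: the data are hypothetical, the version shown centres on the group mean (a median-centred variant also exists), and the p value, which would come from an F distribution, is deliberately omitted.

```python
from statistics import mean, variance

def levene_w(*groups):
    """Levene's W statistic using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    z = []
    for g in groups:
        centre = mean(g)
        z.append([abs(x - centre) for x in g])
    z_means = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

def variance_ratio(a, b):
    """Larger sample variance divided by the smaller one."""
    va, vb = variance(a), variance(b)
    return max(va, vb) / min(va, vb)

# Hypothetical recalled ratings for the two groups.
group1 = [42, 55, 38, 61, 47, 50, 66, 35]
group2 = [45, 52, 40, 58, 49, 51]
w = levene_w(group1, group2)
ratio = variance_ratio(group1, group2)
```

The statistic is an analysis of variance carried out on absolute deviations, so it responds to differences in spread rather than to differences in means.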
Generally in a test-retest situation, patients with a stable disease are asked to complete self-assessments on two separate occasions, and differences in scores between the two occasions are usually attributed to random error, but they may actually consist of both random error and any true change in health status. Since one of the assessments in our study took place on the day of surgery, we hypothesized that a third source of error might arise from anxiety or nervousness on the day of surgery. If accurate recall is possible, then random error will be the only source of error, suggesting that agreement between recalled and actual ratings should be higher than that observed in a test-retest situation. Therefore, our third objective was to calculate the agreement between ratings from the day of surgery and four-week-preoperative data and compare them with the agreement between the day of surgery and recall data (Group 1). To determine the test-retest reliability of patient ratings provided at the preadmission appointment and on the day of surgery, we constructed an intraclass correlation coefficient for each instrument as well as scatterplots with 95% group and individual level prediction intervals.
Finally, for many investigators, the purpose of collecting baseline data is to use the data as a covariate when testing for significant differences between groups. In order for a covariate to contribute to a reduction in the unknown variance, it must have a correlation to the outcome of interest that is >0. Therefore, the next set of objectives was to determine the correlation between prospectively collected baseline data and three-month postoperative data. If the correlation is not >0, then collection of baseline data is not necessary for this population. If it is >0, our final objective was to compare the correlation between actual baseline and postoperative data with the correlation between retrospective baseline data and postoperative data and to assess the efficiency of using retrospective data by analyzing its effect on sample size and power.
To explore the efficiency of using recalled data in place of the prospectively collected baseline data, we used three common methods for making statistical comparisons between groups: (1) a t test of the posttest score only, (2) a t test of the change score (posttest minus pretest), and (3) an analysis of covariance, with actual or recalled pretest scores used as the covariate. For all calculations, the probability of type-I error was maintained at 0.05 and the probability of type-II error, at 0.20. A difference of 20% of the mean preoperative score was considered an important difference.
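A rough version of this comparison can be sketched in Python. The sketch rests on assumptions not stated in the text: a normal approximation, equal standard deviations before and after surgery, and the standard variance factors of 2(1 - rho) for a change score and 1 - rho squared for an analysis of covariance, where rho is the pretest-posttest correlation. The function name and the numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(sd, delta, rho=0.0, method="posttest", alpha=0.05, power=0.80):
    """Approximate patients per group needed to detect a difference of delta."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    base = 2 * (z * sd / delta) ** 2  # posttest-only comparison
    factors = {
        "posttest": 1.0,
        "change": 2 * (1 - rho),   # variance of posttest minus pretest
        "ancova": 1 - rho ** 2,    # residual variance after covariate adjustment
    }
    return math.ceil(base * factors[method])

# Hypothetical inputs: mean preoperative score 40, so delta is 20% of 40.
delta = 0.20 * 40
for method in ("posttest", "change", "ancova"):
    print(method, n_per_group(sd=18, delta=delta, rho=0.45, method=method))
```

Substituting recalled for actual baseline ratings enters this calculation only through the correlation: a lower correlation with the outcome raises the change-score and covariance-adjusted requirements, while the posttest-only requirement is unaffected.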
Source of Funding
There was no external funding source for this study.
Patient Characteristics
We assessed the eligibility of 221 consecutive patients who were scheduled for either a primary or revision total hip arthroplasty. Of those, thirty-nine did not participate and eight were excluded (Fig. 1).
The mean age of the study participants was 70.6 years (range, fifty-five to ninety years), and the majority of patients (89%) were retired. Eighty-four percent of the patients underwent a primary total hip arthroplasty, while 16% underwent a revision. Patients in Group 1 and Group 2 were similar with regard to age, sex, operative hip, type of procedure being performed (primary or revision hip arthroplasty), and prevalence of previous hip surgery. Table I provides a detailed description of the demographic characteristics of the study participants.
Objective 1: Ability of Patients to Recall Preoperative Quality of Life and General Health Status
The mean differences between actual baseline ratings collected on the day of surgery and recalled ratings provided six weeks postoperatively were small across all questionnaires. Three of the differences were significant (a mean difference of -2.74 [95% confidence interval, -4.80 to -0.69] from 39.42 to 42.16 for the WOMAC; -2.83 [95% confidence interval, -4.39 to -1.27] from 25.03 to 27.85 for the SF-12 physical component score; and 5.06 [95% confidence interval, 1.68 to 8.43] from 58.16 to 53.11 for the feeling thermometer). However, these were not thought to represent a clinically meaningful difference (Table II). Patients tended to underestimate their preoperative ratings on the Lower Extremity Functional Scale, WOMAC, and SF-12 physical component score (i.e., they recalled a lower quality of life than they had actually reported preoperatively), whereas on the Oxford hip score, SF-12 mental component score, and feeling thermometer, the recalled ratings of the patients were overestimations (recalled a higher quality of life) compared with the actual baseline ratings.
Scatterplots of the recalled data compared with the day-of-surgery data for the patients in Group 1 were suggestive of high levels of agreement (Fig. 2). The data were also consistent with the assumptions of linear regression (linearity, normality, and homoscedasticity) as verified by the residual analysis.
Recalled ratings were a significant predictor of actual baseline ratings (p < 0.001) across all questionnaires (Table III). The Pearson correlation coefficient indicated excellent agreement between ratings for the region-specific measures (r = 0.86 for the Lower Extremity Functional Scale, r = 0.87 for the Oxford hip score, and r = 0.89 for the WOMAC). The correlation between actual and recalled ratings of the generic health measures was moderate (r = 0.62 for the SF-12 physical component score, r = 0.48 for the SF-12 mental component score, and r = 0.63 for the feeling thermometer) (Table III).
Similarly, the agreement between recalled ratings and day-of-surgery ratings was excellent across the disease-specific questionnaires (the intraclass correlation coefficient was 0.86 [95% confidence interval, 0.79 to 0.90] for the Lower Extremity Functional Scale, 0.87 [95% confidence interval, 0.81 to 0.91] for the Oxford hip score, and 0.88 [95% confidence interval, 0.82 to 0.92] for the WOMAC), whereas agreement for the generic health questionnaires was moderate (the intraclass correlation coefficient was 0.58 [95% confidence interval, 0.40 to 0.71] for the SF-12 physical component score, 0.48 [95% confidence interval, 0.30 to 0.62] for the SF-12 mental component score, and 0.60 [95% confidence interval, 0.43 to 0.72] for the feeling thermometer) (Table II).
The standard error of measurement was relatively small for both the disease-specific and generic health questionnaires (Table II), suggesting that the lower levels of agreement between the day-of-surgery ratings and the six-week postoperative recalled ratings of the generic health measures (the intraclass correlation coefficient was 0.58 for the SF-12 physical component score, 0.48 for the SF-12 mental component score, and 0.60 for the feeling thermometer) were due to smaller between-subject variability, or less heterogeneity in scores, rather than to a greater degree of error.
Scatterplots with lines showing the mean and individual 95% prediction intervals for the WOMAC (an example of large between-subject variability) and the mental component score of the SF-12 (an example of small between-subject variability) are presented in Figure 2. The SF-12 mental component scores of the patients in our study group fell within the middle part of the scale, indicating that they did not represent the entire range of scores possible for the SF-12 among the general population. The disease or region-specific questionnaires (Lower Extremity Functional Scale, WOMAC, and Oxford hip score) show a larger between-subject effect, representing a greater proportion of the possible scores among patients undergoing a hip arthroplasty, and therefore demonstrate greater between-subject variability, as displayed in the WOMAC.
We performed identical analyses substituting the four-week preoperative data in place of the day-of-surgery data and obtained similar results.
Objective 2: The Influence of Prior Exposure to Instruments on Ability to Recall
The independent samples t test comparison between Group 1 and Group 2 with regard to the recalled ratings of the participants (at six weeks postoperatively) was not significant for any of the questionnaires except the WOMAC (see Appendix). The mean differences between Group-1 and Group-2 recalled ratings were small across all instruments, with only the WOMAC total score reaching significance with a mean difference of 7.45 (95% confidence interval, 1.02 to 13.87; p = 0.02), which suggests that previous exposure to the questionnaires causes patients to overestimate their actual baseline quality of life. This could be considered a spurious finding since the difference between groups did not reach significance for any of the other questionnaires. The variances of recalled ratings were also similar between groups across questionnaires, with only two differences reaching significance (the SF-12 physical component score [p < 0.01] and the feeling thermometer [p = 0.05]). This finding may be a result of being overpowered for this statistical test since variances were similar and certainly not different by a factor of four, which is considered a rule of thumb when considering the similarity of variances21 (see Appendix).
Objective 3: Test-Retest Reliability
Reliability between ratings provided at the four-week preoperative assessment and on the day of surgery was excellent across all questionnaires (an intraclass correlation coefficient of 0.97 for the Lower Extremity Functional Scale, 0.94 for the Oxford hip score, 0.96 for the WOMAC, 0.83 for the SF-12 physical component score, 0.91 for the SF-12 mental component score, and 0.94 for the feeling thermometer). Agreement between preoperative and day-of-surgery ratings was higher than agreement between actual and recall ratings across all instruments (see Appendix), suggesting that there is an additional source of error as a result of asking people to recall.
Objective 4: Effect on Sample Size and Power When Recall Data Used
For each questionnaire, the correlation between the day-of-surgery ratings and the three-month postoperative score did not differ significantly from the correlation between the recalled rating and three-month postoperative score. The correlation between actual preoperative ratings (day of surgery) and three-month postoperative ratings ranged from 0.42 to 0.54, whereas the correlation between recalled preoperative and three-month postoperative ratings ranged from 0.39 to 0.58 across questionnaires (see Appendix).
All sample size estimates that used recalled or actual data for a planned analysis of covariance were smaller than those that would be required for comparisons with use of a posttest-only score, whereas all of the calculations for sample size with use of recalled data for comparisons with use of change scores were greater than those that used a posttest-only score (see Appendix).
Similarly, the substitution of recalled ratings for prospectively collected baseline data has an impact on power. When change scores were used, the reduction in power estimates ranged from 0% to 14%. Also, when recalled ratings were used in place of actual baseline ratings for analysis of covariance statistical comparisons, power reductions ranging from 1% to 13% were estimated (see Appendix).
When an analysis of covariance statistical comparison was used, all estimates of power were greater than the 80% power of a planned posttest-only comparison (an increase in power of 7% to 11% if prospective baseline data were used or 1% to 9% if recalled ratings were used) and were greater than the change score power estimates, with an increase in power of 8% to 13% if actual data were used or an increase of 10% to 15% if recall data were used (see Appendix).
In summary, the use of recall data necessitates increased sample size requirements, in order to compensate for a loss in statistical power. The use of a more efficient statistical test (such as analysis of covariance) may overcome the inefficiency introduced by the use of recalled data.
We found that older patients who are undergoing total hip arthroplasty can accurately recall their preoperative quality of life, general health, and functional status at six weeks postoperatively. There are two dominant theories of memory that address the accuracy of patient recall: the response shift theory and the implicit theory of memory. According to the response shift theory, changes in an individual's health status may produce behavioral, cognitive, and affective changes that alter his or her standards, values, or conceptualization of health-related quality of life. This shift in one's ideas about health consequently influences the perceived quality of life22,23.
The implicit theory of memory24 suggests that individuals have a perception about the stability of their health status and about any conditions that might produce a change in their health, such as an intervention or surgery. Implicit theorists believe that recalling a previous state is difficult without contextual features to associate with the memory24,25. Without such a reference point, people begin their recollection by asking themselves how they are at the current time, followed by asking themselves how they think things have changed, and then infer their initial state24.
One factor thought to affect the accuracy of recall is the amount of time between the prospective and retrospective assessments. For example, researchers who asked patients to provide recalled ratings less than two weeks after an intervention found that patients can accurately recall their preoperative health status1,25-30, whereas the studies that used recalled patient ratings at two months or longer following an intervention did not find high agreement between the prospective and retrospective ratings31-39.
Although our results support the use of retrospective baseline data collection in this population, it would seem that recall data are most appropriate for group comparisons, as in a clinical trial, in which data are aggregated and then generalized to the population. However, if the clinician is interested in using a patient's preoperative health status to predict the outcome following surgery, our study demonstrates greater uncertainty in the ability to predict outcomes at the individual level when recalled ratings are used. This finding is illustrated in our scatterplots that present both the group and individual prediction lines (Fig. 2).
It is also important to consider the observed compared with the expected relationship between actual and recalled ratings. If we consider a test-retest situation in which patients with a stable disease are asked to complete a self-assessment on two separate occasions, differences in scores are attributed to random error and, to a much lesser degree, error due to true change. It is also possible that, because one of the assessments took place on the day of surgery, there may have been additional error due to anxiety or nervousness on the day of surgery. In a recall situation, however, if patients can accurately recall their prior health state, then random error should be the only source of error. If these assumptions are true, then we might expect the agreement in a recall situation to be higher than that observed in a test-retest situation. Our results, however, show that in fact the opposite is observed, suggesting that there is an additional source of error as a result of asking people to recall.
Finally, while the majority of studies that have investigated the accuracy of a patient's ability to recall preoperative health status did so for the purpose of conducting unplanned, retrospective studies, our purpose was to determine whether we could plan in advance to collect recalled ratings of quality of life, general health, and functional status in a prospective randomized trial, to improve the efficiency of data collection. Because "recall error" is present, it is important to investigate the effect of this error on sample size requirements or the power to make statistical comparisons at the end of the study. It was expected that recalled ratings would increase within-subject error and therefore overall error, or variance, resulting in larger estimates of sample size or a reduction in statistical power (an increase in type-II error rate). These expectations were confirmed by our results; recalled ratings did have greater associated variances.
Researchers who plan to use recall ratings in place of prospectively collected baseline data must decide whether the gains in the efficiency of data collection (i.e., a reduction in patient burden and research staff resources at the front end of the study) are worth the increases in estimates of sample size or the loss of power to make statistical comparisons at the study's conclusion. For example, in studies involving patients with rare diseases, it may be more feasible to expend resources in collecting baseline data prospectively, if possible, than to require a greater number of patients in the study.
Our findings highlight an important point about how analyses are conducted in clinical trials. When conducting an analysis involving the use of either a change score or an analysis of covariance, the magnitude of the association between the preoperative baseline and postoperative end-point scores is extremely important. Analyzing postoperative end-point scores alone has been described as inefficient by several authors40-45 whenever the correlation between baseline and final postoperative ratings is >0. For an analysis involving the use of change scores, this correlation needs to be at least 0.5 to avoid losses in statistical power42-44. For an analysis of covariance, statistical power is greater than that achieved with use of a change score or posttest-only score as soon as the correlation between preoperative and postoperative scores is >0, and this power increases as the strength of the association increases40,42,44.
Only two of the questionnaires in our study (the SF-12 physical component score and the feeling thermometer, both general health questionnaires) had a correlation of >0.5 between preoperative and postoperative scores, suggesting that a loss of power is probable if investigators use change scores to evaluate outcomes on the other questionnaires. Moreover, since the correlation between preoperative and postoperative scores is >0 across all questionnaires, comparisons with use of postoperative scores alone will also have less power than comparisons with use of an analysis of covariance40-44,46,47.
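The correlation thresholds described above can be reproduced with a minimal sketch based on standard variance formulas for pretest-posttest designs. It assumes, for illustration only, equal variances before and after surgery and a preoperative-postoperative correlation that is the same in both groups. The function name `error_variance` and the correlations shown are hypothetical and are not estimates from this study.

```python
def error_variance(rho, sigma2=1.0):
    """Per-patient variance of the analysed score under three analysis strategies.

    Assumes (for illustration only) equal variance sigma2 before and after
    surgery and a pre/post correlation rho that is the same in both groups.
    """
    return {
        "posttest_only": sigma2,                 # baseline ignored
        "change_score": 2 * sigma2 * (1 - rho),  # postoperative minus preoperative
        "ancova": sigma2 * (1 - rho ** 2),       # postoperative adjusted for baseline
    }


for rho in (0.0, 0.3, 0.5, 0.7):  # hypothetical correlations
    v = error_variance(rho)
    print(f"rho={rho:.1f}  post={v['posttest_only']:.2f}  "
          f"change={v['change_score']:.2f}  ancova={v['ancova']:.2f}")
```

Under these assumptions, the change score has a larger variance than the posttest-only score until the correlation reaches 0.5, whereas the analysis of covariance has the smallest variance of the three for any nonzero correlation. A smaller variance corresponds to greater power for a fixed sample size.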
The strengths of this study are that it was a large, randomized, controlled trial with multiple surgeons and with a wide variety of self-assessment instruments used to assess outcome (hip-specific, disease-specific, and generic health measures). Our sample included patients undergoing either a primary or a revision total hip arthroplasty. This is the first study to investigate recall specifically among older patients, with an average age of seventy-one years (range, fifty-five to ninety years), making the results generalizable to an older population.
Limitations of the present study include the uncertain generalizability of the results to other groups of surgical patients. It is possible that total hip arthroplasty served as an important event for this population, providing patients with an anchor from which to judge their preoperative state. In studies that involve less traumatic interventions, patients might not have a sufficiently important event to serve as a reference point from which to rate their previous health. Further, since previous surgical studies have shown that the time between actual assessments and recall assessments affects the ability to recall, interventions with less recovery time or no perceived recovery time may show evidence of response shift and/or implicit theory of memory at six weeks postoperatively.
In conclusion, patients undergoing total hip arthroplasty can recall their preoperative quality of life, general health, and functional status at six weeks postoperatively with sufficient accuracy to warrant replacing prospectively collected baseline data with retrospective ratings. In situations where a proportion of patients are excluded after having completed baseline assessments, investigators can improve the efficiency of data collection in this patient group by asking eligible patients to recall their preoperative state. We have shown that, in this population, investigators can substitute recalled ratings for prospectively collected ratings with minimal expected loss of statistical power, given the use of an efficient statistical test.
Tables presenting mean scores on the questionnaires, an assessment of the similarities between Groups 1 and 2, an assessment of test-retest reliability, and an assessment of the effect of the use of recalled ratings on sample size and power are available with the electronic versions of this article, on our web site at jbjs.org (go to the article citation and click on "Supplementary Material") and on our quarterly CD/DVD (call our subscription department, at 781-449-9780, to order the CD or DVD). 
Note: The authors thank all of the patients who participated in this study for their time, patience, and cooperation throughout the trial. They thank Abigail Thompson for her assistance in the recruitment of participants and coordination of patient visits. Finally, they thank Dr. Douglas Naudie, Dr. Richard McCalden, Dr. James McAuley, and Dr. Robert Bourne for allowing them access to their patients.
Bryant D, Norman G, Stratford P, Marx RG, Walter SD, Guyatt G. Patients undergoing knee surgery provided accurate ratings of preoperative quality of life and function 2 weeks after surgery. J Clin Epidemiol. 2006;59:984-93.
Binkley JM, Stratford PW, Lott SA, Riddle DL. The Lower Extremity Functional Scale (LEFS): scale development, measurement properties, and clinical application. North American Orthopaedic Rehabilitation Research Network. Phys Ther. 1999;79:371-83.
Watson CJ, Propps M, Ratner J, Zeigler DL, Horton P, Smith SS. Reliability and responsiveness of the lower extremity functional scale and the anterior knee pain scale in patients with anterior knee pain. J Orthop Sports Phys Ther. 2005;35:136-46.
Bellamy N, Buchanan WW, Goldsmith CH, Campbell J, Stitt LW. Validation study of WOMAC: a health status instrument for measuring clinically important patient relevant outcomes to antirheumatic drug therapy in patients with osteoarthritis of the hip or knee. J Rheumatol. 1988;15:1833-40.
Davies GM, Watson DJ, Bellamy N. Comparison of the responsiveness and relative effect size of the Western Ontario and McMaster Universities Osteoarthritis Index and the short-form Medical Outcomes Study Survey in a randomized, clinical trial of osteoarthritis patients. Arthritis Care Res. 1999;12:172-9.
Ehrich EW, Davies GM, Watson DJ, Bolognese JA, Seidenberg BC, Bellamy N. Minimal perceptible clinical improvement with the Western Ontario and McMaster Universities osteoarthritis index questionnaire and global assessments in patients with osteoarthritis. J Rheumatol. 2000;27:2635-41.
Dawson J, Fitzpatrick R, Frost S, Gundle R, McLardy-Smith P, Murray D. Evidence for the validity of a patient-based instrument for assessment of outcome after revision hip replacement. J Bone Joint Surg Br. 2001;83:1125-9.
Dawson J, Fitzpatrick R, Murray D, Carr A. Comparison of measures to assess outcomes in total hip replacement surgery. Qual Health Care. 1996;5:81-8.
Pynsent PB, Adams DJ, Disney SP. The Oxford hip and knee outcome questionnaires for arthroplasty. J Bone Joint Surg Br. 2005;87:241-8.
Murray DW, Fitzpatrick R, Rogers K, Pandit H, Beard DJ, Carr AJ, Dawson J. The use of the Oxford hip and knee scores. J Bone Joint Surg Br. 2007;89:1010-4.
Ware J Jr, Kosinski M, Keller SD. A 12-Item Short-Form Health Survey: construction of scales and preliminary tests of reliability and validity. Med Care. 1996;34:220-33.
Drummond M. Introducing economic and quality of life measurements into clinical studies. Ann Med. 2001;33:344-9.
Puhan MA, Guyatt GH, Montori VM, Bhandari M, Devereaux PJ, Griffith L, Goldstein R, Schünemann HJ. The standard gamble demonstrated lower reliability than the feeling thermometer. J Clin Epidemiol. 2005;58:458-65.
Schünemann HJ, Griffith L, Stubbing D, Goldstein R, Guyatt GH. A clinical trial to evaluate the measurement properties of 2 direct preference instruments administered with and without hypothetical marker states. Med Decis Making. 2003;23:140-9.
Schünemann HJ, Griffith L, Jaeschke R, Goldstein R, Stubbing D, Guyatt GH. Evaluation of the minimal important difference for the feeling thermometer and the St. George's Respiratory Questionnaire in patients with chronic airflow obstruction. J Clin Epidemiol. 2003;56:1170-6.
Bonett DG. Sample size requirements for estimating intraclass correlations with desired precision. Stat Med. 2002;21:1331-5.
Stratford PW, Goldsmith CH. Use of the standard error as a reliability index of interest: an applied example using elbow flexor strength data. Phys Ther. 1997;77:745-50.
Pitman EJG. A note on normal correlation. Biometrika. 1939;31:9-12.
Snedecor GW, Cochran WG. Statistical methods. 6th ed. Ames, Iowa: Iowa State University Press; 1967.
Schwartz CE, Sprangers MA. Methodological approaches for assessing response shift in longitudinal health-related quality-of-life research. Soc Sci Med. 1999;48:1531-48.
Cohen BH. Explaining psychological statistics. 2nd ed. New York: Wiley; 2001. p 258-87.
Sprangers MA, Schwartz CE. Integrating response shift into health-related quality of life research: a theoretical model. Soc Sci Med. 1999;48:1507-15.
Ross M. Relation of implicit theories to the construction of personal histories. Psychological Review. 1989;96:341-57.
Baddeley A. Human memory. Theory and practice. Revised ed. Hove, East Sussex, UK: Psychology Press; 1997.
Babul N, Darke AC, Johnson DH, Charron-Vincent K. Using memory for pain in analgesic research. Ann Pharmacother. 1993;27:9-12.
Singer AJ, Kowalska A, Thode HC Jr. Ability of patients to accurately recall the severity of acute painful events. Acad Emerg Med. 2001;8:292-5.
Zonneveld LN, McGrath PJ, Reid GJ, Sorbi MJ. Accuracy of children's pain memories. Pain. 1997;71:297-302.
Hunter M, Philips C, Rachman S. Memory for pain. Pain. 1979;6:35-46.
Kreulen GJ, Stommel M, Gutek BA, Burns LR, Braden CJ. Utility of retrospective pretest ratings of patient satisfaction with health status. Res Nurs Health. 2002;25:233-41.
ten Klooster PM, Drossaers-Bakker KW, Taal E, van de Laar MA. Can we assess baseline pain and global health retrospectively? Clin Exp Rheumatol. 2007;25:176-81.
Pellisé F, Vidal X, Hernández A, Cedraschi C, Bagó J, Villanueva C. Reliability of retrospective clinical data to evaluate the effectiveness of lumbar fusion in chronic low back pain. Spine. 2005;30:365-8.
Everts B, Karlson B, Währborg P, Abdon N, Herlitz J, Hedner T. Pain recollection after chest pain of cardiac origin. Cardiology. 1999;92:115-20.
Aseltine RH Jr, Carlson KJ, Fowler FJ Jr, Barry MJ. Comparing prospective and retrospective measures of treatment outcomes. Med Care. 1995;33(4 Suppl):AS67-76.
Lingard EA, Wright EA, Sledge CB, Kinemax Outcomes Group. Pitfalls of using patient recall to derive preoperative status in outcome studies of total knee arthroplasty. J Bone Joint Surg Am. 2001;83:1149-56.
Mancuso CA, Charlson ME. Does recollection error threaten the validity of cross-sectional studies of effectiveness? Med Care. 1995;33(4 Suppl):AS77-88.
Linton SJ, Melin L. The accuracy of remembering chronic pain. Pain. 1982;13:281-5.
Dawson EG, Kanim LE, Sra P, Dorey FJ, Goldstein TB, Delamarter RB, Sandhu HS. Low back pain recollection versus concurrent accounts: outcomes analysis. Spine. 2002;27:984-94.
Feine JS, Lavigne GJ, Dao TT, Morin C, Lund JP. Memories of chronic pain and perceptions of relief. Pain. 1998;77:137-41.
Elliott AM, Smith BH, Hannaford PC, Smith WC, Chambers WA. Assessing change in chronic pain severity: the chronic pain grade compared with retrospective perceptions. Br J Gen Pract. 2002;52:269-74.
Cronbach LJ, Furby L. How should we measure ‘change’ - or should we? Psychol Bull. 1970;74:68-80.
Knapp TR. The (un)reliability of change scores in counseling research. Meas Eval Guid. 1980;13:149-57.
Lee J. A note on the comparison of group means based on repeated measurements of the same subject. J Chronic Dis. 1980;33:673-5.
Norman GR. Issues in the use of change scores in randomized trials. J Clin Epidemiol. 1989;42:1097-105.
Oldham PD. A note on the analysis of repeated measurements of the same subjects. J Chronic Dis. 1962;15:969-77.
Stanek EJ III. Choosing a pretest-posttest analysis. Am Stat. 1988;42:178-83.
Frison L, Pocock SJ. Repeated measures in clinical trials: analysis using mean summary statistics and its implications for design. Stat Med. 1992;11:1685-704.
Egger MJ, Coleman ML, Ward JR, Reading JC, Williams HJ. Uses and abuses of analysis of covariance in clinical trials. Control Clin Trials. 1985;6:12-24.