Abstract: Total knee arthroplasty and total hip arthroplasty are 2 of the most commonly performed elective orthopaedic procedures. They are remarkably successful in relieving pain and improving function in individuals with advanced, symptomatic arthritis. Since, in addition to providing benefits, these procedures pose risks, it is important to provide clinicians with guidance in determining which patients should undergo total joint replacement surgery. The development of the RAND approach in 1986 and its application to total hip and knee replacement have enabled clinicians, payers, and others to assess the appropriateness of past and current procedures for particular patients. However, current appropriateness criteria for elective orthopaedic procedures have important limitations that suggest that they be used cautiously. New approaches to the assessment of appropriateness that overcome many of these limitations are under development.
Over 1,000,000 individuals undergo primary total knee arthroplasty or total hip arthroplasty annually in the U.S., and over 1,200,000 total knee arthroplasties are performed annually worldwide1-3. Total knee and hip arthroplasty are associated with pain relief and functional improvement in >80% of cases4. While total knee arthroplasty and total hip arthroplasty are among the most cost-effective interventions in medicine, the expenditure on these procedures exceeds $10 billion annually in the U.S., and billions more worldwide5,6. This substantial investment has prompted considerable interest in assessing the appropriateness of total joint arthroplasty, as well as other elective orthopaedic procedures7.
“Appropriate” is defined generically as “suitable to a particular situation.”8 In the medical context, a procedure or treatment is considered appropriate if the expected health benefit of the treatment (increased life expectancy, pain relief, functional improvement, etc.) exceeds the expected adverse consequences (death, complications, anxiety, etc.) by a wide enough margin that the intervention is worth doing9. This definition, articulated in 1986 by Brook et al.9, has anchored thinking about medical appropriateness for decades.
It is useful to distinguish “appropriateness” from concordance with “treatment guidelines,” which are developed to help clinicians make decisions and are supported by the medical literature10. The American Academy of Orthopaedic Surgeons (AAOS), the Osteoarthritis Research Society International (OARSI), the American College of Rheumatology (ACR), the European League Against Rheumatism (EULAR), and the National Institute for Health and Care Excellence (NICE) all provide guidelines for treatment of knee osteoarthritis, including suggestions for the use of total knee arthroplasty11-15. For example, these guidelines advise physicians to recommend that patients with advanced radiographic knee osteoarthritis and substantial pain and activity limitation consider total knee arthroplasty. However, treatment guidelines are not designed to be used to determine whether any particular patient should undergo total knee arthroplasty, or whether total knee arthroplasty was an appropriate treatment in cases that have already been performed. This level of specificity is the province of appropriateness criteria, which have traditionally been developed to yield determinations about whether it is reasonable to carry out a particular treatment for a particular patient in a particular clinical circumstance, given inevitable trade-offs between short-term and long-term risks and benefits16.
Historical Considerations of Appropriateness
Several trends prompted intense interest in the appropriateness of medical treatments in the last decades of the twentieth century. One was growing concern about patient safety and the recognition that hospitalizations and procedures carry nontrivial risks of complications and death, which was documented in To Err Is Human: Building a Safer Health Care System17. Another salient concern was the steady increase in health-care costs. Policy-makers reasoned that since medical procedures use scarce resources and pose serious risks of complications and death, the procedures should be deemed appropriate before they are carried out. In addition, research demonstrated striking geographic variations in the rates of many discretionary procedures, including total joint arthroplasty, discectomy, laminectomy, and others18-21. Many scholars interpret these marked variations in surgery rates from one geographic region to another as evidence that clinicians in these areas are unsure of the appropriate indications for surgery. Finally, concerning total joint replacement, the utilization of these procedures has increased steadily over the last 3 decades and is projected to increase further, with the fastest increase in younger persons (<55 years old), who have the highest risk of needing a revision procedure22-24. These trends have raised the question of whether or not all of these procedures are appropriate25.
Measurement of Appropriateness
Brook and colleagues at RAND addressed the challenge of measuring appropriateness in the 1980s9. The RAND approach, summarized in Figure 1, remains the dominant paradigm in the field, although approaches that address some of the limitations of the RAND methodology are under development and will be discussed below26. The first objective in the RAND process is to identify patient characteristics that should be incorporated into appropriateness criteria. A team of researchers initiates this process by performing a rigorous, comprehensive review of pertinent literature to identify variables that are associated with the outcome of the procedure. In the case of orthopaedic procedures, these might include level of pain, adequacy of nonoperative treatment, or severity of radiographic findings26,27. The results of the literature review are provided to an expert clinician panel whose members use the literature evidence and their own clinical experience to finalize the list of variables that should be incorporated into the appropriateness criteria. The panel is generally composed of clinical experts in the relevant specialty field. We note that the composition of panels is influential because ratings may differ systematically, depending on the specialties of the clinicians on the panel28.
Next, these variables are grouped into case scenarios that represent every possible combination of the different values of each variable. To illustrate this process, let us assume that an expert panel recommends that 5 variables are used to determine the appropriateness of total knee arthroplasty, and that each variable is binary (e.g., adequate versus inadequate nonoperative therapy; severe pain versus milder pain). These 5 binary variables can be combined into 2^5 (32) distinct scenarios. (In reality, most criteria are more complex and involve hundreds of distinct scenarios.) The research team develops 32 clinical vignettes representing each of the possible combinations of the 5 key indicators. These vignettes are presented to the panel, and each member of the panel of expert clinicians rates each scenario from 1 (entirely inappropriate) to 9 (entirely appropriate). The ratings of individual scenarios are summarized across the panel. If the median rating of the panelists for a particular scenario is in the 1 to 3 range, the specific indication is interpreted as inappropriate. If the median rating is in the 4 to 6 range, it is regarded as uncertain; a median rating of 7 to 9 is considered appropriate. In this manner, the methodology produces an ordinal measure of appropriateness (ranging from 1 to 9), but can ultimately yield a 3-level determination of whether a procedure is appropriate, uncertain, or inappropriate. The test-retest reliability of the total hip and total knee replacement criteria rating by the same panel after 1 year was excellent, with kappa values of 0.78 (total knee arthroplasty) and 0.81 (total hip arthroplasty)29.
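The enumeration and rating steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 5 variable names and their values are hypothetical and are not taken from any published criteria, and the handling of half-point medians (which can arise with an even number of panelists) is our own assumption, since the 3 ranges above are defined only for whole numbers.

```python
from itertools import product
from statistics import median

# Hypothetical binary variables, for illustration only.
VARIABLES = {
    "pain": ("severe", "milder"),
    "nonoperative_therapy": ("adequate", "inadequate"),
    "radiographic_severity": ("advanced", "less advanced"),
    "range_of_motion": ("restricted", "normal"),
    "age_group": ("<55", ">=55"),
}


def enumerate_scenarios(variables):
    """Return one dict per combination of variable values (2^5 = 32 here)."""
    names = list(variables)
    return [dict(zip(names, combo)) for combo in product(*variables.values())]


def classify(panel_ratings):
    """Map the panel's median rating (1 to 9 scale) to a 3-level determination."""
    if not panel_ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 9 for r in panel_ratings):
        raise ValueError("ratings must lie between 1 and 9")
    m = median(panel_ratings)
    if m <= 3:
        return "inappropriate"
    if m >= 7:
        return "appropriate"
    # Assumption: half-point medians (3.5, 6.5) are treated as uncertain.
    return "uncertain"


scenarios = enumerate_scenarios(VARIABLES)
print(len(scenarios))               # number of vignettes the panel would rate
print(classify([7, 8, 9, 7, 9]))    # hypothetical ratings from a 5-member panel
```

Each scenario would be paired with its own list of panel ratings; the sketch classifies a single hypothetical list to keep the example short.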
Table I provides scenarios that illustrate the effect of particular clinical variables on appropriateness determinations using criteria developed with use of the RAND approach. Use in an individual <55 years old with severe knee pain and Kellgren-Lawrence (K-L) grade-4 findings in a single compartment, such as the medial tibiofemoral compartment, would be deemed uncertain. If 2 compartments were involved (e.g., the medial and the lateral tibiofemoral, or the medial and the patellofemoral), the determination would be appropriate. Similarly, use in an individual ≥55 years old with severe pain, normal range of motion, and K-L grade-3 radiographic findings would be rated uncertain. If the person had restricted motion, or a K-L grade-4 rating instead of a K-L grade-3 rating, the determination would be appropriate.
How Are Appropriateness Criteria Applied?
Clinicians, investigators, payers, policy-makers, and others can use appropriateness criteria prospectively (to assess the appropriateness of a proposed intervention) or retrospectively (to assess whether it was indeed appropriate to carry out procedures that were already performed). Prior authorization and second-opinion programs are familiar applications of the prospective paradigm. In these scenarios, a treatment (e.g., total knee arthroplasty) is being considered for a particular patient. A payer organization uses appropriateness criteria to determine whether it will pay for the proposed procedure. Because the process of developing appropriateness criteria is time-intensive and resource-intensive, the payer organization will generally use existing appropriateness criteria rather than develop them anew.
Retrospective applications permit evaluation of the appropriateness of procedures that have already been carried out. For example, as part of a quality improvement effort, an organization may wish to assess the appropriateness of 100 consecutive cases of total knee arthroplasty that have previously been carried out by a particular group of orthopaedic surgeons. The organization is likely to use an existing set of appropriateness criteria that have been rated by an expert panel, such as the Spanish criteria30,31. The organization would review medical records to evaluate the binary appropriateness indicators for each patient. As discussed above, each case presents a specific combination of the indicators, for which the panel had assigned a median rating from 1 to 9. These ratings are summarized across the cohort of 100 patients. If, for example, the median panel ratings for 10% of the cases fell in the 1 to 3 range, 20% fell in the 4 to 6 range, and 70% fell in the 7 to 9 range, we would conclude that 10% of cases were inappropriate, 20% were uncertain, and 70% were appropriate.
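The cohort-level tally in this example can be written out as a short Python sketch. The cohort below is hypothetical and is constructed only to reproduce the 10%, 20%, and 70% split used in the text; the function names are our own, and half-point medians are assumed to fall in the uncertain category.

```python
from collections import Counter

LABELS = ("inappropriate", "uncertain", "appropriate")


def category(median_rating):
    """Translate a panel median (1 to 9 scale) into a 3-level determination."""
    if not 1 <= median_rating <= 9:
        raise ValueError("median rating must lie between 1 and 9")
    if median_rating <= 3:
        return "inappropriate"
    if median_rating >= 7:
        return "appropriate"
    return "uncertain"  # assumption: covers 4 to 6 and any half-point medians


def summarize_cohort(median_ratings):
    """Return the percentage of cases in each appropriateness category."""
    if not median_ratings:
        raise ValueError("cohort is empty")
    counts = Counter(category(r) for r in median_ratings)
    n = len(median_ratings)
    return {label: 100 * counts[label] / n for label in LABELS}


# Hypothetical cohort of 100 cases: the panel median looked up for each case.
cohort = [2] * 10 + [5] * 20 + [8] * 70
print(summarize_cohort(cohort))
# {'inappropriate': 10.0, 'uncertain': 20.0, 'appropriate': 70.0}
```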
The prospective and retrospective applications of appropriateness criteria noted above are intended to influence utilization at the individual level. Investigators also have examined whether the introduction of appropriateness criteria can influence utilization at larger population levels. In 1 study (outside of the musculoskeletal arena), use of percutaneous coronary interventions declined in the 5 years following the dissemination of appropriate-use criteria developed in 2009 by the American College of Cardiology32. In particular, the number of cases that were deemed inappropriate declined substantially.
Use of Appropriateness Criteria in Total Hip and Knee Replacement Surgery and in Spine Procedures
Several authors have published retrospective evaluations of the appropriateness of total knee replacement, total hip replacement, and lumbar laminectomy. These evaluations are summarized in Table II and Figure 2. Each of these studies used appropriateness criteria developed with the RAND approach. Table II demonstrates the individual criteria used to determine appropriateness in these studies. The arthroplasty studies used age, measures of pain, prior therapy, radiographic severity, and localization of pathology (e.g., the particular knee compartments involved). Data on appropriateness criteria for spine procedures are also provided. These criteria share many of the features used in arthroplasty studies, but also include neurological findings and the presence of psychological impairment, disability claims, and active litigation, among others. Although we focus primarily on total joint replacement procedures in this review, we note that appropriateness criteria have been developed for treatment or diagnostic evaluation of other musculoskeletal conditions, including ankle fracture, rib fracture, and osteoporotic vertebral compression fracture33-35.
According to the studies of the appropriateness of total joint replacement and spine procedures, 14% to 49% of cases were deemed inappropriate, while 27% to 48% were deemed appropriate (Fig. 2). The striking proportion of inappropriate determinations raises the question of whether surgeons are performing too much surgery, or whether the RAND approach to measuring appropriateness has important limitations that should be addressed in a new generation of appropriateness criteria.
Limitations of the Appropriateness Methodology
Existing appropriateness criteria have important limitations, beginning with the “shelf life” during which a set of appropriateness criteria can be considered relevant. Recall that the RAND process is resource-intensive, involving the engagement of a team of researchers, a panel of clinical experts, and a face-to-face meeting of the panel. These steps require time, funding, and infrastructure. Since updating the criteria is similarly resource-intensive, the criteria are updated infrequently. For example, the total knee arthroplasty criteria used in the studies of Escobar et al. and Riddle et al. were developed using the RAND methodology around 2000 by an expert panel in Northern Spain30,31. The application of these criteria by Riddle et al. involved patients in the U.S. Osteoarthritis Initiative who had undergone total knee arthroplasty throughout 2008 to 2014. Between the criteria development in 2000 and the ensuing 8 to 14 years, techniques, risks, and ultimately surgical indications evolved. The demographic features of patients undergoing elective orthopaedic procedures continued to evolve as well. A much greater proportion of total knee replacement recipients, for example, are <65 years old presently compared with 2 decades ago24,36. These patients tend to present at an earlier phase in the trajectory of functional decline, and they may require a greater level of physical activity to induce pain, yielding less-severe pain scores. Given that pain severity figures prominently in appropriateness criteria, these factors may lead to a greater number of “inappropriate” procedures as judged by the 15-to-20-year-old criteria.
Table I gives some examples of sets of clinical characteristics deemed inappropriate by the criteria used in these studies of joint arthroplasty. A patient who is <55 years old with severe pain and K-L grade-4 unicompartmental (e.g., medial compartment) osteoarthritis would be considered a reasonable candidate for total knee arthroplasty by most surgeons today, but appropriateness was viewed as “uncertain” by the Spanish panel in 2000. Similarly, the patient with moderate pain and K-L grade-4 medial compartment osteoarthritis would be considered a reasonable candidate for total knee arthroplasty, but was deemed inappropriate by the Spanish panel. In the last 15 years, evidence has mounted that, on average, patients who undergo surgery with advanced pain and functional limitation have worse levels of pain and function in the months and years following total knee arthroplasty and total hip arthroplasty, prompting earlier intervention37. Furthermore, risks of infection and revision have diminished gradually in the last 2 decades, altering the ratio of the risks and benefits of performing surgery. Thus, many cases considered inappropriate 15 years ago would likely be considered appropriate by a panel today.
Another set of limitations concerns the absence of the patient’s perspective on appropriateness. A panel of clinicians might regard total knee arthroplasty as inappropriate in a patient with moderate pain, moderate functional limitation, normal range of motion, and unicompartmental involvement. Yet, some patients with this constellation of findings would readily accept the risk of complications in order to resume an active lifestyle. Although the patient experiences the effects, both positive and negative, of the procedure, the patient’s viewpoint has not been included in previous studies of appropriateness30,31,38. Thus, the lack of the patient perspective in most appropriateness criteria is a major limitation. A related concern is whether the imposition of appropriateness measures could exacerbate existing racial and ethnic disparities in utilization of total joint replacement. Evidence to date suggests that African Americans and Caucasians with advanced knee osteoarthritis are equally likely to have appropriate indications for surgery, and that referral to a surgeon is driven by appropriate clinical indications and not by race39. These data are reassuring, but the potential for appropriateness ratings to widen disparities merits continued evaluation.
Finally, appropriateness involves a determination that a procedure is “worth doing” (to quote Brook again), implying a judgment about whether the investment in total knee arthroplasty in a particular clinical setting is justified9. Yet, the perspectives of payers, insurers, and policy-makers who ultimately need to make these decisions have not been incorporated into the process of developing appropriateness criteria.
These limitations—the limited shelf life, absence of patient preferences, and lack of input from stakeholders that bear the costs of appropriateness determinations—suggest that appropriateness determinations using existing criteria should be interpreted cautiously, and that appropriateness criteria need to be regularly revisited and updated. Fortunately, a number of efforts are underway worldwide to reconceptualize and update appropriateness criteria. This work introduces multiple perspectives, including those of patients, payers, and a wider range of clinicians. To highlight an example, 1 of the authors (G.H.) is leading a Canadian effort to incorporate both patient and surgeon perspectives into a new generation of appropriateness criteria. The criteria that panels of Canadian patients and orthopaedic surgeons have agreed upon are shown in Table III. Of note, the criteria do not emphasize radiographic criteria or range of motion, features that figured prominently in the Spanish criteria (Table I) but have only modest associations with pain, function, and quality of life—the outcomes that matter most to patients. These more contemporary efforts to gain consensus among panel members also take advantage of Internet-based applications that make the process less time-consuming and less resource-intensive.
Appropriateness and Outcomes
Data from Quintana et al.40 document that total knee arthroplasty cases rated as inappropriate had less improvement in pain over follow-up than cases rated as appropriate (Fig. 3). Riddle et al.31 made similar observations in a U.S. cohort. These findings suggest that if improvement in pain is the metric used for success, patients with inappropriate indications had worse outcomes. However, the data in Figure 3 also show that patients for whom total knee arthroplasty was rated as inappropriate ultimately attained the best pain levels of the 3 groups. If the investigators had used the final pain score (rather than the change in score) as the determinant of outcomes, those with the “least appropriate” surgeries would have experienced the best outcomes. Thus, the decision of whether to use improvement (the “journey”) or final status (the “destination”) as the measure of outcome will determine whether “inappropriate” cases are likely to do well or poorly. Given these findings, it would seem premature and arbitrary to conclude that “inappropriate” cases have poor outcomes. To our knowledge, the relationships between newer generations of appropriateness criteria that incorporate patient perspectives and the outcomes of total joint replacement have not yet been studied26.
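The distinction between the “journey” and the “destination” can be made concrete with a small Python sketch. The pain scores below are entirely hypothetical (0 indicates no pain and 100 indicates the worst pain); they are not the data of Quintana et al. or Riddle et al. and were chosen only to show how the 2 outcome definitions can rank the same groups in opposite order.

```python
# Hypothetical mean pain scores per group; illustrative values only.
groups = {
    "appropriate": {"baseline": 70, "final": 30},
    "uncertain": {"baseline": 55, "final": 25},
    "inappropriate": {"baseline": 35, "final": 20},
}


def improvement(scores):
    """Change score (the "journey"): reduction in pain from baseline to follow-up."""
    return scores["baseline"] - scores["final"]


# Largest reduction in pain wins under the change-score definition.
best_by_improvement = max(groups, key=lambda g: improvement(groups[g]))

# Lowest final pain wins under the final-status (the "destination") definition.
best_by_final_status = min(groups, key=lambda g: groups[g]["final"])

print(best_by_improvement)   # appropriate
print(best_by_final_status)  # inappropriate
```

With these assumed values, the group rated inappropriate improves least because it starts with the least pain, yet it ends with the lowest pain score.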
In conclusion, clinical guidelines help to advise clinicians regarding treatment decisions, but the determination of whether a particular treatment is appropriate to carry out in a particular patient requires a metric with greater specificity. Appropriateness ratings were developed for this purpose. The process of developing these ratings has traditionally been so resource-intensive that few organizations have the resources to develop their own criteria or to update them. Consequently, criteria such as those developed in Northern Spain in the late 1990s using the RAND methodology are still used to adjudicate the appropriateness of cases performed presently across the globe. We would suggest that because these criteria were developed decades ago by physician panels without explicitly incorporating patient perspectives, the criteria should be used with caution, if at all, to judge the appropriateness of today’s cases. We anticipate that a new generation of appropriateness criteria will overcome many of the limitations of existing criteria, and we look forward to the development, dissemination, and ultimate adoption of these new criteria.
Investigation performed at Brigham and Women’s Hospital, Boston, Massachusetts
Disclosure: No external funding was received for this study. The Disclosure of Potential Conflicts of Interest forms are provided with the online version of the article.
Copyright © 2017 by The Journal of Bone and Joint Surgery, Incorporated