Systematic reviews that are performed with use of meta-analytic methods
provide a valid means of summarizing the results of studies and can help
clinicians to make informed clinical
decisions1-3.
The levels of evidence used in several orthopaedic journals rank meta-analysis
of high-quality randomized controlled trials as the highest available
evidence4,5.
Systematic reviews differ from traditional reviews in that they involve asking
focused clinical questions, conducting comprehensive literature searches,
assessing the quality of the included studies, and conducting searches and
data abstraction in duplicate. Meta-analysis is simply the statistical
combination of the results across studies. Recently, multiple reviews focusing
on the same topic have led to conflicting conclusions in the medical
literature6-8.
These discordances complicate decision-making by surgeons, patients, and
policy-makers6.
Differences in systematic reviews may, in part, be related to the quality
of the systematic reviews
themselves9. The
Quality of Reporting of Meta-analyses (QUOROM) statement was introduced in the
late 1990s to improve the reporting of meta-analyses of randomized
controlled
trials10. This
statement is analogous to the Consolidated Standards of Reporting Trials
(CONSORT) statement, currently adopted by many orthopaedic
journals11. The aim
of the QUOROM checklist was to improve the quality of reporting, not to
prevent duplication of systematic reviews. A checklist to evaluate the internal
validity of systematic reviews (i.e., methodological quality) was described
and validated by Oxman and
Guyatt12. This
index was designed to rate the scientific quality of research
overviews and has been used in published evaluations of systematic
reviews1,6,8,12.
Differences in systematic reviews also may be related to the inclusion of
different randomized trials. One striking illustration comes from the
literature evaluating the optimal method for anterior cruciate ligament
surgery. Numerous randomized controlled trials comparing hamstring tendon
autograft with bone-patellar tendon-bone autograft for reconstruction of the
anterior cruciate ligament have been published over the past decade. Several
systematic reviews (and meta-analyses) have been published on this topic and
have differed in their design and results. Such conflicting reviews have led
to uncertainty among clinicians about the optimal operative approach for
anterior cruciate ligament repair.
The aim of the present review was to appraise the methodological quality
and the quality of reporting of systematic reviews on anterior cruciate
ligament reconstruction and to compare the quality of reporting with the
direction of the conclusions of the systematic review. We also aimed to
explore the reasons for differences among the overlapping systematic reviews
and to propose a guide to help in clinical decision-making when one is
confronted with overlapping discrepant systematic
reviews6.
Literature Search
We searched MEDLINE with OVID and PubMed (basic search, related articles,
and clinical queries search), EMBASE, and the Cochrane Database of Systematic
Reviews (CDSR) for relevant
articles13-15.
The search terms are available in the Appendix. We limited our search to the
English language. The last literature search was performed during the third
week of May 2006. Searches were performed by two of the authors (J.A.K.A. and
R.W.P.) independently.
Selection of Systematic Reviews
Inclusion Criteria
Studies were included if they used the term "systematic review"
or "meta-analysis" in the title or abstract. The included reviews
had to include a comparison between bone-patellar tendon-bone autografts and
hamstring autografts. There was no restriction in terms of the publication
date.
Exclusion Criteria
Narrative reviews, reviews that were part of a case report, editorials, and
studies that did not focus primarily on data synthesis were excluded. A
narrative review usually focuses on a broader topic without a
specific clinical question. Narrative reviews are typically book chapters that
cover a variety of broad topics, for example "anterior cruciate ligament
reconstruction," and not a specified question such as "what is the
relative effect of hamstring tendon autograft or bone-patellar tendon-bone
autograft on knee pain?" In addition, narrative reviews do not include
comprehensive and reproducible literature searches, assessments of study
quality, or statistical pooling of data.
Author Rationale for Repeating the Systematic Review
In the eligible systematic reviews, we searched for references to
previously conducted systematic reviews on the same topic, or so-called
"overlapping" systematic reviews. We investigated whether the
published review could have cited the previous report on the basis of its
acceptance for publication date or publication date and the reported date of
the last literature search or the date of the most recent citation in the
references. In cases of citation of previously reported systematic reviews or
meta-analyses, we abstracted the authors' rationale for repeating the
review.
Comprehensiveness of Reporting of Systematic Reviews
To assess the comprehensiveness, or quality, of reporting in the included
systematic reviews, we used the Quality of Reporting of Meta-analyses (QUOROM)
statement for the reporting of systematic
reviews8,10.
Detailed information on this statement and scoring scheme can be found at the
QUOROM website.
Although designed to guide authors in preparing their manuscripts, this
checklist has been used in other published reports to describe the quality of
reporting8,9.
Internal Validity (Methodological Quality) of Systematic Reviews
The Oxman and Guyatt index, also known as the Overview Quality Assessment
Questionnaire (OQAQ), is a well validated tool designed to evaluate the
internal validity of a review (i.e., the extent to which bias is
limited in the review). Furthermore, the OQAQ has been used in other reports
describing the methodological quality of systematic
reviews1,6,8,9.
The scoring scheme is provided in the Appendix.
Direction of the Systematic Reviews' Conclusions
We abstracted the conclusions from the manuscript as quotations and
categorized them into three groups: (1) favors bone-patellar tendon-bone
graft, (2) favors hamstring tendon graft, and (3) inconclusive. Furthermore,
we compared the conclusions abstracted from the reviews with the OQAQ and
QUOROM results.
Heterogeneity Across Primary Studies
Heterogeneity is an apparent difference in the treatment effect across
studies. Heterogeneity may exist when the study populations, interventions,
outcomes, or study methodologies differ appreciably across the studies that
are being combined statistically. We analyzed methods that were used in the
reviews to examine differences in the results of the primary studies and
specifically looked at differences in primary study quality, primary study
participants (patients), and graft fixation techniques. So-called sensitivity
analysis is "any test of the stability of the conclusions of a
healthcare evaluation over a range of probability estimates, value judgments,
and assumptions about the structure of the decisions to be
made"3,16. We explored whether the reviews evaluated possible
sources of heterogeneity across studies and whether the investigators formally
performed a sensitivity analysis. These possible sources included the level of
surgical expertise and/or the presence of a learning curve, characteristics
related to the population of patients (e.g., level of activity [inclusion of
high-demand athletes or so-called weekend warriors], age, gender), additional
injuries (e.g., injuries involving the meniscus), the timing of the
reconstruction (acute or delayed), the use of different surgical techniques
other than the technique under investigation (e.g., different graft-fixation
techniques, pretensioning, or differences in one treatment group [two or
four-strand hamstrings]), and rehabilitation protocols (e.g., full
weight-bearing or brace immobilization).
Reproducibility of Validity Assessment and Data Abstraction
All included manuscripts were independently assessed on methodological
reporting by three assessors (J.A.K.A., H.J.C., and R.W.P.). One assessor
(R.W.P.) was well trained in quality assessments, had completed a Cochrane
review course, and had co-authored a Cochrane systematic review of randomized
controlled trials and other systematic reviews. When two assessors
disagreed, consensus was sought through careful review and discussion. In
situations in which discrepancies persisted despite a consensus meeting, a
fourth assessor, an epidemiologist (M.B.), provided an additional opinion to
reach final consensus. This method of quality assessment, with a final
consensus meeting, has been commonly used in Cochrane
reviews17.
Analysis of the Included Reviews Using the Jadad Decision
Algorithm6
We used the guide to interpreting discordant systematic reviews as proposed
by Jadad and colleagues to evaluate our sample of eleven overlapping
systematic reviews6.
The decision algorithm is provided in
Figure 1. This algorithm helps
readers of systematic reviews to evaluate possible sources of discordance
among reviews. These sources include (1) the clinical question
(population of patients, interventions, outcome measures, and setting), (2)
primary study selection and inclusion (selection criteria,
application of selection criteria, strategies used to search the literature),
(3) data extraction from the primary studies (methods used to measure
outcomes, end points, human error [random or systematic]), (4) assessment
of primary study quality (methods used to assess quality, interpretations
of quality assessments, methods used to incorporate quality assessments in
review), (5) assessment of the ability to combine primary studies
(statistical methods, clinical criteria used to judge the ability to combine
studies), and (6) statistical methods for data synthesis.
Statistical Analysis
Our primary analyses were descriptive. We used proportions to describe
categorical variables and medians to describe continuous variables. We used
intraclass correlation coefficients to measure the level of agreement among
the
reviewers18,19.
The level of agreement was classified, according to the system described by
Landis and Koch, as slight (0 to 0.20), fair (0.21 to 0.40), moderate (0.41 to
0.60), substantial (0.61 to 0.80), or almost perfect
(>0.80)20. All
analyses were conducted with use of SPSS 14.0 (SPSS, Chicago, Illinois).
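The agreement categories above can be written as a small lookup. The following is a minimal sketch in Python rather than the SPSS procedure that was actually used. The function name `landis_koch` and the label returned for negative coefficients are illustrative assumptions; the cut-points are those listed in the text.

```python
def landis_koch(coefficient: float) -> str:
    """Return the Landis and Koch agreement label for a coefficient.

    Cut-points follow the ranges quoted in the text. Negative values fall
    outside those ranges, so the label used for them here is an assumption.
    """
    if coefficient < 0:
        return "below listed range"  # assumption: not covered by the text
    if coefficient <= 0.20:
        return "slight"
    if coefficient <= 0.40:
        return "fair"
    if coefficient <= 0.60:
        return "moderate"
    if coefficient <= 0.80:
        return "substantial"
    return "almost perfect"


# Point estimates reported for the QUOROM checklist and the Oxman and Guyatt index
for name, icc in [("QUOROM", 0.94), ("Oxman and Guyatt", 0.83)]:
    print(name, icc, landis_koch(icc))
```

Both reported point estimates fall in the highest category, whereas the lower confidence limit for the Oxman and Guyatt index (0.65) would be classified as substantial.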
Eligible Systematic Reviews
We identified twenty-one systematic reviews in our initial search with use
of MEDLINE (OVID and PubMed), CDSR, and EMBASE. In addition, two more
systematic reviews were identified by means of a "related articles
search" in PubMed. Ten articles were eliminated because of duplication
or titles that were not relevant to our study. Furthermore, two more articles
were removed from our study (one because it was not published in the English
language and the other because it was not a formal systematic review), leaving
eleven systematic reviews that fulfilled our inclusion criteria
(Fig.
2)21-31.
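The selection counts reported above can be checked with simple arithmetic. The short Python sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
# Counts taken from the text; variable names are illustrative only.
initial_search = 21               # MEDLINE (OVID and PubMed), CDSR, and EMBASE
related_articles = 2              # PubMed "related articles" search
duplicates_or_irrelevant = 10     # eliminated for duplication or irrelevant titles
not_english = 1                   # not published in the English language
not_formal_systematic_review = 1  # not a formal systematic review

identified = initial_search + related_articles
included = (identified - duplicates_or_irrelevant
            - not_english - not_formal_systematic_review)

print(identified, included)
```

The result, eleven included reviews, matches the number stated in the text.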
Author Rationale for Repeating the Systematic Review
Of the nine systematic reviews with a sufficiently recent publication date
and literature search date, only six cited previously published systematic
reviews on the same topic. Of these six reviews, two cited all
systematic reviews that were available at that time
(Table I).
The rationale for repeating the systematic review was provided by the
authors of six reviews, whereas the authors of one review did not provide a
rationale despite citing a previous overlapping review. The eleven eligible
reviews aimed to compare hamstring graft with bone-patellar tendon-bone graft
for reconstruction of the anterior cruciate ligament. However, the reviews focused
on different outcomes (Table
II).
Literature Search and Databases Used
Table III gives details
about the included reviews and the respective databases that were searched.
One review31
involved a search of five databases, one
review29 involved a
search of four databases, two
reviews27,28
involved a search of three databases, and the remaining seven involved a
search of only one database. MEDLINE was searched in all reviews, CDSR was
searched in three reviews, and EMBASE and CINAHL were searched in one review
each. Other electronic databases that have not already been mentioned,
including WebSPIRS, Science Citation Index, Current Contents
database31; Pascal,
Herasmus28; and
InfoTrac29, were
searched in three reviews.
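As an illustration of the descriptive approach (proportions and medians), the database counts reported above can be summarized as follows. This is a sketch in Python, not the original SPSS analysis. The list is reconstructed from the counts in the text, and the variable names are assumptions.

```python
from statistics import median

# Number of databases searched per review, as reported in the text:
# one review searched five, one searched four, two searched three,
# and the remaining seven searched one.
databases_per_review = [5, 4, 3, 3] + [1] * 7

n_reviews = len(databases_per_review)
median_databases = median(databases_per_review)
multi_db = sum(1 for n in databases_per_review if n > 1)
proportion_multi_db = multi_db / n_reviews

print(n_reviews, median_databases, multi_db, round(proportion_multi_db, 2))
```

On these counts the median review searched a single database, and four of the eleven reviews searched more than one.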
Inclusion of newer studies is dependent on the last date that the
literature search was performed and the type of eligible primary studies
(Table I). The total number of
primary studies used in the eleven eligible systematic reviews was
seventy-nine. The number of primary studies included in each review ranged
from two25 to
thirty-five26, with
a median of nine studies being cited. Seven reviews included only randomized
controlled trials as primary trials. Four reviews included both randomized
controlled trials and case series (Table
III)22,23,26,28.
The seventy-nine cited primary studies and the eleven reviews are presented in
a cross table in the Appendix to show the overlap in primary studies
included.
Systematic Review Results
Three reviews favored the patellar tendon graft for stability, and one
favored the hamstring graft. Six reviews favored the hamstring graft for
anterior knee pain, and none favored the patellar tendon graft. Three articles
favored the hamstring graft for range of motion in flexion and extension, and
two articles favored the hamstring graft for extension but the patellar tendon
graft for flexion. The Appendix provides details about the direction of
conclusions and commonly used outcome measures.
Compliance with the QUOROM Statement: Quality of Reporting (Assessor
Agreement and Results)
Agreement was almost perfect (intraclass correlation coefficient, 0.94; 95%
confidence interval, 0.88 to 0.98) among the three assessors for the QUOROM
checklist. QUOROM scores ranged from a high score of 18 to a low score of 5,
with a median score of 12 (Table
IV). Detailed information on each separate item is provided in the
Appendix.
Internal Validity: Oxman and Guyatt Index (Assessor Agreement and
Results)
Agreement was almost perfect among the three assessors for the Oxman and
Guyatt index (intraclass correlation coefficient, 0.83; 95% confidence
interval, 0.65 to 0.95). The Oxman and Guyatt score ranged from 1 to 7
(maximum possible score, 7), with a median score of 2
(Table IV). Seven reviews were
found to have "major flaws" (score, 1 or 2) as defined by the
index. Detailed results of the Oxman and Guyatt index are presented in the
Appendix.
Assessment of Sources of Differences Between Primary Studies
(Heterogeneity)
Sources for study heterogeneity were assessed in six systematic reviews
(Tables IV and
V). Subgroup, or sensitivity,
analysis was formally performed in only
two23,31
(Table V). However, most
reviews did discuss possible sources of differences between primary studies.
Differences in rehabilitation protocols were discussed in ten, the type of
hamstring graft (two or four-strand) in eight, and the type of graft fixation
in ten. The type of participant (athlete or nonathlete) was discussed but not
formally analyzed in five systematic reviews
(Table V).
Analysis of the Included Reviews with Use of the Jadad Decision
Algorithm6
Table VI provides details on
the results on each step of the Jadad algorithm. Step C and step D strongly
differentiated between the overlapping reviews of our interest. With use of
the Jadad algorithm, it was possible to select two high-quality
reviews28,31
in four steps.
Key Findings
In a review of eleven systematic reviews on the same topic of anterior
cruciate ligament repair with either hamstring or bone-patellar tendon-bone
grafts, we found that (1) new "overlapping" systematic
reviews were conducted without citation of previously published reviews on the
same topic, (2) the quality of reporting varied among the eleven overlapping
systematic reviews as reflected in the QUOROM score, (3) we were able to
identify, with use of our proposed algorithm, two valuable high-quality
reviews that can help clinicians in decision-making, and (4) sensitivity
analysis to identify variables influencing the results was used
infrequently.
Previous Literature and Guide for the Evaluation of Overlapping
Systematic Reviews
Our findings suggested that the majority of systematic reviews had
methodological flaws. Utilizing the same validated Oxman and Guyatt index,
Bhandari and colleagues found similar limitations in the scientific quality of
orthopaedic
meta-analyses1. Our
study of eleven overlapping nonpharmaceutical reviews concurred with the
previous report1
that showed poor quality, mainly among nonpharmaceutical studies. However, the
present review did illustrate the possibility of selecting a high-quality
review from a set of reviews on the same topic.
Jadad et al. suggested a systematic approach when confronted with
discordant systematic
reviews6. More
recently, a "Users Guide" for the evaluation of systematic reviews
has been published, focusing on surgical
therapies32,33.
These guides are based on the validated Oxman and Guyatt
index12. Sources of
discordance among meta-analyses, as described by Jadad et
al.6, are (1) the
clinical question, (2) primary study selection and inclusion, (3) data
extraction from the primary studies, (4) assessment of primary study quality,
(5) assessment of the ability to combine primary studies, and (6) statistical
methods for data synthesis.
Our findings suggest that the included reviews differed mainly in terms of
their internal validity, with nine reviews being identified as unsuitable
tools to help in clinical decision-making. Jadad et al. proposed an algorithm
designed to guide readers through the evaluation of discordant systematic
reviews (Fig.
1)6. The
key issues are whether the reviews asked the same questions, whether the
reviews used the same primary studies, and, most importantly, whether the
reviews had the same quality. In our sample of overlapping reviews, the
quality was based on the quality of reporting (QUOROM) and the
internal validity (Oxman and Guyatt index) of the systematic reviews.
Only one review met all criteria of the Oxman and Guyatt
index31. The
results of our evaluation with use of the Jadad
algorithm6
illustrate how poor-quality reviews were eliminated. With use of the
algorithm, we noted the following findings.
Step A focuses on the study question of the review. This question comprises
three components: (1) the study population (adult patients with a torn
anterior cruciate ligament needing reconstruction), (2) the intervention under
examination (anterior cruciate ligament reconstruction with a bone-patellar
tendon-bone autograft or a hamstring tendon autograft), and (3) outcome
measures. The reviews in the present study differed with respect to the third
component. The primary focus of one
review28 was
isokinetic muscle strength. A form of stability testing (the pivot shift test,
Lachman test, and/or instrumented laxity testing) was the outcome of interest
in ten reviews. The results of a validated outcome instrument were the focus
of six reviews. Adverse events were evaluated in nine reviews.
Step B selects the review with the most appropriate question to provide the
reader with information to help in his or her clinical decision-making.
Step C evaluates whether the reviews used the same primary studies in their
analyses. Although there was overlap in the included studies, the reviews that
we evaluated included different primary studies.
Step D evaluates the differences in review quality with the Oxman and
Guyatt index and the QUOROM statement. Readers interested in isokinetic muscle
strength can be guided by the review by Dauty et
al.28, which
demonstrates good internal validity (Oxman and Guyatt score, 6) but an average
quality of reporting (QUOROM score, 9). Readers interested in stability and
adverse events can be guided by the review by Biau et
al.31 (Oxman and
Guyatt score, 7; QUOROM score, 18). The remainder of the reviews have limited
value for clinical decision-making because of their poor internal
validity.
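The selection logic of steps A through D can be sketched in code. The Python example below is a simplified illustration, not the Jadad algorithm itself. It assumes that a reader starts from one outcome of interest, sets aside reviews with major flaws (Oxman and Guyatt score of 1 or 2), and then prefers the higher Oxman and Guyatt score, with the QUOROM score as a tie-breaker. The `Review` class, the `select_review` function, and the tie-breaking rule are assumptions. Only the two reviews whose scores are stated in the text are included, and each is linked only to the outcomes that the text names for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    label: str
    outcomes: frozenset  # outcomes the text names for this review
    oqaq: int            # Oxman and Guyatt score (1 to 7)
    quorom: int          # QUOROM score


# Scores as reported in the text for the two methodologically sound reviews.
REVIEWS = [
    Review("Dauty et al. (ref. 28)",
           frozenset({"isokinetic muscle strength"}), oqaq=6, quorom=9),
    Review("Biau et al. (ref. 31)",
           frozenset({"stability", "adverse events"}), oqaq=7, quorom=18),
]

MAJOR_FLAW_MAX = 2  # scores of 1 or 2 indicate "major flaws" per the index


def select_review(outcome, reviews=REVIEWS):
    """Return the preferred review for an outcome, or None if none qualifies."""
    # Steps A and B (simplified): keep reviews that address the question.
    candidates = [r for r in reviews if outcome in r.outcomes]
    # Step D (simplified): drop reviews with major flaws, then rank by quality.
    candidates = [r for r in candidates if r.oqaq > MAJOR_FLAW_MAX]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.oqaq, r.quorom))


for question in ("stability", "isokinetic muscle strength", "patient satisfaction"):
    chosen = select_review(question)
    print(question, "->", chosen.label if chosen else "no suitable review")
```

Step C (whether the reviews used the same primary studies) is omitted from this sketch because it requires the cross table of primary studies in the Appendix.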
Only two reviews were methodologically
sound28,31.
Neither of these reviews could answer clinically relevant questions regarding
patient return to sports and patient satisfaction on the basis of a validated
patient-oriented outcome instrument, thus illustrating the limitations of the
current best evidence. On the other hand, these two high-quality reviews are
valuable in that they give busy clinicians an evidence-based
bottom line: hamstring tendon autografts are superior for
preventing anterior knee pain, and limited evidence suggests that bone-patellar
tendon-bone autografts provide better stability.
The lack of a heterogeneity assessment and of a rationale for the decision to
combine, or not to combine, eligible primary studies was another major
limitation. Despite the abundance of possible confounders such as primary
study quality, surgical expertise, patient activity level, and additional
injuries, formal sensitivity analysis was not performed in the majority of the
systematic reviews.
Strength and Limitations
Our review is strengthened by a literature search in duplicate and study
quality assessments in triplicate, with good interobserver agreement.
Furthermore, remaining disagreements were resolved by consensus. In addition, a previously
validated and published quality-assessment tool was
utilized1,6,9,12.
Our study has a few important limitations. First, although validated, the
rating systems used in our review were originally utilized for the evaluation
of systematic reviews of nonsurgical interventions. The appropriateness of
pooling data from the primary studies must be considered with caution.
Clinical expertise and detailed knowledge about surgical technique will help
reviewers to decide whether or not to pool these results. Without interviewing
the authors of the systematic reviews, inferences on clinical expertise cannot
be made. Specifically focused questions on clinical expertise and surgical
skills in the systematic review-evaluating instruments are lacking to date. In
general, the decision to combine study results is based on a similarity of
populations, interventions, outcomes, and study methodologies across primary
studies. Statistically, widely overlapping confidence intervals and similar
point estimates of treatment effect are additional criteria for pooling. The
decision to pool primary studies in the meta-analysis of studies of surgical
interventions can be difficult, as surgical techniques and surgical expertise
might differ among studies. This stands in contrast to pharmaceutical studies,
in which drug administration may have less variability in treatment schemes
than surgical intervention does. For example, we are often uninformed about
the details of the surgical
technique34. In the
case of anterior cruciate ligament reconstruction, for example, such details
include the use (or nonuse) of pretensioning and the exact orientation of the
graft tunnels. These seemingly small differences may influence clinical
outcome and, therefore, our ability to pool the results in a meta-analysis of
surgical trials.
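The statistical criteria for pooling mentioned above (overlapping confidence intervals and similar point estimates) can be illustrated with a short sketch. The Python example below is not taken from any of the included reviews. The first function checks whether a set of confidence intervals shares a common region. The second computes the I² statistic from Cochran's Q with inverse-variance weights, a commonly used heterogeneity summary that the source does not name. Both function names and all input values in the usage lines are hypothetical.

```python
def intervals_share_overlap(intervals):
    """True if every (lower, upper) confidence interval overlaps a common region."""
    lowers = [lower for lower, _ in intervals]
    uppers = [upper for _, upper in intervals]
    return max(lowers) <= min(uppers)


def i_squared(effects, std_errors):
    """I-squared (percent) from Cochran's Q with inverse-variance weights."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical effect estimates and standard errors for three primary studies
effects = [0.40, 0.55, 0.35]
std_errors = [0.10, 0.15, 0.12]
intervals = [(y - 1.96 * se, y + 1.96 * se) for y, se in zip(effects, std_errors)]

print(intervals_share_overlap(intervals), i_squared(effects, std_errors))
```

A statistical check of this kind supplements, but does not replace, the clinical judgment about whether surgical techniques and expertise were similar enough across studies to justify pooling.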
Second, although there was a large number of overlapping reviews on the
same topic, the number was still too small to allow statistical analysis.
Therefore, our analyses are primarily descriptive, and our findings may not be
generalizable to overlapping systematic reviews in other topic areas. However,
the principal issues raised by the present review likely reflect those in
other topic areas. Finally, our literature search could have missed relevant
publications.
Study Implications
Our study highlights the point that systematic reviews of randomized
controlled trials, although labeled as Level-I studies, can have limitations
that decrease their clinical
impact1,12,35.
The large variability in the methodological rigor of the eleven overlapping
reviews, the variable study designs of the included primary studies, and
differential inclusion of studies confuse the primary question: what is the
optimal approach to repairing the torn anterior cruciate ligament? A clear
approach to appraising the evidence is mandatory when multiple overlapping
reviews exist. According to the Jadad model, the review by Biau et
al.31 provides the
best estimate of the treatment effect given its methodological rigor, but
clinically important questions still remain unanswered. The Jadad
algorithm6 is a
useful tool for differentiating between overlapping reviews, as shown in the
present study. To evaluate surgical systematic reviews and meta-analyses,
emphasis should be placed on how differences in surgical techniques in the
primary studies were evaluated. Indeed, there is an abundance of randomized
controlled trials on anterior cruciate ligament reconstruction; thus, a high
level of evidence is available. To condense this wealth of information,
systematic reviews are very helpful in that they provide busy clinicians with
an evidence-based bottom line. The algorithm that we present may help readers
to select the best review available. Overall, the available evidence
compares favorably with that for other controversies in orthopaedics.
The issue of graft choice will ultimately be resolved with a large,
appropriately powered, randomized trial comparing alternative approaches to
anterior cruciate ligament reconstruction utilizing a validated
patient-oriented outcome instrument to measure a clinically relevant treatment
effect. Ideally, this would be a multi-center, expertise-based trial to
facilitate the generalizability of its results.
Tables showing study details are available with the electronic versions of
this article, on our web site (go to
the article citation and click on "Supplementary Material") and on
our quarterly CD-ROM (call our subscription department, at 781-449-9780, to
order the CD-ROM).