The Oxford Levels of Evidence are now widely used to classify scientific
publications in orthopaedic surgery journals, such as this one, in order to
promote critical appraisal of the data according to the relative rigor of
various study designs [1-5].
This process has been criticized for undervaluing useful data from studies
with lower levels of evidence [6].
Furthermore, studies with the highest level of evidence (prospective
randomized trials) may be overvalued, and serious study flaws may be
overlooked because of a designation of Level I or II.
To test the hypothesis that a high level of evidence does not ensure a
high-quality scientific study, we used quantitative measures of study quality
to evaluate prospective randomized controlled therapeutic trials (Level-I and
Level-II studies) of treatments of lateral epicondylitis. This clinical issue
was selected because of the relatively large number of Level-I and II studies
available.
A search of PubMed [7], with use of various combinations of the terms
"lateral," "epicondylitis," "tennis," "elbow," and "pain," identified 861
articles addressing lateral epicondylitis
spanning from September 1950 to December 2005. From this list, we identified
fifty-four therapeutic prospective randomized clinical trials. Two observers
(one with an MD degree and additional postgraduate training in epidemiology
and statistics [S.L.-C.] and the other a premedical student with no specific
epidemiological training [J.C.]) classified each trial according to three
systems: the Oxford Levels of Evidence [1], a modified Coleman Methodology
Score [8-10], and the revised CONSORT (Consolidated Standards of Reporting
Trials) score [11].
Each observer assigned an Oxford Level of Evidence. With this system,
Level-I therapeutic studies are defined as high-quality randomized controlled
trials demonstrating a significant difference or no significant difference but
with narrow confidence intervals. Level-II studies are lower-quality
randomized controlled trials (e.g., with <80% follow-up, no blinding,
improper randomization, and so on).
The Coleman Methodology Score was developed as a means of grading the
methodology of clinical studies on patellar and Achilles tendinopathy [8-10].
With this instrument, a maximum possible score of 100 points indicates that
chance, biases, and confounding factors had generally been avoided in the
study [8]. To
accommodate the topic and objectives of the present investigation, it was
necessary to modify both the categories and the point allocation of the
Coleman Methodology Score. These modifications consisted of (1) expanding the
inclusion criteria category to include a description of criteria and
enrollment rates; (2) adding a category to assess whether the statistical
power of the study and the methods used to calculate that power were reported;
(3) adding a category to assess the extent of randomization and whether it was
blinded; (4) expanding the patient follow-up category to include the
percentage of patients who were retained in the study; (5) adding a clinical
effect measurement category to account for whether the authors of the study
had reported the effect size, relative risk reduction, or absolute risk
reduction; and (6) adding categories to address each study's alpha error,
patient analysis, level of blinding, co-interventions, group comparability,
and number of patients to treat (Table I). This modified score was scaled to
result in values between 0 and 100 [1]. The
categorical rating was considered to be excellent if the score was 85 to 100
points, good if it was 70 to 84 points, fair if it was 55 to 69 points, and
poor if it was ≤54 points.
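The categorical thresholds described above can be illustrated with a minimal sketch. The function name `coleman_category` is hypothetical and is not part of the original scoring instrument. The sketch also assumes that any non-integer scaled score is assigned to the band whose lower bound it meets, which the text does not specify.

```python
def coleman_category(score: float) -> str:
    """Map a modified Coleman Methodology Score (0 to 100) to its rating.

    Thresholds follow the text: excellent 85-100, good 70-84,
    fair 55-69, poor 54 or less. Treating each threshold as a lower
    bound for non-integer scores is an assumption of this sketch.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must lie between 0 and 100")
    if score >= 85:
        return "excellent"
    if score >= 70:
        return "good"
    if score >= 55:
        return "fair"
    return "poor"


if __name__ == "__main__":
    # Illustrative scores only; these are not individual trial data.
    for value in (43, 59, 80, 85):
        print(value, coleman_category(value))
```
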
The revised CONSORT statement consisted of a twenty-two-item checklist
pertaining to the content within various sections of a report on a prospective
randomized trial [11].
The purpose of the CONSORT checklist is to provide a means with which to
compare the conduct of trials and the validity of their results [11]. For each
of the twenty-two items on the checklist, a trial was given 1 point if it met
the criteria of the CONSORT statement and 0 points if it did not. Thus, the
maximum possible score was 22 points. The categorical rating was considered to
be excellent if the score was 18 to 22 points, good if it was 13 to 17 points,
fair if it was 8 to 12 points, and poor if it was ≤7 points (Table II).
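The checklist scoring described above amounts to counting the items that were met and then assigning a category. The following is a minimal sketch; the names `consort_score` and `consort_category` are hypothetical, and representing each checklist item as a true/false value is an assumption made for illustration.

```python
from typing import Iterable

CONSORT_ITEMS = 22  # number of items in the revised CONSORT checklist


def consort_score(items_met: Iterable[bool]) -> int:
    """Return 1 point per checklist item met, 0 otherwise (maximum 22)."""
    items = list(items_met)
    if len(items) != CONSORT_ITEMS:
        raise ValueError(f"expected {CONSORT_ITEMS} items, got {len(items)}")
    return sum(1 for met in items if met)


def consort_category(score: int) -> str:
    """Map a CONSORT score to the categorical rating used in the text."""
    if not 0 <= score <= CONSORT_ITEMS:
        raise ValueError("score must lie between 0 and 22")
    if score >= 18:
        return "excellent"
    if score >= 13:
        return "good"
    if score >= 8:
        return "fair"
    return "poor"


if __name__ == "__main__":
    # Hypothetical trial that meets 12 of the 22 items.
    example = [True] * 12 + [False] * 10
    total = consort_score(example)
    print(total, consort_category(total))
```
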
The Student t test was used to assess the interobserver reliability of the
scoring systems by testing the differences in the continuous scores. The kappa
statistic was used to evaluate the interobserver reliability of the
categorical ratings. The kappa statistic is a chance-corrected proportion of
agreement (i.e., reliability) calculated when independent observers make
categorical classifications. A categorical rating of the reliability was
assigned with use of the benchmarks for the kappa statistic described by
Landis and Koch [12].
Descriptive frequency statistics were obtained for each nominal variable and
expressed as percentages for each observer and as a global average. Standard
descriptive statistics with measures of dispersion were also applied to
continuous variables. Statistical analysis was performed with use of SPSS
version 14.0 software (SPSS, Chicago, Illinois).
Oxford Levels of Evidence
There was substantial agreement in the classifications according to the
Oxford Levels of Evidence (κ = 0.73, p < 0.01). Observer 1 believed
that five of the fifty-four publications fulfilled the criteria to be
considered Level I and forty-nine could be considered Level II, whereas
observer 2 considered three of the publications to be Level I and fifty-one to
be Level II.
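As an illustration of how a chance-corrected agreement of this kind is computed, the following sketch implements Cohen's kappa for two observers and applies the Landis and Koch benchmarks. The study's statistics were obtained with SPSS, so this code is not the authors' method. The function names are hypothetical. The cross-tabulation in the example is also an assumption: the article reports only each observer's totals, and the sketch supposes that the three trials rated Level I by observer 2 were all among the five rated Level I by observer 1.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(ratings_a: Sequence[str], ratings_b: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Proportion of items on which the two observers agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each observer's marginal counts.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def landis_koch(kappa: float) -> str:
    """Label a kappa value with the Landis and Koch benchmark categories."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


if __name__ == "__main__":
    # Assumed cross-tabulation: 3 trials rated Level I by both observers,
    # 2 rated Level I by observer 1 only, 49 rated Level II by both.
    observer_1 = ["I"] * 5 + ["II"] * 49
    observer_2 = ["I"] * 3 + ["II"] * 51
    k = cohen_kappa(observer_1, observer_2)
    print(round(k, 2), landis_koch(k))
```

Under that assumed cross-tabulation, the two observers agree on 52 of 54 trials and the sketch yields a kappa of approximately 0.73, which falls in the substantial band of the Landis and Koch benchmarks.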
Modified Coleman Methodology Score
There was no significant difference between the mean modified Coleman
Methodology Scores of the two observers (59.8 compared with 59.0 points; p =
0.61). The two observers had substantial agreement, according to the standards
of Landis and Koch [12], with regard
to their assignment of categorical ratings based on the modified Coleman
Methodology Score (κ = 0.73; p < 0.01).
Combining the assessments of both observers resulted in a mean total score
of 59 points (range, 43 to 80 points) on the scale of 100 points. Observer 1
and observer 2 assigned a good categorical rating of quality to 13% and 11% of
the studies, respectively; a fair rating to 59% and 56%; and a poor rating to
28% and 33%. None of the studies were rated excellent.
Both observer 1 and observer 2 found that many articles (61% and 63%,
respectively) failed to describe how patients had been enrolled or the
percentage of enrollment. Both found an enrollment rate of <80% in many of
the trials (15% according to both observers). Description of the enrollment
process was often absent (in 15% of the trials according to both observers).
Another shortcoming of the reports on the trials is that most (52% and 54%)
failed to state how the power and the sample size were calculated. A
substantial percentage of the articles (35% and 33%) stated that the power
level was >80%, but did not explain the methodology used to obtain this
value, and a power level of >90% was reported in very few trials.
Similarly, 91% of the reports did not provide the number of patients needed
for successful treatment (also known as the "number needed to
treat" to see one treatment success). Effect sizes were not reported in
69% of the trials and were reported to be <50% in 26% of them. There was no
description of a rehabilitation protocol in the majority of the reports (96%
and 94%), and only a few reports evaluating surgical techniques (2% and 3%)
described postoperative management adequately.
The reporting of patient follow-up, blinding, and sample size may not be
considered areas of failure in these trials, but these categories were
certainly not strengths of the trials in general. The most common score in the
follow-up category was 4 points (assigned to 48% and 51% of the trials), which
corresponded to a follow-up period of more than twenty-four months and a
<80% retention rate, a six to twenty-four-month follow-up and an 80% to 90%
retention rate, or a less than six-month follow-up period and a >90%
retention rate. The next most common score was 6 points (assigned to 26% and
32% of the trials), whereas scores of 0 or 2 points were less frequent. Both
observers found the majority of the trials (59% and 52%) to be double-blinded.
While both observers also found many trials to be single-blinded (22%), almost
as many trials did not have any type of blinding whatsoever (19%). Regarding
the sample-size criteria, most trials (44% and 46%) had more than sixty
patients, many had forty-one to sixty patients (28% and 26%) or twenty to
forty patients (22%), and a smaller percentage (6%) had fewer than twenty
patients.
There were other categories, however, in which the majority of the trials
scored well according to the modified Coleman Methodology Score. Most trials
(78% and 80%) were randomized and blinded, some (17% and 19%) were randomized
and not blinded, and only a few (6% and 2%) were not randomized. In the
majority of the trials (100% and 94%), the analysis was performed on an
intention-to-treat basis. Also, treatment was usually adequately described (in
96% and 94% of the trials). Treatment groups were comparable in most trials
(89% and 85%) and were partially comparable in the remainder. In the majority
of the studies (98% and 93%), outcome was reported by the recruited patients,
and an independent investigator's assessment was the sole measure of outcome
in only approximately 3% of the trials. The majority of the articles (74% and
76%) stated that no co-interventions were permitted during the trial period.
However, co-interventions were observed in some trials (13% and 19%). A
smaller percentage of the studies (13% and 6%) did not satisfy the criteria
for similarity of treatment. Most reports (93% and 85%) stated that the level
of significance had been set at 0.05, whereas some (6% and 11%) did not
provide any alpha error and even fewer (2% and 4%) stated that the level of
significance had been set at 0.01.
Revised CONSORT Statement
The difference between the average CONSORT scores assigned by the two
observers was not significant (12.0 compared with 11.6 points; p = 0.55).
Interobserver agreement for the revised CONSORT statement was moderate
according to the Landis and Koch benchmarks (κ = 0.53; p < 0.01).
The two observers' combined assessments yielded a mean total score of 12
points (12 and 12 points). The maximum scores assigned by observer 1 and
observer 2 were 19 and 20 points, respectively, and the minimum scores were 6
and 5 points. Most trials were given either a fair (56% and 50%) or a good
(32% and 32%) categorical rating. A smaller percentage of trials were
considered to be excellent (7% and 6%) or poor (6% and 13%).
In the majority of the reports reviewed in this study, the Introductory
section met the guidelines of the CONSORT checklist: most included an adequate
title and abstract (83% and 87%) as well as complete background information
(87% and 85%).
The reporting in the Methods sections was not as consistent. Outcome
variables and objectives were clearly defined and reported in almost all of
the articles (98% and 100%), as were the details of the interventions in the
treatment groups, including how and where the interventions took place (91%
and 94%). However, the majority of the articles (72% and 74%) did not report
sample size sufficiently because most did not provide the methodology or
statistics behind the determination of this calculation. The eligibility
criteria for participants were not adequately reported in >60% of the
articles (67% and 65%), as these articles did not include information
regarding the specific settings and locations where data were collected.
Approximately one-third of the reports fulfilled the revised CONSORT criteria
in this category. Most (91% and 89%) did not adhere to the CONSORT guidelines
for describing their specific objectives and hypotheses.
More than half of the articles (57% and 59%) did not adequately describe
the methods used to generate the random allocation sequence for participants.
Similarly, on the average, almost 85% of the articles (82% and 85%) did not
provide sufficient information with regard to implementation of randomization
and almost 90% (89% and 91%) did not fully describe blinding. While the
authors of most articles reported on blinding generally, details on measuring
the success of blinding, which is required by the revised CONSORT statement,
were rarely included. Approximately 65% of the articles (67% and 63%) did not
sufficiently describe concealment of allocation from participants. Regarding
methods and randomization, >80% of the articles (87% and 89%) did not
include adequate information about the statistical methods used for comparing
treatment groups.
As in the Methods sections, the Results sections of the reports rarely
included all of the details required by the revised CONSORT statement. Most
were adequate in terms of reporting the number of participants included in the
analysis of each treatment group (85% and 83%), baseline data (68% and 68%),
and outcomes and estimations (80% and 69%). However, only about half of the
articles included a description of the flow of participants through each stage
of the study (54% and 46%) and the adverse effects of therapy (56% and 50%).
Both observers found that the reporting of the results of the trials was quite
poor in two important areas. On the average, the location and dates of patient
recruitment were not reported in >90% of the articles (91% and 94%) and
ancillary analyses were not provided in >80% (80% and 85%).
The interpretation of results was adequately reported in about
three-fourths (70% and 79%) of the Discussion sections of the articles.
Similarly, both observers agreed that 78% of the Discussion sections presented
overall evidence or the general interpretation of results in the context of
available literature according to CONSORT guidelines. However, on the average,
less than half of the Discussion sections (46% and 52%) adequately presented
any external validation of results.
To determine the most effective treatments for our patients, we follow the
principles of evidence-based medicine and turn to prospective randomized
trials as our gold standard. This study was performed to measure the relative
quality of prospective randomized clinical trials. The modified Coleman
Methodology Score and the revised CONSORT statement proved to be reliable for
the evaluation of trial quality.
The vast majority (>90%) of the published therapeutic clinical trials on
lateral epicondylitis were considered to be of low quality, or Level II,
according to our application of the Oxford Levels of Evidence. The modified
Coleman Methodology Score was used to further quantify the inadequacies of
published clinical trials for the treatment of lateral epicondylitis, with
>87% of the studies being rated fair or poor. The strength of these trials
lies in the reporting of the alpha error, randomization, intention-to-treat
analysis, treatment description, group comparability, and outcome measures,
rather than in areas that indicate a higher level of reporting quality. The
authors of these therapeutic trials frequently failed to describe how patients
were enrolled and the percentages of enrollment as well as the methodology of
determining a power level and sample size. Other areas of deficiency included
follow-up, sample size, and blinding.
According to the CONSORT statement, >62% of the clinical trials were
rated fair or poor. Again, areas of strength included the most straightforward
aspects of the paper, including the introductory details, the descriptions of
the interventions under evaluation and the statistical methods, analysis and
reporting of the results, interpretation of findings, and analysis of overall
evidence. Areas of weakness included inadequate reporting of eligibility
criteria, allocation concealment, patient recruitment, ancillary analysis, and
sample size determination. The phenomenon of providing only basic information
without including the details of methodology was also prevalent in areas such
as sequence generation, implementation of the randomization, blinding, and
follow-up protocol. We also found room for improvement with regard to
reporting of the flow of participants through the stages of the trial, the
inclusion of baseline data in the Results section, outcomes and estimations,
adverse effects, and external validation of the results.
Stratifying orthopaedic scientific reports according to levels of evidence
implies a degree of respect for Level-I and II trials that may not always be
merited. Our data suggest that the quality of these higher-level
trials—at least those involving lateral epicondylitis—varies
substantially and is often unsatisfactory. There may be other criteria that
correlate with trial quality. Many trials were reported in relatively obscure
journals such as the Journal of Manipulative and Physiological
Therapeutics, Photomedicine and Laser Surgery, Clinical Rehabilitation,
Prosthetics and Orthotics International, The Surgeon: Journal of the Royal
Colleges of Surgeons of Edinburgh and Ireland, Journal of Traditional Chinese
Medicine, and Australian Family Physician, among others. Others,
such as studies evaluating extracorporeal shock wave therapy [13-17], laser
treatment [18,19], and anti-inflammatory medications [20-22] or other
medications [23,24],
were performed under the influence of a manufacturer or a strong advocate of
the technique. Positive findings in therapeutic clinical trials have been
associated with the presence of commercial funding [25,26].
There is no doubt that performance of a trial with a well-defined study
question and a prospectively applied protocol, with an appropriate control
group and blinded treatment allocation and assessment, is the best way to
overcome many of the shortcomings of the current literature. On the other
hand, the use of this study design ensures neither quality science nor
internally and externally valid data. This report serves to emphasize the need
to critically evaluate all scientific reports regardless of study design or
level of evidence.