You are an orthopaedic oncologist who has recently excised a high-grade soft-tissue sarcoma from the thigh of a fifty-year-old woman. This patient does not have metastatic disease, has had preoperative radiation, and is now in your clinic for her second postoperative visit. After some wound-healing issues, her wound is now healed and she would like to know if her chance of long-term survival would increase if she underwent adjuvant chemotherapy. Your sarcoma multidisciplinary team generally does not offer chemotherapy to patients with localized disease. However, you decide to consult the literature.
You find a meta-analysis by the Sarcoma Meta-analysis Collaboration (SMAC) that was published in The Lancet in 1997 and, in 2000, by the Cochrane Collaboration1. The meta-analysis includes data from fourteen trials involving 1568 patients with localized soft-tissue sarcoma who were randomized to chemotherapy or to no chemotherapy. The article concludes that “there is evidence that doxorubicin-based chemotherapy after initial treatment reduces recurrence, either at the original site or elsewhere in the body…[and] seems to increase the length of time patients live, but this is less certain.” On the basis of nine subgroup analyses of treatment effect, “greater benefit was seen in men and those whose tumor originated in a limb.”
Given the conclusion of the meta-analysis, would you choose not to offer chemotherapy because your patient is female?
Information obtained from randomized controlled trials is considered the best available evidence for determining a therapeutic treatment effect. However, randomized trials report the average effect in a representative sample of a patient population. To provide data on specific subsets of patients, investigators often perform subgroup analyses2-4. Information from subgroups is potentially helpful for making treatment decisions for individual patients, and a large percentage of clinical trials include subgroup analyses with the goal of meeting clinicians’ needs for patient-specific information5,6. Subgroup analyses have been reported in as many as 70% of trials in the medical literature and 38% in the surgical literature7,8. However, if subgroup analyses are done incorrectly, the data may be misleading, because observed differences in treatment effect may arise from chance alone. In the current report, we outline the correct strategies for conducting subgroup analyses in randomized controlled trials and apply these criteria to the meta-analysis described above. We also present guidelines for interpreting the results of subgroup analyses (Fig. 1). Each section outlines specific criteria that should be examined before drawing conclusions from subgroup analyses.
It is important that subgroups be identified a priori, at the beginning of the trial. Candidate subgroups can be drawn from previous research reports and should have biological, pathophysiologic, or genetic plausibility3,7. This applies equally to meta-analyses, in which the subgroup hypotheses should be generated at the inception of the study. There are exceptions, in which subgroup effects that were not anticipated at the start of a trial or study turn out to be real. Nevertheless, a subgroup hypothesis that is specified a priori is more likely to reflect a true treatment effect than chance alone. In contrast, a post hoc hypothesis generated after the study has been completed arises from findings that were not necessarily expected on the basis of previous research and therefore has a higher chance of representing a spurious result9,10.
Subgroup analyses based on characteristics measured after randomization are also vulnerable to bias, because such subgroups may differ in prognostic characteristics, which can create an apparent subgroup difference7. In addition, membership in such a subgroup may itself be affected by the intervention7. Therefore, not only should the potential subgroups be established before a trial or study begins, but the anticipated direction of the treatment effect should also be hypothesized and the statistical methods that will be used to analyze the subgroups should be specified.
Post hoc analyses are common in the surgical literature; in one report, only 9% of fifty-four subgroup analyses within twenty-seven surgical trials were planned at the start of the trial8.
Current Scenario
The SMAC investigators did not indicate that the subgroups were identified a priori, so we infer that the subgroup analyses were post hoc and hypothesis-generating. The investigators also indicated that data were not available for all patients in every subgroup analysis. This supports the conclusion that the subgroup hypotheses were generated post hoc.
The next step is to determine how many subgroup hypotheses were tested in total. Investigators commonly explore a large data set with multiple subgroup hypotheses in search of a significant predictor of treatment effect, and some surgical trials have examined as many as twenty-three subgroups8. The risk of identifying a false-positive “significant” subgroup effect increases with the number of subgroups examined. In other words, the more subgroup hypotheses are tested, the greater the likelihood of finding a subgroup effect by chance alone11. The risk of false-negative results is also high, because most studies are underpowered to detect a treatment effect within subgroups5.
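To show how quickly the chance of a spurious “significant” subgroup grows, the short sketch below computes the probability that at least one of k subgroup tests reaches p < 0.05 when there is no true subgroup effect. It assumes that the tests are independent. Real subgroup tests are usually correlated, so the figures are approximate. The function name is our own illustration.

```python
def prob_any_false_positive(num_subgroups: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_subgroups` independent tests is
    'significant' at level `alpha` when no true subgroup effect exists.

    Assumes independent tests (an illustrative simplification)."""
    if num_subgroups < 1:
        raise ValueError("num_subgroups must be at least 1")
    # Complement of "every test is correctly non-significant".
    return 1.0 - (1.0 - alpha) ** num_subgroups


if __name__ == "__main__":
    # 9 matches the SMAC study; 23 matches the largest surgical example cited.
    for k in (1, 5, 9, 23):
        print(f"{k:>2} subgroups: {prob_any_false_positive(k):.2f}")
```

Under this simplification, nine subgroup tests carry roughly a 37% chance of at least one false-positive result. Twenty-three tests carry roughly a 69% chance.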
Another challenge at this step is that investigators do not always disclose the number of subgroup hypotheses tested and may report only those that were found to be significant. Investigators should adjust the level of significance for multiple subgroup analyses, and there are a number of statistical methods for doing so. One method is to divide the significance level by the number of subgroup analyses, the so-called Bonferroni method12. This method is conservative and has other drawbacks, and alternative methods are available; a discussion of these methods is beyond the scope of the current report. Suffice it to say that clinicians can be misled if only the significant results are presented and the number of hypotheses tested is not provided.
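A minimal sketch of the Bonferroni adjustment follows. The subgroup names and p values are hypothetical. The example shows why the total number of hypotheses tested matters. A result that fails the adjusted threshold when all five tests are counted can appear to pass if only the two reported tests are counted.

```python
from typing import Dict


def bonferroni(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Return, for each test, whether it stays significant after a Bonferroni
    correction: each p value is compared with alpha / m, where m is the
    number of tests in `p_values`."""
    if not p_values:
        raise ValueError("at least one p value is required")
    threshold = alpha / len(p_values)
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical example: five subgroup hypotheses were actually tested...
all_tested = {
    "subgroup_A": 0.004,
    "subgroup_B": 0.02,
    "subgroup_C": 0.20,
    "subgroup_D": 0.45,
    "subgroup_E": 0.60,
}
# ...but only the two nominally "significant" ones were reported.
reported_only = {k: all_tested[k] for k in ("subgroup_A", "subgroup_B")}

print(bonferroni(all_tested))     # threshold 0.05 / 5 = 0.01
print(bonferroni(reported_only))  # threshold 0.05 / 2 = 0.025
```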
Current Scenario
A total of nine subgroups were analyzed in the SMAC study: patient age, patient sex, disease status at the time of presentation, disease site (extremity, trunk, or uterus), tumor type, tumor grade, tumor size, type of surgical excision, and the use of radiation therapy. A significant benefit in terms of overall survival was identified in the subgroup analyses for age (thirty-one to sixty years), sex (male), disease site (extremity), and tumor size (5 to 10 cm). The large number of subgroup analyses and the lack of statistical adjustment for multiple testing suggest that these subgroup treatment effects are unlikely to represent real treatment effects and may have arisen by chance alone.
Subgroup differences inferred from comparisons between studies are less likely to reflect true differences in effect than those inferred from comparisons within studies, because between-study comparisons are indirect9,13. The patient populations, drugs used, co-interventions, and outcomes may differ from study to study. Therefore, the “control group” may be compared with one “treatment group” in one study but with a different “treatment group” in another. Within-study comparisons are more direct, because co-interventions are controlled and outcome criteria are identical. However, Song et al. suggested that, when direct comparisons are not available, adjusted indirect comparisons may approximate them14.
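The adjusted indirect comparison assessed by Song et al. is commonly computed with the Bucher approach. The sketch below assumes that approach and uses hypothetical odds ratios. Treatment A versus a common comparator C comes from one trial, and treatment B versus C comes from another. The log odds ratios are subtracted and their variances are added. The indirect estimate is therefore less precise than either trial alone, which is one reason indirect evidence is weaker.

```python
import math
from typing import Tuple

Z_95 = 1.959964  # two-sided 95% standard normal quantile


def adjusted_indirect_or(
    or_ac: float, se_log_ac: float, or_bc: float, se_log_bc: float
) -> Tuple[float, float, Tuple[float, float]]:
    """Bucher-style adjusted indirect comparison of A vs B through a common
    comparator C. Inputs are odds ratios and standard errors of the log ORs.
    Returns (OR_AB, SE of log OR_AB, 95% CI of OR_AB)."""
    log_ab = math.log(or_ac) - math.log(or_bc)          # difference on log scale
    se_ab = math.sqrt(se_log_ac ** 2 + se_log_bc ** 2)  # variances add
    ci = (math.exp(log_ab - Z_95 * se_ab), math.exp(log_ab + Z_95 * se_ab))
    return math.exp(log_ab), se_ab, ci


# Hypothetical inputs: trial 1 gives A vs C, trial 2 gives B vs C.
or_ab, se_ab, (lo, hi) = adjusted_indirect_or(0.70, 0.15, 0.90, 0.20)
print(f"indirect OR A vs B = {or_ab:.2f} (95% CI {lo:.2f} to {hi:.2f}), SE {se_ab:.2f}")
```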
Current Scenario
The SMAC investigators pooled the individual patient data from the included studies. As a result, the “treatment” arm consisted of patients managed with various chemotherapeutic regimens, albeit all doxorubicin-based. The comparisons were therefore not consistent across studies, and the subgroup differences were inferred from between-study rather than within-study differences.
A small subgroup difference is less plausible than a large one. The more the treatment effect in a subgroup differs from that in the overall study population, the more plausible the subgroup difference becomes. However, when sample sizes are small, a large apparent difference can arise by chance11. In fact, when the number of patients in a subgroup is small, a large treatment effect will more often than not be found in error.
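To illustrate how small subgroups produce large chance effects, the sketch below estimates the range of odds ratios expected from chance alone (95%) when the true odds ratio is 1. It assumes equal arms with the same event rate. It uses Woolf's standard error of the log odds ratio, evaluated at the expected cell counts. The subgroup sizes and the 50% event rate are hypothetical.

```python
import math
from typing import Tuple


def chance_or_range(n_per_arm: int, event_rate: float, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% range of odds ratios produced by chance alone when the
    true OR is 1, with `n_per_arm` patients per arm and the same `event_rate`
    in both arms (Woolf standard error at the expected cell counts)."""
    events = n_per_arm * event_rate
    non_events = n_per_arm * (1.0 - event_rate)
    # Two arms, each contributing 1/events + 1/non_events to the variance.
    se_log_or = math.sqrt(2.0 * (1.0 / events + 1.0 / non_events))
    return math.exp(-z * se_log_or), math.exp(z * se_log_or)


lo20, hi20 = chance_or_range(20, 0.5)
lo500, hi500 = chance_or_range(500, 0.5)
print(f"20 per arm:  chance ORs roughly {lo20:.2f} to {hi20:.2f}")
print(f"500 per arm: chance ORs roughly {lo500:.2f} to {hi500:.2f}")
```

With twenty patients per arm, odds ratios from about 0.29 to 3.45 can arise with no true effect. With 500 per arm, the range narrows to about 0.78 to 1.28.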
Current Scenario
The subgroup analysis for sex yielded an odds ratio (OR) of 0.75 (95% confidence interval [CI], 0.60 to 0.93) in favor of chemotherapy in men, compared with an OR of 1.01 (95% CI, 0.82 to 1.25) in women. For extremity sarcomas, the OR was 0.80 (95% CI, 0.65 to 0.98) in favor of chemotherapy, whereas that for trunk sarcomas was 1.00 (95% CI, 0.66 to 1.52). These subgroup estimates were similar to that of the overall study population (OR, 0.89; 95% CI, 0.76 to 1.03, in favor of chemotherapy). This similarity makes the treatment-effect conclusions of the subgroup analyses less plausible.
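One informal way to see how close these subgroup estimates are to the overall estimate is to check whether the pooled OR of 0.89 lies inside each subgroup's 95% CI. The sketch below does this with the values quoted above. This is only a descriptive check, not a formal test for interaction.

```python
from typing import Dict, Tuple

# (OR, lower 95% CI, upper 95% CI) as quoted from the SMAC subgroup analyses.
SUBGROUPS: Dict[str, Tuple[float, float, float]] = {
    "men": (0.75, 0.60, 0.93),
    "women": (1.01, 0.82, 1.25),
    "extremity": (0.80, 0.65, 0.98),
    "trunk": (1.00, 0.66, 1.52),
}
OVERALL_OR = 0.89  # pooled OR for the whole study population


def ci_contains(ci_low: float, ci_high: float, value: float) -> bool:
    """True if `value` lies within the closed interval [ci_low, ci_high]."""
    return ci_low <= value <= ci_high


for name, (odds_ratio, lo, hi) in SUBGROUPS.items():
    print(f"{name:>9}: OR {odds_ratio:.2f} ({lo:.2f} to {hi:.2f}); "
          f"contains overall OR: {ci_contains(lo, hi, OVERALL_OR)}")
```

Every subgroup interval contains the overall estimate, which is consistent with the point that the subgroup effects do not clearly depart from the effect in the whole population.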
A treatment effect identified in a subgroup analysis of a single study is useful for generating a hypothesis. If the subgroup effect is consistent across many studies, however, the credibility of the conclusion increases substantially. In fact, consistent subgroup differences identified in a rigorous systematic review are considered highly credible9,11, provided that the subgroups and outcomes are defined identically across studies.
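Consistency of a subgroup effect across studies is often assessed with Cochran's Q and the I² statistic, which are applied to the within-study subgroup differences. An example of such a difference is each study's log ratio of odds ratios for men versus women. The sketch below is a minimal inverse-variance version. All study-level values are hypothetical, because the SMAC study did not report its subgroup differences study by study.

```python
from typing import Sequence, Tuple


def cochran_q(estimates: Sequence[float], std_errors: Sequence[float]) -> Tuple[float, float, float]:
    """Inverse-variance pooled estimate, Cochran's Q, and I^2 for a set of
    study-level estimates (e.g., within-study log ratios of odds ratios)."""
    if len(estimates) != len(std_errors) or len(estimates) < 2:
        raise ValueError("need at least two studies with matching standard errors")
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # I^2: share of variability beyond what chance (df) would explain, floored at 0.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical within-study subgroup differences (log ratio of ORs) and SEs.
consistent = cochran_q([-0.30, -0.25, -0.35, -0.28], [0.20, 0.25, 0.30, 0.22])
inconsistent = cochran_q([-0.60, 0.40, -0.50, 0.45], [0.15, 0.15, 0.15, 0.15])
print("consistent:   pooled %.2f, Q %.2f, I^2 %.2f" % consistent)
print("inconsistent: pooled %.2f, Q %.2f, I^2 %.2f" % inconsistent)
```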
Current Scenario
The individual patient data points from all fourteen studies were pooled in the SMAC study; therefore, it is not possible to determine if the subgroup differences were consistent across studies.
Although researchers often report statistical significance within individual subgroups, it is more important to determine whether the difference in treatment effect between subgroups can be explained by chance alone. Investigators use a variety of statistical techniques to explore whether chance can explain apparent subgroup differences5,15; these are known as tests for interaction. For such a test, the null hypothesis is that the treatment effect does not differ between subgroups. The p value expresses how often a difference in treatment effect at least as large as the one observed would arise by chance alone if the null hypothesis were true. A p value of 0.04, for example, corresponds to a 4% probability, which is conventionally considered acceptable, whereas a p value of 0.18 corresponds to an unacceptable 18% probability. Previous reports have suggested that these p values are not always reported and that many studies do not include tests for interaction5,6. In addition, significance levels for tests for interaction should be adjusted for multiple-hypothesis testing. Sun et al. suggested that a subgroup hypothesis becomes more credible as the p value gets smaller, and that we should not rely on a fixed significance threshold7. They suggested that, with p values greater than 0.1, we should be skeptical of a true subgroup effect and that, as p values reach 0.001 or less, we can be more confident in the subgroup hypothesis5,7.
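When only subgroup estimates and their CIs are published, a reader can approximate a test for interaction. The approach recovers each standard error from the width of the 95% CI on the log scale. It then compares the two log odds ratios with a z test, as described by Altman and Bland. The sketch below assumes that the two subgroup estimates are independent and that the CIs are symmetric on the log scale. It is not the method the SMAC investigators used. It is applied here to the sex subgroup ORs quoted in this article.

```python
import math
from statistics import NormalDist
from typing import Tuple

Z_95 = NormalDist().inv_cdf(0.975)  # about 1.96


def log_or_and_se(odds_ratio: float, ci_low: float, ci_high: float) -> Tuple[float, float]:
    """Log OR and its standard error recovered from a published 95% CI."""
    return math.log(odds_ratio), (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)


def interaction_test(sub1: Tuple[float, float, float], sub2: Tuple[float, float, float]) -> Tuple[float, float]:
    """Ratio of ORs (sub1 / sub2) and two-sided p value for the null
    hypothesis that the treatment effect is the same in both subgroups."""
    b1, se1 = log_or_and_se(*sub1)
    b2, se2 = log_or_and_se(*sub2)
    diff = b1 - b2
    z = diff / math.sqrt(se1 ** 2 + se2 ** 2)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return math.exp(diff), p


# Men vs women: (OR, lower CI, upper CI).
ratio, p = interaction_test((0.75, 0.60, 0.93), (1.01, 0.82, 1.25))
print(f"ratio of ORs {ratio:.2f}, p for interaction {p:.3f}")
```

This approximation gives a ratio of odds ratios of about 0.74 and a p value of about 0.055. That is in the same range as the interaction p value reported by the SMAC investigators for sex, and it falls in the zone that Sun et al. would treat with caution.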
Current Scenario
In the SMAC study, the p value for the statistical test of interaction was considered significant for men benefiting from chemotherapy (p = 0.049) and for extremity sarcomas (p = 0.029). However, the criteria set out by Sun et al. suggest that, at these values, the findings should be considered hypothesis-generating and worthy of further investigation. Indeed, the SMAC investigators themselves stated that “there was no clear evidence to suggest that any subgroup benefited more or less from adjuvant chemotherapy”.
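The sketch below applies the two ideas discussed earlier to the SMAC interaction p values. It applies a Bonferroni threshold for the nine subgroup variables, and it gives a graded reading in the spirit of Sun et al. It assumes one interaction test per subgroup variable, which may understate the number of comparisons actually made. The "hypothesis-generating" label for the middle band is our own shorthand for the reading given in the text.

```python
NUM_SUBGROUPS = 9  # subgroup variables analyzed in the SMAC study
ALPHA = 0.05

# Interaction p values reported by the SMAC investigators.
reported_interaction_p = {"sex (men)": 0.049, "site (extremity)": 0.029}


def sun_reading(p: float) -> str:
    """Graded reading of an interaction p value using the two cut-points
    quoted in the text (> 0.1 and <= 0.001); the middle band label is ours."""
    if p > 0.1:
        return "skeptical"
    if p <= 0.001:
        return "more confident"
    return "hypothesis-generating"


threshold = ALPHA / NUM_SUBGROUPS  # Bonferroni-adjusted level, about 0.0056
for name, p in reported_interaction_p.items():
    print(f"{name}: p = {p}, significant after Bonferroni: {p <= threshold}, "
          f"reading: {sun_reading(p)}")
```

Neither finding survives the adjusted threshold, and both fall in the band that warrants further investigation rather than a change in practice.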
The plausibility of a subgroup difference depends at least in part on external evidence9. Three types of external evidence can support a subgroup difference: observations in different populations, such as animal studies; observations of subgroup differences for similar interventions; and the results of studies of related outcomes (intermediary or surrogate outcomes). The strongest support comes from observations of related or intermediary outcomes, such as overall survival and disease-free survival. Observations from similar interventions (for example, different chemotherapeutic drugs) provide less support for drawing conclusions.
Current Scenario
There is ample evidence in the oncology literature to suggest that patients with extremity sarcomas have improved survival compared with those with retroperitoneal sarcomas16,17. It is believed that the late presentation and difficulty of complete resection associated with retroperitoneal sarcoma result in lower survival rates. It is therefore reasonable to conclude that patients with extremity sarcomas may benefit from adjuvant treatment because, at baseline, they have an improved chance of survival. However, there are no preclinical or clinical data to support the possibility that sarcomas that arise in men are more likely to respond to chemotherapy than those that arise in women or that survival rates are equal among men and women16,17.
Investigators tend to report significant findings, and post hoc analyses of multiple subgroups will often yield at least one significant subgroup treatment effect by chance alone. Erroneous conclusions based on such chance findings can have devastating effects on patient care. The decision to give credibility to a treatment effect identified in a subgroup analysis can be aided by examining the criteria outlined in this review. Relatively small, marginally significant results based on between-study differences or identified by a post hoc exploration of a single data set must be viewed with suspicion. However, some interactions have strong potential for clinical decision-making. These are large, clinically important interactions that are supported by both indirect and direct evidence and that have been independently tested, either in a new trial or in a meta-analysis, with a low probability of having arisen by chance. In the middle of the spectrum are inferences for which the criteria are only partially satisfied. New primary studies or meta-analyses would be needed to determine whether these subgroup differences can confidently be used to guide practice.
You decide that the conclusion that men benefit more from chemotherapy is erroneous. You inform your patient that chemotherapy may offer a marginal survival benefit, which must be weighed against its toxicities, and you offer her a referral to your medical oncology colleague.
1. Sarcoma Meta-analysis Collaboration (SMAC). Adjuvant chemotherapy for localised resectable soft tissue sarcoma in adults. Cochrane Database Syst Rev. 2000.
2. Dijkman B, Kooistra B, Bhandari M; Evidence-Based Surgery Working Group. How to work with a subgroup analysis. Can J Surg. 2009;52:515-22.
3. Rothwell PM. Treating individuals 2. Subgroup analysis in randomised controlled trials: importance, indications, and interpretation. Lancet. 2005;365:176-86.
4. Hirji KF, Fagerland MW. Outcome based subgroup analysis: a neglected concern. Trials. 2009;10:33.
5. Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med. 2002;21:2917-30.
6. Assmann SF, Pocock SJ, Enos LE, Kasten LE. Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet. 2000;355:1064-9.
7. Sun X, Briel M, Walter SD, Guyatt GH. Is a subgroup effect believable? Updating criteria to evaluate the credibility of subgroup analyses. BMJ. 2010;340.
8. Bhandari M, Devereaux PJ, Li P, Mah D, Lim K, Schünemann HJ, Tornetta P 3rd. Misuse of baseline comparison tests and subgroup analyses in surgical trials. Clin Orthop Relat Res. 2006;447:247-51.
9. Oxman AD, Guyatt GH. A consumer’s guide to subgroup analyses. Ann Intern Med. 1992;116:78-84.
10. Sleight P. Debate: Subgroup analyses in clinical trials: fun to look at - but don’t believe them! Curr Control Trials Cardiovasc Med. 2000;1:25-7.
11. Wikstrand J, Wedel H, Ghali J, Deedwania P, Fagerberg B, Goldstein S, Gottlieb S, Hjalmarson A, Kjekshus J, Waagstein F. How should subgroup analyses affect clinical practice? Insights from the Metoprolol Succinate Controlled-Release/Extended-Release Randomized Intervention Trial in Heart Failure (MERIT-HF). Card Electrophysiol Rev. 2003;7:264-75.
12. Bristol DR. p-value adjustments for subgroup analyses. J Biopharm Stat. 1997;7:313-21.
13. Cleophas TJ, Zwinderman AH. Random effects models in clinical research. Int J Clin Pharmacol Ther. 2008;46:421-7.
14. Song F, Altman DG, Glenny AM, Deeks JJ. Validity of indirect comparison for estimating efficacy of competing interventions: empirical evidence from published meta-analyses. BMJ. 2003;326:472.
15. Brookes ST, Whitely E, Egger M, Smith GD, Mulheran PA, Peters TJ. Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. J Clin Epidemiol. 2004;57:229-36.
16. Grobmyer SR, Brennan MF. Predictive variables detailing the recurrence rate of soft tissue sarcomas. Curr Opin Oncol. 2003;15:319-26.
17. Kattan MW, Leung DH, Brennan MF. Postoperative nomogram for 12-year sarcoma-specific death. J Clin Oncol. 2002;20:791-6.
18. Guyatt GH, Rennie D, Meade MO, Cook DJ. Users’ guides to the medical literature: a manual for evidence-based clinical practice. 2nd ed. New York: McGraw-Hill; 2008.