Is the Difference Suggested by Comparisons within Rather Than Between Studies?
Inferences regarding differential effects on the basis of between-study comparisons are much weaker than those based on within-study comparisons. By within-study comparisons, we mean a situation in which patients in the subgroups under consideration were all enrolled in the same trial or trials. By between-study comparisons, we mean a situation in which patients in the subgroups of interest were enrolled in different trials, each of which addressed only one of the subgroups of interest.
Consider a situation in which Trial A enrolls exclusively patients with closed fractures and finds that reamed intramedullary nailing is superior to unreamed intramedullary nailing. Trial B enrolls only those with open fractures and reports results favoring unreamed intramedullary nailing. What are the possible explanations of the finding?
Although one might be tempted to attribute the difference to the different study samples (those with closed and those with open fractures), there are many other possibilities. Trial A may have minimized the risk of bias by concealed allocation, blinding outcome assessors, and achieving nearly complete follow-up. Trial B may have failed in all of these regards, and the different risk of bias may be responsible for the different results. There may be other differences in the patients’ characteristics (age or degree of osteoporosis) that could explain the differences. Surgeons in Trial A may have been more experienced in the reamed procedure and surgeons in Trial B, in the unreamed procedure. Trial A may have measured only short-term outcomes, while Trial B followed patients for a longer period of time. Finally, chance may explain the apparent difference of treatment effects between subgroups. By chance, we mean that an apparent difference has arisen when the underlying truth is that there is no difference in effect.
What if, however, patients with open and closed fractures are participating in a single trial in which the reamed procedure appeared superior in those with closed fractures and the unreamed procedure appeared superior in those with open fractures? Trial methods, eligibility criteria, and surgeon expertise are likely to be identical for all patients. When, as is likely to be the case within a single trial, methods are identical for the patients in the subgroups of interest, we are left with only two compelling explanations: the subgroup effect is real, or chance is responsible for the apparent difference.
In our example, the comparisons of treatment effects between closed and open fractures, and between current smokers and nonsmokers or former smokers, are made within a single study. This increases the credibility of both apparent differential effects.
Does the Interaction Test Suggest a Low Probability That Chance Explains the Apparent Subgroup Effect?
When examining subgroup hypotheses, one must address the probability that the observed differences in effects can be explained by chance. The statistical approach that addresses this fundamental issue is called a test for interaction (the interaction meaning that the effect differs across subgroup categories such as patients with open fractures compared with patients with closed fractures)20. Typically, a test of interaction compares the estimated treatment effects (measured as relative risk, odds ratio, or difference in mean change) in subgroups of the study population and addresses the probability of observing differences as great as or greater than those observed if there is truly no difference in effect between subgroups. An inappropriate, although frequently used, approach is to separately test the significance of the effect in each subgroup category. Such analyses fail to address the fundamental issue: can the difference in effect between subgroup categories be explained by chance?
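The contrast between separate subgroup tests and a test for interaction can be sketched as follows. This is a minimal illustration, not the analysis used in the trial: the relative risks and confidence intervals are hypothetical values chosen for the example, the function names are illustrative, and the calculation assumes that the log relative risks are approximately normally distributed and that the two subgroups are independent.

```python
import math

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_rr_and_se(rr, lower, upper):
    """Return log(RR) and its standard error recovered from a 95% CI."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z95)
    return math.log(rr), se


def two_sided_p(z):
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def subgroup_p(rr, lower, upper):
    """P value for the treatment effect within one subgroup (H0: RR = 1)."""
    log_rr, se = log_rr_and_se(rr, lower, upper)
    return two_sided_p(log_rr / se)


def interaction_p(subgroup_a, subgroup_b):
    """P value for the difference in log(RR) between two independent subgroups."""
    log_a, se_a = log_rr_and_se(*subgroup_a)
    log_b, se_b = log_rr_and_se(*subgroup_b)
    z = (log_a - log_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return two_sided_p(z)


# Hypothetical subgroups (not trial data): (RR, lower 95% limit, upper 95% limit)
group_a = (0.70, 0.50, 0.98)
group_b = (0.80, 0.55, 1.16)

print("subgroup A p:", round(subgroup_p(*group_a), 3))
print("subgroup B p:", round(subgroup_p(*group_b), 3))
print("interaction p:", round(interaction_p(group_a, group_b), 3))
```

With these hypothetical inputs, the effect reaches the conventional threshold in subgroup A but not in subgroup B, yet the interaction test gives no indication that the two effects differ. Reading the two separate tests as evidence of a subgroup effect would therefore be misleading.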
The null hypothesis of the test for interaction is that there is no difference in the underlying true effect between subgroup categories. The lower the p value, the less likely it is that chance explains the apparent subgroup effect. Typically, investigators use the usual threshold p value of 0.05. Inevitably, the choice of threshold involves subjective judgment. An approach that avoids the arbitrariness of a single threshold is to consider that the larger the p value (e.g., >0.1), the more likely that chance explains the apparent difference in subgroup effects and the smaller the p value, the less likely that chance explains the apparent difference. The p value for the test of interaction is associated with sample size; the larger the sample size, the more likely the null hypothesis will be rejected if a subgroup effect exists. Sometimes investigators may examine post hoc the power of testing the subgroup hypotheses. This endeavor does not help in deciding on the credibility of the subgroup analysis: the issue of relevance is the possibility that chance could explain the observed findings.
Our analysis showed a small interaction p value for the test of the subgroup hypothesis by fracture type (p = 0.01), small enough to suggest that the subgroup effect is unlikely to be explained by chance; the p value for the test of the smoking hypothesis was even smaller (p = 0.0013). These results strengthen the inference that the two subgroup hypotheses represent real effects.
Are the Subgroup Hypothesis and Its Direction Specified a Priori?
One may specify the subgroup hypothesis before or after the data are disclosed. Typically, a priori specification, which is driven by previous research evidence and/or biological rationale, represents careful consideration by the investigators regarding the possibilities of a significant interaction. At the other extreme, conducting subgroup analyses post hoc is likely to be data-driven: investigators highlight an apparent subgroup effect only after discovering it in the data, the so-called "data-fishing" approach. Accurate specification of the direction of the subgroup hypothesis a priori (for instance, specifying that reamed nailing will be superior in closed fractures and unreamed nailing in open fractures, rather than suggesting only that effects may differ in open and closed fractures) further strengthens the credibility of the subgroup inference (and lack of specification—or getting the direction wrong—undermines it). A desirable approach is for researchers to state explicitly in study protocols their subgroup hypotheses and the direction of the hypothesized subgroup effect.
Closed Compared with Open Fractures
Our subgroup hypothesis by fracture type (i.e., open and closed) was specified at the stage of trial design and was the subgroup hypothesis of primary interest. Not only was the hypothesis a priori, but the direction of the effect based on a compelling biological rationale was correctly specified. This enhances its credibility.
Smoking
The smoking subgroup hypothesis was specified only after the initial analysis was complete and would never have been explored had it not been part of an exercise for an advanced statistics course. At the time the smoking analysis was initially conducted, we had no hypothesis about its direction. In our blinded surveys, orthopaedic surgeons chose smoking as the most probable of a number of additional hypotheses tested as part of the exercise for the statistics course. However, surgeons were split on the direction of the effect (i.e., whether smokers would do better with reamed or unreamed nailing). The uncertainty about the direction of the subgroup effect among expert orthopaedic surgeons suggests the absence of a compelling biological rationale. To the extent that one finds the presence of a compelling biological rationale important (some may not), uncertainty about the direction of effect would reduce the strength of inference regarding the presence of an underlying subgroup effect.
Is This One of a Small Number of Subgroup Hypotheses Tested?
Typically, one test of interaction carries a small risk of a false-positive finding. Multiple tests of interactions increase the possibility of a false-positive conclusion, and the more tests conducted, the greater the problem. Thus, a large number of tests of subgroup hypotheses may compromise the strength of a priori specification, and the credibility of significant subgroup effects decreases.
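The growth in false-positive risk with the number of tests can be illustrated with a short calculation. This is a sketch under simplifying assumptions that the text does not state: every null hypothesis is true, the tests are independent, and each uses a threshold of 0.05. The function name is illustrative, and the test counts are examples only.

```python
def familywise_false_positive_risk(n_tests, alpha=0.05):
    """Probability of at least one false-positive result among n independent tests,
    assuming that no true subgroup effect exists for any of them."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 7, 12):
    print(n, round(familywise_false_positive_risk(n), 3))
```

Under these assumptions, the chance of at least one spurious significant interaction rises from 5% with a single test to roughly 30% with seven tests and roughly 46% with twelve. Correlated tests would change the exact figures but not the direction of the problem.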
Closed Compared with Open Fractures
Our subgroup hypothesis by fracture type was one of the seven a priori hypotheses tested. It is also the subgroup hypothesis of primary interest as reflected in our stratification of randomization by open and closed fracture. This strengthens its credibility.
Smoking
After the data were disclosed and treatment allocation unblinded, we tested five post hoc subgroup hypotheses (beyond the seven we had generated a priori). These hypotheses were specified independently of the previous seven; the hypothesis that smoking might influence the magnitude of effect was one of the five post hoc subgroup hypotheses tested. One could view this hypothesis as one of twelve (i.e., seven a priori and five post hoc), in which case the relatively large number would weaken the subgroup inference, or as one of five post hoc hypotheses, in which case the relatively small number would not weaken the inference as much.
Is the Magnitude of Subgroup Effect Large?
The apparent treatment effect will inevitably differ among subgroup categories (e.g., open compared with closed and current smokers compared with nonsmokers or former smokers). Small differences in effects across subgroup categories are likely explained by chance; the larger the difference in effects between subgroup categories, the more likely the difference represents a true interaction. Large differences in the presence of small sample sizes may, however, occur by chance.
To determine the possibility that chance explains the apparent difference, an alternative to the statistical test of heterogeneity is to consider the confidence interval around the magnitude of subgroup effect21. For a continuous outcome, authors can present the difference between the subgroup effects. If the outcome is a binary variable, they may present the ratio of relative risks (or the ratio of hazard ratios if the outcome is time-to-event data). In the presence of a qualitative interaction (i.e., treatment is beneficial in one subgroup, whereas it is harmful in another), however, interpretation of a confidence interval around the magnitude becomes problematic. In this situation, we recommend considering a point estimate only.
Consider a subgroup analysis showing that a treatment reduces the risk of pulmonary embolism by 58% in patients over sixty years old (RR, 0.42; 95% CI, 0.23 to 0.75) and by 20% in patients sixty years old or less (RR, 0.80; 95% CI, 0.68 to 0.95; test for interaction, p = 0.034). This indicates that the treatment effect (i.e., RR) on the reduction of pulmonary embolism is nearly twice as great in patients over sixty years old as in others (i.e., 0.80/0.42 ≈ 2); the 95% confidence interval around this ratio is 1.05 to 3.58.
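The ratio of relative risks and its confidence interval can be approximated from the published subgroup estimates alone. The sketch below is one common way to do so, not necessarily the method behind the figures quoted above: it recovers each standard error from the reported 95% confidence interval on the log scale and assumes that the two subgroups are independent. The function name is illustrative. Because the inputs are rounded, the result should be expected to land close to, rather than exactly on, the quoted interval and p value.

```python
import math

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def ratio_of_relative_risks(numerator, denominator):
    """Ratio of two subgroup relative risks with an approximate 95% CI and p value.

    Each argument is a tuple (RR, lower 95% limit, upper 95% limit).
    Assumes independent subgroups and approximate normality of log(RR).
    """
    def log_and_se(rr, lower, upper):
        # Standard error recovered from the width of the CI on the log scale
        return math.log(rr), (math.log(upper) - math.log(lower)) / (2 * Z95)

    log_num, se_num = log_and_se(*numerator)
    log_den, se_den = log_and_se(*denominator)

    log_ratio = log_num - log_den
    se_ratio = math.sqrt(se_num ** 2 + se_den ** 2)

    ratio = math.exp(log_ratio)
    lower = math.exp(log_ratio - Z95 * se_ratio)
    upper = math.exp(log_ratio + Z95 * se_ratio)
    p_value = math.erfc(abs(log_ratio / se_ratio) / math.sqrt(2))
    return ratio, lower, upper, p_value


# Subgroup estimates quoted in the text
sixty_or_less = (0.80, 0.68, 0.95)
over_sixty = (0.42, 0.23, 0.75)

ratio, lower, upper, p_value = ratio_of_relative_risks(sixty_or_less, over_sixty)
print(f"ratio of RRs = {ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}, p = {p_value:.3f}")
```

An interval that excludes 1 corresponds to an interaction p value below 0.05, so the confidence interval and the test for interaction convey the same information about the play of chance.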
Both the smoking and fracture type subgroup analyses yielded large and qualitative subgroup effects. Reamed intramedullary nailing reduced the relative risk of reoperation by 33% in closed fractures, but increased the risk by 27% in open fractures. Reamed nailing reduced the relative risk of reoperation by 32% in nonsmokers or former smokers, whereas it increased the risk by 56% in current smokers. These large differences of treatment effects across subgroups increase the credibility of the subgroup hypotheses.
Is the Observed Differential Effect Consistent Across Studies?
Even small p values do not exclude the possibility that chance is the true explanation for an apparent subgroup effect; this is particularly true when investigators test multiple hypotheses. The more often, and more consistently, the subgroup effect is replicated in additional trials, the stronger the inference. Indeed, failure to reproduce an apparent subgroup effect has revealed the spurious nature of many previous subgroup claims. Ideally, a rigorous systematic review, which provides an overview of the subgroup findings across studies, will confirm or refute the consistency of subgroup effects. Sometimes, however, studies included in systematic reviews and meta-analyses may not provide sufficient data regarding results in the patient subgroups of interest to adequately address the issue. In such situations, meta-analyses can neither confirm nor refute the reproducibility of a subgroup analysis suggested by a single trial.
Closed Compared with Open Fractures
Our meta-analysis of five randomized trials22-26 examined the relative impact of reamed compared with unreamed nailing on the reoperation rate in open and closed fractures (Fig. 2). This review described studies suggesting that reamed nailing was superior in both open and closed fractures. The previous studies were small, however, and suffered from important limitations including lack of concealment, lack of blinding of the outcome assessment, and substantial loss to follow-up. In any case, the data provide no support for the subgroup hypothesis (i.e., the other studies fail to reproduce the subgroup effect of differences in the impact of reamed compared with unreamed nailing in closed compared with open fractures).
Smoking
We did not identify any observational or randomized trial evidence addressing the possibility of differential effects by smoking status. Other studies therefore fail to provide supporting evidence for our inferences regarding the smoking subgroup effect.
Is There Indirect Evidence Supporting the Hypothesized Differential Effects?
The presence of indirect evidence strengthens belief in hypothesized subgroup effects. Typically, indirect evidence comprises several types of evidence, including basic science studies, physiological studies, and animal studies. Another way to describe this criterion would be: Is there a strong biological rationale for the putative subgroup effect?
The search for a biological rationale to explain an apparent subgroup effect is, given sufficient imagination, almost always successful. This limits the value of this criterion in providing compelling support for a subgroup hypothesis.
Closed Compared with Open Fractures
Intact or minimally damaged soft tissue and periosteum in the closed fractures might result in greater tolerability of reamed nailing. Thus, the added stability of reamed nailing might prove advantageous. On the other hand, devascularization in open fractures may render the bone vulnerable to the vascular compromise associated with reaming and may severely compromise the benefit of reamed nailing18.
Smoking
Neither animal nor other relevant studies exist to support the smoking subgroup hypothesis. This, however, did not impair the ability of orthopaedic surgeons, blinded to the direction of the apparent effect, to generate a compelling biological rationale. Unfortunately, other surgeons generated an equally compelling rationale for an effect in the opposite direction.
Note: Details regarding the authors and investigators are provided below.
Author Contributions
Writing Committee: The Writing Committee (Xin Sun, PhD [Chair], Diane Heels-Ansdell, MSc, Sheila Sprague, MSc, Mohit Bhandari, MD, MSc, Stephen D. Walter, PhD, David Sanders, MD, Emil Schemitsch, MD, Paul Tornetta III, MD, Marc Swiontkowski, MD, and Gordon Guyatt, MD, MSc) assumes responsibility for the overall content and integrity of the manuscript.
Study concept and design: Gordon Guyatt, Xin Sun, Mohit Bhandari, Stephen Walter, Paul Tornetta III, Emil Schemitsch, David Sanders, Marc Swiontkowski.
Analysis and interpretation of data: Xin Sun, Gordon Guyatt, Diane Heels-Ansdell, Sheila Sprague, Stephen Walter, Mohit Bhandari, Paul Tornetta III, Emil Schemitsch, David Sanders, Marc Swiontkowski.
Drafting of the manuscript: Xin Sun, Diane Heels-Ansdell, Sheila Sprague, Mohit Bhandari, Gordon Guyatt.
Critical revision of the manuscript for important intellectual content: Mohit Bhandari, Gordon Guyatt, Stephen Walter, Paul Tornetta III, Emil Schemitsch, David Sanders, Marc Swiontkowski.
Statistical analysis: Xin Sun, Diane Heels-Ansdell, Stephen Walter, Gordon Guyatt, Mohit Bhandari.
Obtained funding: Mohit Bhandari, Gordon Guyatt, Marc Swiontkowski, Paul Tornetta III.
Study supervision: Mohit Bhandari, Gordon Guyatt, Stephen Walter, Paul Tornetta III, Emil Schemitsch, David Sanders, Marc Swiontkowski.
Role of Sponsors/Funders
The funding sources had no role in the design or conduct of the study; the collection, management, analysis, or interpretation of the data; or the preparation, review, or approval of the manuscript.
SPRINT Investigators
The following persons participated in the SPRINT Study:
Study trial co-principal investigators: Mohit Bhandari, Gordon Guyatt.
Steering Committee: Gordon Guyatt (Chair), Mohit Bhandari, David W. Sanders, Emil H. Schemitsch, Marc Swiontkowski, Paul Tornetta III, Stephen Walter.
Central Adjudication Committee: Gordon Guyatt (Chair), Mohit Bhandari, David W. Sanders, Emil H. Schemitsch, Marc Swiontkowski, Paul Tornetta III, Stephen Walter.
Steering / Adjudication / Writing Committee: Gordon Guyatt (Chair), Mohit Bhandari, David W. Sanders, Emil H. Schemitsch, Marc Swiontkowski, Paul Tornetta III, Stephen Walter.
SPRINT Methods Center staff:
McMaster University, Hamilton, Ontario: Sheila Sprague, Diane Heels-Ansdell, Lisa Buckingham, Pamela Leece, Helena Viveiros, Tashay Mignott, Natalie Ansell, Natalie Sidorkewicz.
University of Minnesota, Minneapolis, Minnesota: Julie Agel.
Data Safety and Monitoring Board (DSMB): Claire Bombardier (Chair), Jesse A. Berlin, Michael Bosse, Bruce Browner, Brenda Gillespie, Peter O'Brien.
Site Audit Committee: Julie Agel, Sheila Sprague, Rudolf Poolman, Mohit Bhandari.
Study Sites
London Health Sciences Centre / University of Western Ontario, London, Ontario: David W. Sanders, Mark D. Macleod, Timothy Carey, Kellie Leitch, Stuart Bailey, Kevin Gurr, Ken Konito, Charlene Bartha, Isolina Low, Leila V. MacBean, Mala Ramu, Susan Reiber, Ruth Strapp, Christina Tieszer.
Sunnybrook Health Sciences Centre / University of Toronto, Toronto, Ontario: Hans Kreder, David J.G. Stephen, Terry S. Axelrod, Albert J.M. Yee, Robin R. Richards, Joel Finkelstein, Richard M. Holtby, Hugh Cameron, John Cameron, Wade Gofton, John Murnaghan, Joseph Schatztker, Beverly Bulmer, Lisa Conlan.
Hospital du Sacre Coeur de Montreal, Montreal, Quebec: Yves Laflamme, Gregory Berry, Pierre Beaumont, Pierre Ranger, Georges-Henri Laflamme, Alain Jodoin, Eric Renaud, Sylvain Gagnon, Gilles Maurais, Michel Malo, Julio Fernandes, Kim Latendresse, Marie-France Poirier, Gina Daigneault.
St. Michael's Hospital / University of Toronto, Toronto, Ontario: Emil H. Schemitsch, Michael M. McKee, James P. Waddell, Earl R. Bogoch, Timothy R. Daniels, Robert R. McBroom, Robin R. Richards, Milena R. Vicente, Wendy Storey, Lisa M. Wild.
Royal Columbian Hospital / University of British Columbia, Vancouver, British Columbia: Robert McCormack, Bertrand Perey, Thomas J. Goetz, Graham Pate, Murray J. Penner, Kostas Panagiotopoulos, Shafique Pirani, Ian G. Dommisse, Richard L. Loomer, Trevor Stone, Karyn Moon, Mauri Zomar.
Wake Forest Medical Center / Wake Forest University Health Sciences, Winston-Salem, North Carolina: Lawrence X. Webb, Robert D. Teasdall, John Peter Birkedal, David Franklin Martin, David S. Ruch, Douglas J. Kilgus, David C. Pollock, Mitchel Brion Harris, Ethan Ron Wiesler, William G. Ward, Jeffrey Scott Shilt, Andrew L. Koman, Gary G. Poehling, Brenda Kulp.
Boston Medical Center / Boston University School of Medicine, Boston, Massachusetts: Paul Tornetta III, William R. Creevy, Andrew B. Stein, Christopher T. Bono, Thomas A. Einhorn, T. Desmond Brown, Donna Pacicca, John B. Sledge III, Timothy E. Foster, Ilva Voloshin, Jill Bolton, Hope Carlisle, Lisa Shaughnessy.
Wake Medical Center, Raleigh, North Carolina: William T. Ombremskey, C. Michael LeCroy, Eric G. Meinberg, Terry M. Messer, William L. Craig III, Douglas R. Dirschl, Robert Caudle, Tim Harris, Kurt Elhert, William Hage, Robert Jones, Luis Piedrahita, Paul O. Schricker, Robin Driver, Jean Godwin, Gloria Hansley.
Vanderbilt University Medical Center, Nashville, Tennessee: William Todd Obremskey, Philip James Kregor, Gregory Tennent, Lisa M. Truchan, Marcus Sciadini, Franklin D. Shuler, Robin E. Driver, Mary Alice Nading, Jacky Neiderstadt, Alexander R. Vap.
MetroHealth Medical Center, Cleveland, Ohio: Heather A. Vallier, Brendan M. Patterson, John H. Wilber, Roger G. Wilber, John K. Sontich, Timothy Alan Moore, Drew Brady, Daniel R. Cooperman, John A. Davis, Beth Ann Cureton.
Hamilton Health Sciences, Hamilton, Ontario: Scott Mandel, R. Douglas Orr, John T.S. Sadler, Tousief Hussain, Krishan Rajaratnam, Bradley Petrisor, Mohit Bhandari, Brian Drew, Drew A. Bednar, Desmond C.H. Kwok, Shirley Pettit, Jill Hancock, Natalie Sidorkewicz.
Regions Hospital, Saint Paul, Minnesota: Peter A. Cole, Joel J. Smith, Gregory A. Brown, Thomas A. Lange, John G. Stark, Bruce Levy, Marc F. Swiontkowski, Julie Agel, Mary J. Garaghty, Joshua G. Salzman, Carol A. Schutte, Linda (Toddie) Tastad, Sandy Vang.
University of Louisville School of Medicine, Louisville, Kentucky: David Seligson, Craig S. Roberts, Arthur L. Malkani, Laura Sanders, Sharon Allen Gregory, Carmen Dyer, Jessica Heinsen, Langan Smith, Sudhakar Madanagopal.
Memorial Hermann Hospital, Houston, Texas: Kevin J. Coupe, Jeffrey J. Tucker, Allen R. Criswell, Rosemary Buckle, Alan Jeffrey Rechter, Dhiren Shaskikant Sheth, Brad Urquart, Thea Trotscher.
Erie County Medical Center / University of Buffalo, Buffalo, New York: Mark J. Anders, Joseph M. Kowalski, Marc S. Fineberg, Lawrence B. Bone, Matthew J. Phillips, Bernard Rohrbacher, Philip Stegemann, William M. Mihalko, Cathy Buyea.
University of Florida at Jacksonville, Jacksonville, Florida: Stephen J. Augustine, William Thomas Jackson, Gregory Solis, Sunday U. Ero, Daniel N. Segina, Hudson B. Berrey, Samuel G. Agnew, Michael Fitzpatrick, Lakina C. Campbell, Lynn Derting, June McAdams.
Academic Medical Center, Amsterdam, The Netherlands: J. Carel Goslings, Kees Jan Ponsen, Jan Luitse, Peter Kloen, Pieter Joosse, Jasper Winkelhagen, Raphaël Duivenvoorden.
University of Oklahoma Health Science Center, Oklahoma City, Oklahoma: David C. Teague, Joseph Davey, J. Andy Sullivan, William J.J. Ertl, Timothy A. Puckett, Charles B. Pasque, John F. Tompkins II, Curtis R. Gruel, Paul Kammerlocher, Thomas P. Lehman, William R. Puffinbarger, Kathy L. Carl.
University of Alberta / University of Alberta Hospital, Edmonton, Alberta: Donald W. Weber, Nadr M. Jomha, Gordon R. Goplen, Edward Masson, Lauren A. Beaupre, Karen E. Greaves, Lori N. Schaump.
Greenville Hospital System, Greenville, South Carolina: Kyle J. Jeray, David R. Goetz, David E. Westberry, J. Scott Broderick, Bryan S. Moon, Stephanie L. Tanner.
Foothills General Hospital, Calgary, Alberta: James N. Powell, Richard E. Buckley, Leslie Elves.
Saint John Regional Hospital, Saint John, New Brunswick: Stephen Connolly, Edward P. Abraham, Donna Eastwood, Trudy Steele.
Oregon Health and Sciences University, Portland, Oregon: Thomas Ellis, Alex Herzberg, George A. Brown, Dennis E. Crawford, Robert Hart, James Hayden, Robert M. Orfaly, Theodore Vigland, Maharani Vivekaraj, Gina L. Bundy.
San Francisco General Hospital, San Francisco, California: Theodore Miclau III, Amir Matityahu, R. Richard Coughlin, Utku Kandemir, R. Trigg McClellan, Cindy Hsin-Hua Lin.
Detroit Receiving Hospital, Detroit, Michigan: David Karges, Kathryn Cramer, J. Tracy Watson, Berton Moed, Barbara Scott.
Deaconess Hospital Regional Trauma Center/Orthopaedic Associates, Evansville, Indiana: Dennis J. Beck, Carolyn Orth.
Thunder Bay Regional Health Science Centre, Thunder Bay, Ontario: David Puskas, Russell Clark, Jennifer Jones.
Jamaica Hospital, Jamaica, New York: Kenneth A. Egol, Nader Paksima, Monet France.
Ottawa Hospital—Civic Campus, Ottawa, Ontario: Eugene K. Wai, Garth Johnson, Ross Wilkinson, Adam T. Gruszczynski, Liisa Vexler.