Abstract
Randomized controlled trials (RCTs) constitute the gold standard for the generation of evidence-based medicine, but may not always be feasible. Furthermore, randomization alone does not guarantee the utility of the research, as evidenced by thousands of uninformative RCTs documented in the literature. Observational studies, including longitudinal, retrospective, and case-control designs, can contribute to the body of evidence in meaningful ways, provide useful information when an RCT is unethical or not feasible, generate hypotheses for RCTs, or provide preliminary work to better inform design of future RCTs. They can also be used to study rare outcomes, risk factors, and side effects, and to examine whether results from RCTs translate into effective treatment in routine practice. Use of modern statistical techniques, both in the study design and in the analysis stage, can improve the usefulness of the evidence obtained from observational studies.
Orthopaedic surgery, like other fields of clinical practice, is being transformed by the rise of evidence-based medicine (EBM), which is the application of scientific principles to the practice of medicine. As a consequence, there is increasing demand for scientific studies that deliver unconfounded measures of treatment effect (i.e., internal validity) and are as generalizable as possible in terms of setting and population (i.e., external validity). While randomized controlled trials (RCTs) offer the highest internal validity and are considered to be the gold standard for EBM, there are many settings where such studies are unethical, not feasible, or lack sufficient external validity. Thus, as we progress further into the era of RCTs, observational studies will continue to play a key role in the generation of EBM.
In this paper, we review the limitations of RCTs and examine the question of whether observational studies can fill in the gaps of EBM when RCTs are not able to do so. We also examine the major types of observational designs and the conditions under which such designs can provide useful information and contribute to EBM, and review some of the key methodological approaches to the analysis of observational studies. We then discuss our own experience with two large multicenter observational research efforts: the National Study on Costs and Outcomes of Trauma (NSCOT) and the Lower Extremity Assessment Project (LEAP). Finally, we review the studies being conducted by the Major Extremity Trauma Research Consortium (METRC). While METRC was created with the explicit goal of advancing orthopaedic trauma care through clinical trials, observational studies are a major part of our research portfolio. The METRC experience illustrates the key role observational studies play in the generation of EBM.
Because of the randomization of treatment assignment, RCTs have the highest internal validity among study designs. In theory, randomization ensures that the treatment groups are balanced with respect to both measured and, more importantly, unmeasured factors associated with outcomes. Because of the power of randomization, RCTs rank as major contributors to the hierarchy of evidence (Fig. 1), and systematic reviews of RCTs constitute the definitive source of EBM.
Despite the increased emphasis on randomized studies, EBM is faced with an “evidence paradox.” While it is estimated that over 18,000 RCTs are published each year, many systematic reviews, health technology assessments, and clinical guideline development efforts conclude that the available evidence is limited or studies are of poor quality. This suggests that while the RCT is certainly the gold standard for ensuring internal validity, the design suffers from a number of practical limitations.
In settings where RCTs are both ethical and feasible, they also have the potential for low external validity. While there is no inherent advantage for observational studies with respect to generalizability, the many requirements placed on RCTs by the risks of conducting clinical experiments on humans generally affect the setting and population that are used in these types of studies. The constraints and difficulties of enrolling patients into clinical experiments often result in study designs that have little relevance to real-world practice. These constraints include achieving equipoise among clinicians, identifying (and enrolling) a narrow segment of patients with the appropriate balance of risks and benefits, and the infeasibility of powering the study to be able to detect treatment effects in heterogeneous subpopulations. Typically, this results in trials that exclude a substantial portion of the target population, and studies with highly intensive or regimented treatment protocols that do not reflect the realities of routine clinical practice.
Given the limitations of RCTs, a key question is whether observational study designs can fill in the gaps of EBM. The credibility of observational epidemiology has been challenged because of high-profile instances where the results of observational studies have been viewed as being contradicted by subsequent randomized trials. A recent example is the controversy surrounding hormone replacement therapy (HRT) and its relationship to coronary heart disease (CHD). In early observational studies, HRT was found to confer protection from CHD, and these results were used to guide clinical practice. However, subsequent randomized trials found that HRT conferred either no effect or a small increased risk of CHD [1]. There are two explanations for these apparent discrepancies. First, many observational studies inadequately controlled for confounding factors. The studies that failed to adjust for socioeconomic status tended to show that HRT was protective for CHD. This is because women who take HRT are more likely to have higher socioeconomic status; women with higher socioeconomic status tend to have healthier lifestyles and better access to preventive services, and therefore have a lower risk for CHD. Second, there is great heterogeneity in the definitions of HRT exposure (e.g., ever, past, recent, or current) across observational studies, making comparisons to HRT exposure in randomized studies, which are more carefully controlled and monitored, less clear. Overall, meta-analyses of the observational studies that accounted for socioeconomic status resulted in the same conclusions as the randomized studies [1].
Similarly, other studies have found that the results of observational studies have been well replicated by randomized trials. Benson and Hartz summarized the evidence from fifty-three observational studies and eighty-three RCTs for nineteen therapeutic comparisons [2]. In only two of the nineteen analyses did the combined effect in observational studies lie outside the 95% confidence interval for the combined effect in the RCTs. Within orthopaedics, Bhandari et al. examined mortality and revision rates in twenty-seven studies (thirteen observational studies and fourteen randomized trials), comparing arthroplasty and internal fixation for patients with femoral neck fractures [3]. They found that the observational studies that controlled for important risk factors in the analysis had results similar to RCTs.
Overall, it appears that well-designed and analyzed observational studies can yield important information about treatment effects. They are particularly useful when an RCT is unethical or not feasible. They can also generate hypotheses for RCTs; provide preliminary work to better inform design of future RCTs; and can be used to study rare outcomes, risk factors, and side effects. Finally, observational studies can also be used to examine whether RCT results translate into effective treatment in routine practice.
There are three main types of observational studies: cohort studies, case-control studies, and case series. Cohort studies provide the highest level of evidence among observational designs. Cohort studies can be further divided into prospective and retrospective designs. Prospective cohort studies collect baseline data before treatment, and then patients are followed up over time for ascertainment of key end points. Retrospective cohort studies typically abstract data from medical records about treatment and baseline risk, even if outcomes are collected prospectively. In general, prospective data collection improves the validity of the study results because one can be sure about the temporal relationship between observations: data at baseline, treatment, and follow-up were gathered at separate, known time points. Furthermore, there is an expectation that prospective data are of higher quality and that data collection instruments have been tailored to answer the scientific questions. In retrospective designs, baseline and treatment data are generally collected at the same time point, and generally at least some of the data are gathered from nonoptimal sources, which may not contain all of the desired information or may be subject to measurement error. Thus, it may be impossible to establish the temporal relationship between the baseline and treatment variables, adequately control for confounding, or have accurate information about exposure to treatment.
The second category of observational studies is case-control designs, which are useful when studying rare outcomes. In case-control studies, patients with (cases) and without (controls) disease are identified; exposure status is then ascertained for these individuals. To minimize bias, the control group must be carefully selected. Matching the controls to the cases with respect to risk factors is a common way of selecting a control group. In fact, there are computerized matching algorithms that make it possible to perform matching on dozens of factors [4]. The same concerns raised about retrospective cohort studies apply to case-control studies. Case-control designs cannot be used to draw definitive inference about treatment effects.
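As a rough illustration of what a computerized matching step can look like, the sketch below pairs each case with the closest still-unused control on a single numeric score, within a tolerance (caliper). This is a simple greedy scheme chosen for brevity and is not claimed to be the cited algorithm; the function name `greedy_match`, the caliper, and the data are hypothetical. Matching on many factors would replace the absolute difference with a multi-factor distance.

```python
def greedy_match(cases, controls, caliper):
    """Pair each case with its nearest unused control.

    cases, controls: dicts mapping a patient id to a numeric matching score.
    caliper: largest score difference accepted for a match.
    Returns a dict mapping case id to the matched control id.
    """
    available = dict(controls)  # controls not yet used in a match
    matches = {}
    for case_id, case_score in cases.items():
        best = None  # (control id, distance) of the closest candidate so far
        for control_id, control_score in available.items():
            distance = abs(case_score - control_score)
            if distance <= caliper and (best is None or distance < best[1]):
                best = (control_id, distance)
        if best is not None:
            matches[case_id] = best[0]
            del available[best[0]]  # each control is used at most once
    return matches


# Hypothetical example: match on age with a 5-year caliper.
cases = {"case1": 50, "case2": 61}
controls = {"ctrl1": 52, "ctrl2": 60, "ctrl3": 90}
matched = greedy_match(cases, controls, caliper=5)
```

Greedy matching depends on the order in which cases are processed, so it does not in general minimize the total distance across all pairs.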
Finally, case series are primarily useful for providing information on rare complications and adverse events that could not feasibly be studied another way, or for providing data for hypothesis generation, leading to additional studies. Case series generally make no attempt to control for selection bias, and are thus not appropriate for making inferences about treatment effects.
Here we focus on observational data where there is a clear temporal relationship between risk factors, treatment, and outcomes (e.g., continuous, binary, or count). That is, risk factors precede the treatment decision, which precedes the outcomes. To draw causal inferences from observational data, it is useful to conceptualize the outcomes an individual would have under the competing treatment modalities. These are called potential outcomes (only one of which is observed, the other being counterfactual). The causal effect is defined as a contrast between the population average of the outcome under one treatment versus the population average of the outcome under a competing treatment. This effect is not conditional on covariates. To estimate this contrast, it is typically assumed that there are no unmeasured confounders, which essentially means that the treatment decision is like the flip of a coin, with the probability of “heads” depending only on factors that are measured and available prior to the decision. Under this assumption, methods such as propensity scores, inverse probability of treatment weighting, and G-computation techniques can be employed to estimate the causal contrast of interest [5-8]. Instrumental variables can be used to estimate a different type of causal effect and do not require the assumption of no unmeasured confounding.
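In symbols, these definitions can be sketched as follows. The notation is introduced here only for illustration: $T$ is the treatment indicator, $X$ the measured pre-treatment risk factors, $Y(1)$ and $Y(0)$ the two potential outcomes, and $e(X)$ the probability of treatment given $X$. The difference of means is one possible contrast (a ratio could be used instead), and the requirement $0 < e(X) < 1$ is a standard additional condition that we are assuming.

```latex
\begin{align*}
\Delta &= E[Y(1)] - E[Y(0)]
  && \text{(population-average causal contrast)} \\
\bigl(Y(1), Y(0)\bigr) &\perp T \mid X
  && \text{(no unmeasured confounders)} \\
e(X) &= P(T = 1 \mid X), \quad 0 < e(X) < 1
  && \text{(probability of ``heads'')} \\
E[Y(t)] &= E_X\bigl[\, E[Y \mid T = t, X] \,\bigr], \quad t \in \{0, 1\}
  && \text{(what the observed data identify)}
\end{align*}
```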
The propensity score for an individual is the conditional probability of receiving treatment given the individual’s measured risk factors. Under the assumption of no unmeasured confounders, it has been shown that, within levels of propensity scores, the distribution of risk factors is the same for treated and untreated individuals. That is, within levels of propensity scores, there is no confounding. The data are analyzed by separating patients into five to ten strata based on the values of their propensity scores. Within each stratum, the mean outcomes for treated and untreated individuals are computed. Then, overall means for treated and untreated individuals are computed by taking a weighted average of the strata-specific means, where the strata-specific weights are the proportion of individuals in that stratum. Contrasts between the overall means for treated and untreated individuals are an estimator of the causal effect of treatment.
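The stratified analysis just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the propensity scores are taken as already estimated, strata are formed by splitting the score-ordered patients into groups of near-equal size, and the function name `stratified_effect` and all data values are hypothetical.

```python
from statistics import mean


def stratified_effect(records, n_strata=5):
    """Estimate a treated-minus-untreated difference in mean outcome.

    records: list of (propensity_score, treated, outcome) tuples, where
    treated is 1 or 0 and the propensity scores are assumed to be given.
    """
    ordered = sorted(records, key=lambda rec: rec[0])
    n = len(ordered)
    # Split the score-ordered patients into n_strata groups of near-equal size.
    strata = [[] for _ in range(n_strata)]
    for i, rec in enumerate(ordered):
        strata[i * n_strata // n].append(rec)

    treated_mean = 0.0
    untreated_mean = 0.0
    for stratum in strata:
        treated = [y for _, t, y in stratum if t == 1]
        untreated = [y for _, t, y in stratum if t == 0]
        if not treated or not untreated:
            raise ValueError("every stratum needs treated and untreated patients")
        weight = len(stratum) / n  # proportion of individuals in this stratum
        treated_mean += weight * mean(treated)
        untreated_mean += weight * mean(untreated)
    return treated_mean - untreated_mean


# Hypothetical data: patients with higher scores are treated more often
# and also have higher outcomes regardless of treatment.
records = [
    (0.25, 1, 5.0), (0.25, 0, 3.0), (0.25, 0, 3.0), (0.25, 0, 3.0),
    (0.75, 1, 12.0), (0.75, 1, 12.0), (0.75, 1, 12.0), (0.75, 0, 10.0),
]
crude = mean([y for _, t, y in records if t == 1]) - mean(
    [y for _, t, y in records if t == 0]
)
adjusted = stratified_effect(records, n_strata=2)
```

In this toy data set the unadjusted difference in means is 5.5, while the stratified estimate is 2.0, which is the difference built into each stratum.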
Another approach is to weight the outcomes of treated and untreated individuals by the inverse of the conditional (on risk factors) probability of receiving the treatment they actually received. For treated patients, the weight is the inverse of their propensity scores, and for untreated patients, the weight is the inverse of one minus their propensity scores. Contrasts between the weighted means for treated and untreated individuals are an estimator of the causal effect of treatment. This approach is called “inverse probability of treatment weighting” and was used in the analysis of the NSCOT study (described below).
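A minimal sketch of this weighting scheme follows. It assumes the propensity scores have already been estimated, and it normalizes each arm by the sum of its weights, which is one common choice rather than the only one. The function name `iptw_effect` and the data are hypothetical and do not reproduce the NSCOT analysis.

```python
def iptw_effect(records):
    """Weighted treated-minus-untreated difference in mean outcome.

    records: list of (propensity_score, treated, outcome) tuples, where
    treated is 1 or 0 and the propensity scores are assumed to be given.
    """
    totals = {1: [0.0, 0.0], 0: [0.0, 0.0]}  # arm -> [sum of w*y, sum of w]
    for score, treated, outcome in records:
        if not 0.0 < score < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        # Inverse of the probability of the treatment actually received.
        weight = 1.0 / score if treated == 1 else 1.0 / (1.0 - score)
        totals[treated][0] += weight * outcome
        totals[treated][1] += weight
    if totals[1][1] == 0.0 or totals[0][1] == 0.0:
        raise ValueError("both treated and untreated patients are required")
    treated_mean = totals[1][0] / totals[1][1]
    untreated_mean = totals[0][0] / totals[0][1]
    return treated_mean - untreated_mean


# Hypothetical data: a quarter of low-score and three quarters of
# high-score patients are treated.
records = [
    (0.25, 1, 5.0), (0.25, 0, 3.0), (0.25, 0, 3.0), (0.25, 0, 3.0),
    (0.75, 1, 12.0), (0.75, 1, 12.0), (0.75, 1, 12.0), (0.75, 0, 10.0),
]
effect = iptw_effect(records)
```

Patients who received a treatment that was unlikely for them get large weights, so in practice extreme propensity scores can make the estimate unstable.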
The G-computation method is another approach, which is related to the traditional regression adjustment approach. A regression model for the mean outcome is fit with treatment indicator and risk factors (or just the propensity score) as covariates. Interactions between treatment and risk factors are allowed. For each individual in the study, a mean outcome is predicted under treatment and no treatment by setting the treatment indicator variable to one and zero, respectively, and using the individual’s risk factors. A contrast between the averages of the predicted means under treatment and no treatment is an estimator of the causal effect. In a linear regression model (for a continuous outcome) in which there is no interaction between treatment and risk factors, the coefficient for treatment is an estimator of the causal effect (where the contrast is a difference in means).
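The predict-and-average step can be illustrated with a deliberately simple outcome model. The sketch below assumes a single categorical risk factor and uses a saturated model, in which the fitted mean for each combination of risk group and treatment is just the observed cell mean; this is what a regression with a full treatment-by-risk-factor interaction reduces to in that special case. The function name `g_computation_effect` and the data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def g_computation_effect(records):
    """Average predicted outcome under treatment minus under no treatment.

    records: list of (risk_group, treated, outcome) tuples, treated is 1 or 0.
    """
    cells = defaultdict(list)
    for group, treated, outcome in records:
        cells[(group, treated)].append(outcome)
    # Saturated outcome model: the fitted mean for each (risk group,
    # treatment) cell is the observed cell mean.
    fitted = {cell: mean(values) for cell, values in cells.items()}

    predicted_treated = []
    predicted_untreated = []
    for group, _, _ in records:
        if (group, 1) not in fitted or (group, 0) not in fitted:
            raise ValueError(f"risk group {group!r} lacks one treatment arm")
        # Predict for every individual with the indicator set to 1, then 0.
        predicted_treated.append(fitted[(group, 1)])
        predicted_untreated.append(fitted[(group, 0)])
    return mean(predicted_treated) - mean(predicted_untreated)


# Hypothetical data in which the treatment effect differs by risk group
# (2 in the low-risk group, 4 in the high-risk group).
records = [
    ("low", 1, 5.0), ("low", 0, 3.0), ("low", 0, 3.0), ("low", 0, 3.0),
    ("high", 1, 14.0), ("high", 1, 14.0), ("high", 1, 14.0), ("high", 0, 10.0),
]
effect = g_computation_effect(records)
```

Because the two risk groups are equally large here, the population-average effect is the simple average of the two group-specific effects, 3.0.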
The instrumental variable approach is another way to estimate causal effects from observational data. This approach does not require the assumption of no unmeasured confounders. Rather, it assumes that one has identified an instrumental variable, which is an exogenous factor that is associated with treatment, but whose effect on outcome is completely mediated through the treatment's effect on outcome [9]. A classic example of an instrumental variable is distance, in a setting where distance impacts access to treatment, but distance itself has no effect on outcomes other than through its relationship with treatment. The analysis does not estimate the causal effect of treatment for the entire population, but for the subset of patients for whom the instrument changes the treatment decision.
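For a binary instrument, one standard estimator is the Wald ratio: the difference in mean outcome between instrument groups divided by the difference in treatment uptake between them. The sketch below is an illustration under that assumption, not the method of any study discussed here; the function name `wald_estimate`, the near/far coding of distance, and the data are hypothetical.

```python
from statistics import mean


def wald_estimate(records):
    """Wald ratio estimate for a binary instrument.

    records: list of (instrument, treated, outcome) tuples, where the
    instrument and the treatment indicator are each 1 or 0.
    """
    outcomes = {0: [], 1: []}
    uptake = {0: [], 1: []}
    for instrument, treated, outcome in records:
        outcomes[instrument].append(outcome)
        uptake[instrument].append(treated)
    if not outcomes[0] or not outcomes[1]:
        raise ValueError("both instrument groups must contain patients")
    # How much the instrument shifts the probability of treatment.
    uptake_difference = mean(uptake[1]) - mean(uptake[0])
    if uptake_difference == 0:
        raise ValueError("instrument is not associated with treatment")
    # How much the instrument shifts the mean outcome.
    outcome_difference = mean(outcomes[1]) - mean(outcomes[0])
    return outcome_difference / uptake_difference


# Hypothetical data: instrument = 1 for patients living near a treating
# hospital, 0 for those living far away.
records = [
    (1, 1, 8.0), (1, 1, 8.0), (1, 1, 8.0), (1, 0, 4.0),
    (0, 1, 9.0), (0, 0, 5.0), (0, 0, 5.0), (0, 0, 5.0),
]
effect = wald_estimate(records)
```

Here living nearby raises treatment uptake by 0.5 and the mean outcome by 1.0, giving an estimate of 2.0 for the patients whose treatment is changed by the instrument.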
NSCOT was designed to answer the question of whether regional trauma systems improve patient outcomes. The specific goal of the study was to assess whether there were differences in outcomes associated with treatment at hospitals with a level-I trauma center and hospitals without a trauma center. The main hypothesis was that the risk of death would be lower if all patients had been treated at a trauma center as compared with a hospital without a trauma center. A randomized study to answer this question would have been impossible, since no study could feasibly randomize patients to different levels of trauma care. Instead, NSCOT was an observational study conducted in fifteen metropolitan statistical areas (MSAs) in twelve states, which was designed to provide a representative sample of all trauma hospitalizations in these MSAs. Over 5000 patients from sixty-nine hospitals were included in NSCOT, which was a representative sample of all trauma survivors and all hospital deaths within the fifteen studied MSAs [10].
Analytically, NSCOT required investigators to adjust for observable differences between patients treated at trauma centers and those treated at hospitals without a trauma center since there was an expectation that trauma center patients would be more severely injured. NSCOT used the inverse probability of treatment weighting approach described above. After adjustment for differences in the case mix, the overall risk of death was 25% lower at trauma centers compared with nontrauma centers [10]. Further analysis from the study demonstrated that trauma centers are cost-effective [11] and may also help improve outcomes in a subset of the population [12]. Despite using an observational design, the NSCOT study provides the best data available today demonstrating the effectiveness of regional trauma systems.
LEAP was a multicenter prospective longitudinal cohort study of severe lower-extremity trauma in the civilian population of the United States. A number of functional and clinical outcomes were assessed for 601 patients who underwent reconstruction or amputation within three months following severe limb-threatening lower-extremity trauma. While not explicitly framed this way, the main goal of the LEAP study was to assess whether patients undergoing reconstruction would have been better served by undergoing early amputation [13]. Randomization in this study was regarded as both infeasible and unethical, given that patients often had strong preferences regarding their treatment, and there was insufficient clinical evidence on which to develop a case for equipoise. However, it was believed that, because of substantial treatment variation and partially due to these patient preferences, many patients with comparable injury severity profiles would enter distinct treatment pathways.
The causal effect of interest in LEAP is different from the formulation discussed above. This is because it is not useful to conceptualize an amputee’s outcome under salvage, as not all amputees suffer from an injury that is salvageable. Formally, the effect of interest is a contrast between the mean outcome among salvage patients and the mean outcome these patients would have had with an amputation. An estimate of this effect on a difference scale that adjusts for confounding can be computed, as it was in the LEAP study, using a linear regression model for the outcome, with amputation and the propensity score as covariates (without an interaction term). In this case, the estimated amputation coefficient was the estimated causal effect. The other methods discussed above can also be adapted to provide alternative estimates of the causal effect.
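One way to write this estimand and the regression model, in notation that is ours rather than the study's: let $A = 1$ denote amputation and $A = 0$ salvage, let $Y(a)$ be the potential outcome under treatment $a$, and let $e(X)$ be the propensity score for amputation given the measured characteristics $X$. The contrast is written as amputation minus salvage so that its sign matches the amputation coefficient; that direction is our choice.

```latex
\begin{align*}
\Delta_{\text{salvage}} &= E[\,Y(1) \mid A = 0\,] - E[\,Y(0) \mid A = 0\,] \\
E[\,Y \mid A, e(X)\,] &= \beta_0 + \beta_1 A + \beta_2\, e(X)
\end{align*}
```

Here $E[Y(0) \mid A = 0]$ is simply the observed mean outcome among salvage patients, and under this no-interaction model the fitted coefficient $\hat{\beta}_1$ serves as the estimate of $\Delta_{\text{salvage}}$.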
Overall, after adjustment for injury and patient characteristics, the LEAP study found that functional outcomes for salvage patients would have been no better had they undergone amputation [14]. Beyond the main study hypothesis, data from the LEAP study have been successfully used to better characterize the outcomes of lower-extremity trauma patients. LEAP data have been used to examine the roles of factors as varied as time to treatment [15] and smoking on risk of surgical site infection [16], as well as psychological distress [17], pain [18], and patient satisfaction on functional outcomes [19]. The data have been used to carefully characterize costs [10] as well as to investigate physical therapy utilization [20] and complications [21] in this population. Equally important, LEAP data have provided the impetus for the development of currently ongoing interventions and trials. The results of the LEAP study strongly suggest that major improvements in functional outcome require interventions in the early phase of recovery that directly address the patients’ psychosocial needs and assist them in self-managing the multifactorial consequences of injury. These conclusions guided the development of two interventions: the Trauma Survivors Network [22] and the NextSteps program, a self-management intervention based on the principles of cognitive-behavioral theory [23].
METRC was established in September of 2009 with funding from the Department of Defense and the Orthopaedic Extremity Trauma Research Program (OETRP). METRC is a large five-year research effort to develop and conduct clinical trials relevant to the treatment and outcomes of orthopaedic trauma.
METRC consists of a network of clinical centers and one data-coordinating center that will work together with the U.S. Army Institute of Surgical Research and the OETRP to conduct multicenter clinical research studies relevant to the treatment and outcomes of orthopaedic trauma sustained in the military. The overall goal of the consortium is to produce the evidence needed to establish treatment guidelines for the optimal care of the wounded warrior and ultimately improve the clinical, functional, and quality-of-life outcomes of both service members and civilians who sustain high-energy trauma to the extremities.
Despite a strong impetus for the development of randomized trials and substantial funding, observational studies will play a major role in the METRC portfolio. The first rounds of METRC focused on seven key areas of orthopaedic trauma research: internal versus external fixation for bone reconstruction, autograft versus allograft for segmental bone loss, topical antibiotics for the prevention of infections, novel technologies for the diagnosis and prevention of compartment syndrome, limb salvage versus amputation for severe ankle fractures, multimodal pharmacologic perioperative pain management, and collaborative care for the prevention of the negative psychological sequelae of trauma. Of these seven research topics, four include a significant observational component. The design and rationale for the observational component of each of these are summarized in Table I.
Randomization remains the gold standard for increasing internal validity and maximizing the strength of the evidence from clinical studies. However, there are a number of circumstances under which randomization may not be feasible or ethical. A strong observational study can contribute to the body of evidence in a meaningful way and pave the way for stronger randomized designs in the future.
Humphrey LL; Chan BK; Sox HC. Postmenopausal hormone replacement therapy and the primary prevention of cardiovascular disease. Ann Intern Med. 2002;137:273-84.
Benson K; Hartz AJ. A comparison of observational studies and randomized, controlled trials. Am J Ophthalmol. 2000;130:688.
Bhandari M; Richards RR; Sprague S; Schemitsch EH. The quality of reporting of randomized trials in the Journal of Bone and Joint Surgery from 1988 through 2000. J Bone Joint Surg Am. 2002;84:388-96.
Bergstralh EI; Kosanke JL. Technical report 56. Computerized matching of controls. Mayo Foundation; 1995. 4-27.
Rosenbaum PR; Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70:41-55.
Curtis LH; Hammill BG; Eisenstein EL; Kramer JM; Anstrom KJ. Using inverse probability-weighted estimators in comparative effectiveness analyses with observational databases. Med Care. 2007;45(10 Suppl 2):S103-7.
Robins JM; Hernán MA; Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550-60.
Newhouse JP; McClellan M. Econometrics in outcomes research: the use of instrumental variables. Annu Rev Public Health. 1998;19:17-34.
MacKenzie EJ; Rivara FP; Jurkovich GJ; Nathens AB; Frey KP; Egleston BL; Salkever DS; Weir S; Scharfstein DO. The National Study on Costs and Outcomes of Trauma. J Trauma. 2007;63(6 Suppl):S54-67.
MacKenzie EJ; Weir S; Rivara FP; Jurkovich GJ; Nathens AB; Wang W; Scharfstein DO; Salkever DS. The value of trauma center care. J Trauma. 2010;69:1-10.
MacKenzie EJ; Rivara FP; Jurkovich GJ; Nathens AB; Egleston BL; Salkever DS; Frey KP; Scharfstein DO. The impact of trauma-center care on functional outcomes following major lower-limb trauma. J Bone Joint Surg Am. 2008;90:101-9.
MacKenzie EJ; Bosse MJ. Factors influencing outcome following limb-threatening lower limb trauma: lessons learned from the Lower Extremity Assessment Project (LEAP). J Am Acad Orthop Surg. 2006;14(10 Spec No.):S205-10.
Bosse MJ; MacKenzie EJ; Kellam JF; Burgess AR; Webb LX; Swiontkowski MF; Sanders RW; Jones AL; McAndrew MP; Patterson BM; McCarthy ML; Travison TG; Castillo RC. An analysis of outcomes of reconstruction or amputation after leg-threatening injuries. N Engl J Med. 2002;347:1924-31.
Pollak AN. Timing of débridement of open fractures. J Am Acad Orthop Surg. 2006;14(10 Spec No.):S48-51.
Castillo RC; Bosse MJ; MacKenzie EJ; Patterson BM; LEAP Study Group. Impact of smoking on fracture healing and risk of complications in limb-threatening open tibia fractures. J Orthop Trauma. 2005;19:151-7.
McCarthy ML; MacKenzie EJ; Edwin D; Bosse MJ; Castillo RC; Starr A; LEAP Study Group. Psychological distress associated with severe lower-limb injury. J Bone Joint Surg Am. 2003;85:1689-97.
Castillo RC; MacKenzie EJ; Wegener ST; Bosse MJ; LEAP Study Group. Prevalence of chronic pain seven years following limb threatening lower extremity trauma. Pain. 2006;124:321-9.
O’Toole RV; Castillo RC; Pollak AN; MacKenzie EJ; Bosse MJ; LEAP Study Group. Surgeons and their patients disagree regarding cosmetic and overall outcomes after surgery for high-energy lower extremity trauma. J Orthop Trauma. 2009;23:716-23.
Castillo RC; MacKenzie EJ; Webb LX; Bosse MJ; Avery J; LEAP Study Group. Use and perceived need of physical therapy following severe lower-extremity trauma. Arch Phys Med Rehabil. 2005;86:1722-8.
Harris AM; Althausen PL; Kellam J; Bosse MJ; Castillo R; Lower Extremity Assessment Project (LEAP) Study Group. Complications following limb-threatening lower extremity trauma. J Orthop Trauma. 2009;23:1-6.
Bradford AN; Castillo RC; Carlini AR; Wegener ST; Teter H Jr; MacKenzie EJ. The trauma survivors network: Survive. Connect. Rebuild. J Trauma. 2011;70:1557-60.
Lorig KR; Holman H. Self-management education: history, definition, outcomes, and mechanisms. Ann Behav Med. 2003;26:1-7.