In sound surgical practice, decisions about patient care are made by integrating the specific clinical circumstances, the values and preferences of the patients, and the best available research evidence. Research evidence can be derived through various routes, including physiological experiments, individual observation and expert opinion, observational studies, randomized controlled trials, systematic reviews, and meta-analyses. It is therefore important for a clinician to understand what is considered "higher-quality" evidence before applying it to his or her practice.
The approach we are describing is evidence-based medicine, a concept introduced in 1991 by Guyatt1 and fleshed out in 1992 by the Evidence-Based Medicine Working Group at McMaster University2. This new paradigm placed less emphasis on expert opinion and unsystematic clinical observations, instead stressing evidence derived from clinical research and emphasizing the need for physicians to critically appraise published results before incorporating the conclusions of research into their own practices.
The relative quality of sources of information, on the basis of study design, was described in 1996 by Sackett et al.3,4 and can be characterized as a hierarchy of evidence (Fig. 1). Increased rigor of design corresponds to a higher position in the hierarchy1 as the quality of the evidence increases and the opportunity for bias and confounding decreases4-6. As a result, randomized controlled trials are considered the gold standard for study design together with their associated systematic reviews and meta-analyses, which combine the results of multiple randomized controlled trials.
In orthopaedics, the terminology collectively called evidence-based orthopaedics has become standard. In fact, both the American volume of The Journal of Bone and Joint Surgery (JBJS) and the Journal of Orthopaedic Trauma, in 2000 and 2003, respectively, introduced the special sections "Evidence-Based Orthopaedics"7 and "Evidence-Based Orthopaedic Trauma."8 Introducing the new section in JBJS, the editors' position was that randomized controlled trials would form the main contribution to evidence-based orthopaedics because they are believed to provide the highest-quality evidence. However, this scope was enlarged in the Journal of Orthopaedic Trauma to include the results of observational studies. The editors wrote that: "Often the best available evidence may come from nonrandomized studies that are important in developing hypotheses for future research."8 The same view was expressed in a primer on evidence-based orthopaedics as: "[There is a] large body of state of the art evidence [that] is derived from observational studies."5
This is the perspective adopted in this supplement and in this paper, in which we attempt to show how observational studies serve as an important source of information for advancing orthopaedic surgery and the knowledge of musculoskeletal disorders. This view echoes the broad principles of evidence-based medicine for informed utilization of all types of evidence in patient care1,3,6. Given the demands of a busy practice and the rapid growth in published literature, surgeons need to develop the ability to assess the validity of all available studies when applying research evidence to their own patients.
In this paper, we discuss where observational studies fit in the hierarchy and provide examples to show why they form a vital source of information for the orthopaedic surgeon.
The purpose of scientific studies is to establish a cause-and-effect relationship among variables. Typically, one or more variables are considered to be independent (in the sense that they are under some control by the investigator) and the others are dependent. For instance, in considering the management of displaced fractures of the clavicle, the independent variable could be the form of treatment (that is, whether to treat surgically or nonsurgically) while the dependent variable could be subsequent development of malunion (we will return to this example later)9.
In any study, the conclusions that are drawn will only be applicable to the population from which the subjects are drawn, and that applicability depends on the manner in which the subjects are selected. If the subjects represent a random sample from a target population, then the conclusions of the study will apply to that population. Conclusions are derived with use of statistical methods that are valid under well-defined assumptions, and we can make the general statement that what distinguishes observational studies from randomized controlled trials is the validity (or lack thereof) of some of these assumptions.
Statistical significance alone does not necessarily imply a cause-and-effect relationship, nor even an association, unless the presence of confounding factors can be eliminated. These are variables that are correlated with both dependent and independent variables. A well-designed trial mitigates the effect of any confounding variables by adhering to three principles of experimental design: control, randomization, and replication. Control refers to the selection of a sample that is as homogeneous as possible with respect to any confounding variables. Randomization refers to the allocation of subjects to groups (different levels of the independent variable) by chance (and possibly with blinding) as opposed to allocation at the discretion of the investigator6. Randomization is used to average out the effect of any confounding variables, known or unknown, that have not been controlled for. Finally, replication refers to the sample size. Typically, the larger the sample, the more accurate is the estimate of error and so the more powerful the procedure. A good design should control for as many variables as possible, randomize treatments among the subjects in order to average out the effect of uncontrolled variables, and, finally, have a large sample size to reduce the estimate of error.
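To make the averaging-out property of randomization concrete, the following is a minimal sketch in Python; it is not drawn from any of the cited trials. It assumes a hypothetical confounder (patient age, simulated uniformly between 20 and 80 years) and compares allocation by chance with an allocation in which the younger half of the sample is deliberately placed in one group. The function names and all numbers are illustrative assumptions.

```python
import random
from statistics import mean


def mean_age_gap(group_a, group_b):
    """Difference in mean age between two groups (a simple imbalance measure)."""
    return mean(group_a) - mean(group_b)


def allocate_randomly(ages, seed=0):
    """Allocate subjects to two equal groups by chance alone."""
    rng = random.Random(seed)
    shuffled = list(ages)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def allocate_by_age(ages):
    """Non-random allocation (hypothetical): the younger half forms one group."""
    ordered = sorted(ages)
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]


# Simulated ages for 10,000 hypothetical subjects (not real patient data)
rng = random.Random(42)
ages = [rng.randint(20, 80) for _ in range(10_000)]

random_gap = mean_age_gap(*allocate_randomly(ages))
biased_gap = mean_age_gap(*allocate_by_age(ages))

print(f"Age gap with random allocation:    {random_gap:+.2f} years")
print(f"Age gap with age-based allocation: {biased_gap:+.2f} years")
```

With chance allocation the two groups should end up with nearly identical mean ages, whereas the age-based allocation leaves a gap of roughly thirty years that could masquerade as a treatment effect.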
A well-designed trial thus accounts for possible bias, which is defined in statistics as a systematic overestimation or underestimation of a variable. Bias may lead to an apparent association when one does not exist. Randomization usually reduces the effect of unintentional bias, but not always, as even a randomized study could be biased if the variables involved subjective recollection by patients or an interpretation by a physician of a borderline response. The latter can be compensated for with use of double-blinded designs in randomized controlled trials6,10.
Randomized controlled trials were designed to find evidence to answer a therapeutic question. There are numerous illustrations of this type of trial; one example that is particularly relevant to orthopaedics is the comparison of low-molecular-weight heparin (enoxaparin) and dose-adjusted warfarin with regard to their efficacy in preventing venous thromboembolic disease following hip11 or knee12 replacement. In both studies, enoxaparin was found to be superior in reducing the occurrence of venous thromboembolism while patients were in the hospital, providing evidence that enoxaparin, and not warfarin, should be prescribed postoperatively.
Since randomized controlled trial designs lie at the top of the pyramid, there is a misconception that the application of evidence-based medicine is equivalent to extracting results only from randomized controlled trials. As a result, many clinicians may limit their searches by publication type to "randomized controlled trials" in order both to reduce the number of articles retrieved in their search set13 and to limit the searches to what they perceive to be of highest quality. In fact, important information can be derived from observational studies, and it is incumbent on us to understand the strengths and weaknesses of nonrandomized designs so that results from those designs will not just be disregarded.
Observational studies are distinguished from randomized controlled trials by the fact that the researcher does not assign the subjects to treatment or control groups; rather, they are included by a nonrandomized method. Within observational studies, there are three main categories: cohort studies, case-control studies, and case series and/or case reports. Other papers in this supplement will deal specifically with design issues for each, so we will briefly define and discuss these types of studies before explaining how they fit in the hierarchy of evidence. We begin at the top level within observational studies and then move down the pyramid.
Cohort Studies
A cohort is a group of individuals identified as having a particular common exposure (note: the word exposure is used generically here; for instance, it may refer to a risk factor, a prognostic factor, or a specific type of surgery). Cohort studies deal with two or more cohorts; in the simplest case, one group has been exposed and the other has not. These groups are followed forward in time and are observed for the outcome(s) of interest (Fig. 2). The rate of outcome occurrence for each cohort can be calculated, and a relative risk is computed (Table I).
The term "observational" connotes that the data were collected without interference, so the assignment of patients to an exposure is not random and may be determined by confounding variables that are highly correlated with the outcome. As a result, an observed significant difference in outcomes may not necessarily be attributable to the choice of treatment alone but to the confounding variable (or to some combination). However, it may still be possible to try to match the groups to limit the bias as much as possible, and a well-designed cohort study can have considerable impact.
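As a minimal sketch of the relative risk calculation, the Python function below assumes a conventional two-by-two layout, since Table I is not reproduced here: a and b are the exposed subjects with and without the outcome, and c and d are the unexposed subjects with and without the outcome. The counts are made up for illustration and do not come from any study cited in this paper.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a two-by-two cohort table.

    Assumed layout (an assumption, not taken from Table I):
        a = exposed with outcome,    b = exposed without outcome
        c = unexposed with outcome,  d = unexposed without outcome
    """
    risk_exposed = a / (a + b)      # outcome rate in the exposed cohort
    risk_unexposed = c / (c + d)    # outcome rate in the unexposed cohort
    return risk_exposed / risk_unexposed


# Hypothetical counts: 30 of 100 exposed and 15 of 100 unexposed have the outcome
rr = relative_risk(a=30, b=70, c=15, d=85)
print(f"Relative risk: {rr:.2f}")  # prints "Relative risk: 2.00"
```

A relative risk above 1 indicates that the outcome was more frequent in the exposed cohort; in an observational design, confounding may still account for part or all of that difference.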
As an example, Moran et al.14 conducted a prospective study to determine whether a delay in surgery affected mortality in patients with hip fracture; they undertook this study because hip fracture is associated with a high mortality rate in the elderly and because surgery within twenty-four hours of admission has been recommended15. The study comprised two main cohorts: patients who had surgery within a day of admission, and patients whose operation was delayed, generally because operating-room time was unavailable. Many other variables were recorded, including preinjury mobility, as these factors contribute to the observed outcomes. The authors found that a delay in surgery of one to four days had no adverse effect on thirty-day mortality but that a delay of more than four days significantly increased mortality after ninety days (p = 0.01) as well as after one year (p = 0.001). In the conclusion of their paper, the authors stated that patients with hip fracture must be given priority to have surgery within four days of admission but that no difference in mortality was observed among delays within the initial four days.
Although this study is observational, the protocol by which the cohorts were chosen approached what may occur in actual hospital practice (since it was made on the basis of available operating-room time), giving the paper good external validity. Care was taken to exclude patients who had confounding variables; patients judged unfit for surgery within twenty-four hours because of injury, anesthetic risk, or acute medical comorbidity were excluded as they were expected to have poorer outcomes and so would inflate the risk rate in the delayed group. So, this particular cohort study approaches a randomized controlled trial with regard to quality. In fact, it would be difficult or unethical to carry out a randomized controlled trial to answer this question because neither the patients nor their physicians would want to delay treatment for the sake of a trial.
Case-Control Studies
Sometimes practical considerations prevent the use of a randomized controlled trial or a cohort design, and an investigator may consider the use of a case-control design instead. A case-control design begins with a group of individuals who have experienced a particular outcome (the dependent variable). This study group is then matched with a control group of patients who are similar to the study patients with respect to important known risk factors but have not experienced the outcome (Fig. 3). The two groups are then examined for any relationships between the unmatched (independent) variables and the outcome. This design provides a lower level of evidence than a cohort study because risk probabilities cannot be estimated and because such studies are often carried out retrospectively with use of medical records or patient recall, which may introduce bias due to incomplete or incorrect data.
In the simplest case, the data can also be described with use of the same two-by-two table as used in Table I for a cohort study, but here the risk for each group cannot be estimated; although it is known how many had the outcome of interest, it is unknown how many subjects were in the hypothetical starting sample from which the cases emerged. As a result, the relative risk cannot be computed, but the odds ratio can be calculated as $\mathrm{OR} = \frac{ad}{bc}$, where a, b, c, and d are the outcome entries shown in Table I. When the outcome rate is small in both groups, then the relative risk and odds ratio are approximately the same6.
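The claim that the odds ratio approximates the relative risk when the outcome is rare can be checked numerically. The sketch below assumes the conventional two-by-two layout (a and b: exposed with and without the outcome; c and d: unexposed with and without the outcome), which may not match Table I exactly, and computes the odds ratio as ad/(bc). All counts are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio ad/(bc) from a two-by-two table (assumed layout, see text)."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Relative risk; only meaningful when group totals are known (cohort data)."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical tables with the same twofold difference in outcome rates
common_outcome = dict(a=30, b=70, c=15, d=85)      # 30% versus 15%
rare_outcome = dict(a=30, b=9970, c=15, d=9985)    # 0.30% versus 0.15%

for label, table in [("common", common_outcome), ("rare", rare_outcome)]:
    print(
        f"{label:>6} outcome: OR = {odds_ratio(**table):.3f}, "
        f"RR = {relative_risk(**table):.3f}"
    )
```

In this illustration the relative risk is 2 in both tables, while the odds ratio is about 2.43 when the outcome is common and about 2.00 when it is rare.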
Important situations in which case-control studies are useful arise when the outcome being studied has a low prevalence, which would require an enormous sample size in a cohort study, or when the time needed to observe an outcome is very long. Therefore, this design is used to identify adverse events6,16. As in cohort studies, the main consideration in carrying out case-control studies is to eliminate or reduce confounding and bias during the selection of controls in order to assure both internal and external validity.
For example, a case-control design was used to identify risk factors associated with surgical site infection following spinal surgery17. Patients in whom infection developed were identified retrospectively and then compared and matched with control patients who underwent spinal surgery but did not develop infections. Diabetes, elevated preoperative or postoperative serum glucose levels, and suboptimal timing of prophylactic antibiotics were identified, on the basis of the odds ratios, as significant risk factors. On the basis of this study, the authors recommended that hyperglycemia without the diagnosis of diabetes should be investigated further as a risk factor for infection.
Case Series and Case Reports
Case reports are clinical observations of events that have been observed in a single patient, and case series are a collection of related case reports, often from the same hospital. They may discuss an interesting patient, or, most commonly, an adverse event associated with treatment. These studies are retrospective and without controls; they are merely a detailed description of the medical history of the patient and the observations of the physician. Therefore, it is not possible to ascribe the outcomes to the particular treatment that was administered. There are exceptions to this general rule. Consider a situation in which a patient's course is so predictable that, without treatment, one outcome is virtually certain, yet, with treatment, there is a clear divergence in the final result. It would be very easy to conclude a cause-and-effect relationship but, without use of controls, the reason for the outcome could be some unaccounted-for physical or biological feature of the patient. This is why clinicians should be very wary about drawing conclusions from these types of studies. Nonetheless, such a case report would suggest a potential line of investigation at a more scientific level.
Related to case series and case studies are registries, which are databases of cases that are submitted to a central repository by surgeons, usually through a web-based interface. For hip and knee replacements, such registries already exist in Australia, New Zealand, Canada, Great Britain, and the Scandinavian countries, but currently not in the United States. For example, the Swedish Hip Arthroplasty Register was begun in 1979 to record primary hip arthroplasty outcomes and to identify possible risk factors for poor outcomes, such as patient characteristics, fixation mode, implant type, and surgical technique18. The enormous magnitude of data in a registry allows the use of modern techniques of data mining to identify potential problems at an early stage of an implant's use. Typically, registries have no inclusion or exclusion criteria, and guidelines have been developed for ensuring that the data are as complete as possible. In the future, registries may form an important source of information, especially as surgeons rely increasingly on information gleaned from the Internet in an effort to improve their own practice by comparing their results with those of other surgeons on a large-scale basis. Within the hierarchy of evidence, registry studies can be grouped with case series, and sometimes just above them.
Observational studies that are designed well can limit the bias and confounding associated with nonrandomization and provide valuable information. As mentioned above, randomized controlled trials are best suited for answering a therapeutic question. But there are other types of questions for which answers are needed for the treatment of patients—questions concerning the etiology of a disease, its natural history, the identification of prognostic factors, and the possibility of adverse treatment effects. For most of these questions, it is difficult to design a randomized controlled trial to collect quality data.
Ideally, it would be preferable to use a randomized controlled trial design, rather than other designs, to answer a scientific question, but in the real world this is just not practical. As we will show, the biggest obstacles to carrying out a randomized controlled trial are ethical considerations, practicality, sample size, and cost.
Ethics
Because of the ethical dilemma of assigning patients to potential risk factors that will likely lead to a worse outcome, it is uncommon to use randomized controlled trials for the purpose of determining the risk associated with prognostic factors. Similarly, it is unethical to randomize treatment groups to management that may be harmful. For example, it would be wrong to randomize patients to an unhealthy or healthy diet to examine the effect on fracture-healing.
Consider the effect of smoking on the healing of open tibial fractures. In a number of the studies cited by Castillo et al.19, the authors suspected that there was a positive correlation between smoking and healing complications, but none had looked prospectively at how smoking affected time to union. These authors carried out a prospective, multicenter trial on patients who had unilateral open tibial fractures and who were at risk of needing amputation. They found that current and previous smokers were 37% and 32% (p = 0.01 and p = 0.04, respectively) less likely to achieve union than nonsmokers and that osteomyelitis was three to four times more likely to develop in smokers than in nonsmokers.
It would be impossible to design a randomized controlled trial to determine the effect of smoking on fracture-healing because we cannot force patients to smoke or to refrain from smoking. Even if somehow patients would agree to this, for instance, in exchange for monetary compensation, it is highly unlikely that an ethics review board would sanction such a trial.
Another example of a study question that would be unethical to address with a randomized trial is the effect of inhaled corticosteroids on the subsequent risk of fracture in children and adolescents. This type of study poses a challenge because an investigator would not be able to assign symptomatic children in need of medication to the control group. Similarly, due to the possible negative side effects associated with use of corticosteroids, it would not be right to expose children to inhaled steroids if they had no underlying lung disease. This question has been broached through case-control20 and retrospective cohort21 studies, the results of which indicate that there is an association between inhaled steroids and fracture risk, but this association is more likely due to the underlying lung disease than to the corticosteroids directly.
Sample Size
When one is dealing with infrequent outcomes or rare adverse events, a randomized controlled trial would require a very large sample size to detect any significant differences among treatments. There is a practical prohibition against carrying out randomized controlled trials in such circumstances because they would require an enormous amount of funding and efficient coordination among the staff involved with the study. Although there are examples of large-scale randomized controlled trials, these are typically drug trials22-24, which are more likely to have financial support from pharmaceutical companies25. Even trying to carry out a nonrandomized prospective study poses the same difficulties with regard to sample size.
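A rough calculation shows how quickly the required sample grows as the outcome becomes rarer. The sketch below uses a standard normal-approximation formula for comparing two proportions with equal group sizes; the two-sided significance level of 0.05, the power of 80%, and the outcome rates are illustrative assumptions and are not taken from any study cited here.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect p1 versus p2.

    Normal approximation with unpooled variances:
        n = (z_{1-alpha/2} + z_{power})^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


# Hypothetical scenarios: the same halving of risk, at two baseline rates
common = sample_size_per_group(0.20, 0.10)  # outcome in 20% versus 10%
rare = sample_size_per_group(0.02, 0.01)    # outcome in 2% versus 1%

print(f"Common outcome: about {common} subjects per group")
print(f"Rare outcome:   about {rare} subjects per group")
```

Under these assumptions, halving a 20% event rate requires roughly 200 subjects per group, whereas halving a 2% event rate requires more than 2000 per group, which illustrates why rare outcomes are often studied with retrospective designs instead.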
On the other hand, there may be existing observational study data, possibly retrospective, that could provide an alternative analysis when a prospective randomized controlled trial cannot be carried out. These data may be ideally suited to a retrospective case-control study, in which patients are identified because they have the rare outcome that is sought.
Long-Term Outcomes and Prognosis
Similarly, to study the long-term outcomes of a surgical procedure, one would need to plan in advance for years of follow-up. Although possible, this is not always practical, considering the time, resources, and cost involved. In such instances, retrospective studies may be an alternative.
Long-term outcome and prognosis are particularly important in orthopaedics because implants such as hip and knee prostheses now last many years26,27. Such data naturally arise from observational studies rather than randomization. For instance, Berry et al. examined the twenty-five-year survivorship of Charnley total hip arthroplasties in 1689 patients at one institution26. They looked at time to revision or removal, and found that age, sex, and underlying diagnosis all affected the likelihood of long-term survivorship of both the acetabular and femoral components. Likewise, Callaghan et al. have been following patients with Charnley hip prostheses for more than thirty years, publishing updates every five years27. Surely, their conclusions should not be discounted merely because the study is nonrandomized and unblinded.
Other Problems with Randomization
The main strength of randomization is to protect against bias and confounding. It is for this reason that the U.S. Food and Drug Administration requires documentation of the effectiveness of a drug on the basis of the results of randomized controlled trials. But as Atkins28 pointed out, knowledge of an intervention may not be sufficient to decide whether a physician should apply the intervention to an individual patient. He stated that such evidence only answers the first of three questions posed by Cochrane: Can it work? Will it work? Is it worth it?29 The properties of a carefully designed randomized controlled trial, with selection criteria and execution under ideal conditions in high-quality sites with careful monitoring, all designed to reduce variation and eliminate confounding, may produce a sample that is so different from the pool of patients typically seen in practice as to compromise the external validity of the conclusions. Thus, the very objectiveness of a randomized controlled trial may undermine its applicability to individual situations that require attention to the patients' beliefs and wishes, or the clinicians' attitudes30.
Another barrier to randomization results from the randomization itself. Consider a randomized controlled trial in which a patient who enters the trial is randomized to a particular surgical procedure, but the patient's surgeon prefers another procedure and is not entirely comfortable with the one that has been assigned. A valid comparison would also require randomization of the surgeon, but in practice this is not possible because surgeons would have to agree to perform whatever procedure they were randomized to. A surgeon, who may have taken care of a patient for many years, may believe that the treatment assigned will do more harm and may therefore refuse to perform it. As a result, such a patient, although a potential subject, would be excluded from the study. This would introduce a bias, the direction of which is open to speculation. Likewise, a patient may not be compliant. For example, if a researcher were interested in comparing intensive postoperative rehabilitation with moderate postoperative rehabilitation, a patient randomized to the intensive arm would need to be sufficiently motivated. It is apparent, then, that the perceived effectiveness of an intervention depends on the active participation and consent of patients and surgeons alike, and this may be a practical barrier.
A key difference between nonsurgical and surgical studies with respect to randomized controlled trials involves blinding. When examining the therapeutic value of a drug, both physician and patient can be blinded to knowledge of placebo or active treatment. This is rather difficult to achieve in surgery, because the patient needs to give informed consent for a particular type of procedure, meaning that he or she would know what the surgeon is doing, and obviously the surgeon will know what procedure he or she is carrying out. However, it is possible to blind outcome assessors subsequent to surgery, if someone other than the operating surgeon is the assessor.