Determining whether one thing causes another is a challenging task. The randomized controlled trial is widely accepted as the most definitive research design for examining the effectiveness of treatments1. By controlling the timing or the amount of the intervention, or which subjects receive the intervention and which do not, the researcher minimizes the chance that factors outside of his or her control could have affected the results2,3. An advantage of a randomized controlled trial over other study designs is that it provides the strongest basis for establishing a cause and effect relationship and limits sources of systematic error2.
In many situations, use of a randomized controlled trial is not feasible. The next strongest source of evidence is the cohort study1,2. In a cohort study, the researcher does not control the intervention but rather observes the effects of different interventions on outcomes as they occur naturally following the intervention4. The minimum design feature of the cohort study is the existence of two treatment groups: an active group, and a comparison, or control, group. In a cohort study concerned with the effect of a surgical intervention, the investigator starts with a group of individuals who are apparently free of the outcome of interest. This group, or cohort, consists of individuals who have been given a particular diagnosis and who receive either the experimental intervention, an alternative active intervention (for example, usual care), or no treatment at all. Participants are then followed over the course of time to determine the incidence of the outcome (that is, the proportion of participants in whom the outcome newly develops) in each group.
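The comparison of outcome occurrence between the two groups can be illustrated with a small calculation. The following is a minimal sketch, not a method prescribed by the text: it assumes that outcome occurrence is summarized as a cumulative incidence (new cases divided by the number of outcome-free participants at the start of follow-up) and that the groups are compared with a risk ratio. The function names and the counts in the example are hypothetical.

```python
def cumulative_incidence(new_cases: int, at_risk: int) -> float:
    """Proportion of initially outcome-free participants who develop the outcome."""
    if at_risk <= 0:
        raise ValueError("at_risk must be positive")
    if not 0 <= new_cases <= at_risk:
        raise ValueError("new_cases must be between 0 and at_risk")
    return new_cases / at_risk


def risk_ratio(cases_active: int, n_active: int,
               cases_control: int, n_control: int) -> float:
    """Cumulative incidence in the active group divided by that in the control group."""
    control_risk = cumulative_incidence(cases_control, n_control)
    if control_risk == 0:
        raise ValueError("risk ratio is undefined when the control group has no cases")
    return cumulative_incidence(cases_active, n_active) / control_risk


# Hypothetical counts: 10 of 100 patients in the active group and
# 20 of 100 patients in the control group develop the outcome.
example_rr = risk_ratio(10, 100, 20, 100)
```

A risk ratio below 1 in this sketch would suggest a lower incidence in the active group, although, as discussed in the sections that follow, such a difference may also reflect selection bias or confounding.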
In a prospective cohort study, the investigator collects information about the patient and the intervention from the time that the study begins and then identifies new occurrences of the outcome from that time forward. In contrast, in a retrospective cohort study, the investigator collects patient-related and intervention-related information that has been recorded at some time in the past and then determines outcome on the basis of events that have occurred between that time and the present.
Compared with prospective cohort studies, retrospective cohort studies offer the advantage of a relatively short time frame to completion, thus resulting in considerably less cost. The advantage of prospective data collection lies in the nature of the collected data: checklists to establish a diagnosis are operationalized, documentation of the intervention and any co-interventions is standardized, and information on potential confounding factors is specified before the commencement of data collection. These definitions remain constant throughout the course of the study. In retrospective studies, the definitions of symptoms of disease may have been modified over time, units of measurement may have changed, old methods for diagnosis may have been replaced, and follow-up times may be inconsistent between patients, thereby resulting in greater variability in the data. Perhaps the most important advantage associated with prospective studies is that they allow a determination of a temporal sequence of events (i.e., a determination of whether the intervention came before the outcome)2,4.
Other advantages of cohort studies are related to their increased feasibility as compared with the feasibility of randomized controlled trials. For example, cohort studies do not usually interfere with the regular clinical diagnostic and treatment decision-making process. Clinicians are not asked to compromise their clinical judgment and can treat patients as they would usually treat them. Further, clinicians are not asked to provide treatments that, in their opinion, they have not mastered, thus reducing performance bias. In addition, some patients do not feel comfortable consenting to participate in studies that leave the choice of the treatment to random chance. These patients are perhaps more likely to consent to participate in a nonrandomized study. Thus, from the standpoint of clinician buy-in as well as patient buy-in, a cohort study may be a more feasible design.
Cohort studies are usually substantially less demanding than randomized controlled trials in terms of the logistics of coordinating them. For example, in randomized studies, treatment allocation often does not occur until final eligibility can be determined when the patient is in the operating theater, which necessitates that both interventions (experimental and control) are available and ready for use. In a cohort study, however, treatment assignment is determined in advance of the procedure.
Disadvantages of the cohort design include threats to internal validity. The following sections discuss design features that can help to reduce the potential influence of these threats.
Select an Appropriate Comparison Group
One of the most difficult aspects of properly designing a cohort study is selecting an appropriate control group4. It is of utmost importance to avoid creating a selection bias. Selection bias results in an erroneous conclusion due to systematic differences in characteristics between the patients who receive the experimental intervention and those who do not2,4. To achieve the most accurate estimation about the association between outcomes following two competing interventions, data from all patients undergoing the procedures of interest should be included. Practical considerations dictate that investigators will be able to achieve follow-up of only a small proportion or sample of these patients; however, if the sample was selected appropriately, the estimates of the effect of treatment should closely represent the true difference in effect.
There are two common types of comparison groups: historical and concurrent. In a cohort study in which a historical comparison group is used, the investigators acquire data from a current group of patients and compare them with pre-existing data from a past group of patients. These data may come from a previous study, a registry, or published studies in the literature. Unfortunately, most historical control groups are compromised, usually because of changes in environmental factors over time, including clinical factors such as changes in nursing practices, rehabilitation protocols, or adjunct medications that may influence the outcome of that group.
If a population is sampled and participants are classified according to the intervention they received, then the natural comparison group is a group of patients from the same population who did not receive that intervention. For example, in a study comparing anterior cruciate ligament reconstruction with nonoperative treatment, a control group would consist of patients who elected to forgo anterior cruciate ligament reconstruction but who had an anterior cruciate ligament-deficient knee and prognostic characteristics (e.g., age, sport participation, and concomitant injury) similar to those of the patients who underwent the procedure.
Such a comparison group does not always exist; for instance, it may be difficult to assemble an appropriate and reasonably sized cohort of subjects who elect the nonoperative approach following anterior cruciate ligament rupture. The comparability of one intervention group with the other can be judged only after careful evaluation of characteristics that might affect the likelihood of the outcome. The possibility of self-selection or physician-selection into the intervention group always needs to be considered. Comparing the success rates of improperly selected groups may lead to erroneous conclusions (selection bias), because investigators will conclude that the difference between the groups is attributable to the intervention rather than to the factors that originally placed the subject in one group and not the other (see the topic of confounding, below). It is possible that the reason for selecting a patient for an intervention is exactly why that intervention has worked for that patient. Using the intervention more broadly, as may occur in a randomized controlled trial when patients are randomized to the intervention group, may then yield data that reflect a reduction in the effectiveness of that intervention.
Confirm Outcome-Free Status
At the beginning of the study, the investigators must determine (as best they can) that participants in both groups are free of the outcome of interest4. The most rigorous method of making this determination is to provide objective criteria that define the presence or absence of the outcome and then to have a skilled evaluator apply the criteria to each potential participant. These criteria must be applied systematically to each participant. For example, one might want to compare two interventions for prevention of anterior cruciate ligament rupture in an athletic population. A necessary step in designing this study would be to first confirm that each participant has an intact anterior cruciate ligament before he or she receives any intervention. To do this, one would need to define criteria for determining the integrity of the ligament. The most objective method of making this determination would be to ask each participant to undergo a diagnostic arthroscopy or magnetic resonance imaging study. Because arthroscopy is likely to be the most impractical method in terms of gaining ethics approval or consent from participants, and because magnetic resonance imaging requires a substantial budget, a more reasonable (albeit more subjective) set of criteria based on physical examination tests such as the Lachman or pivot-shift test might be used to assess the integrity of the anterior cruciate ligament.
Determine the Method of Measuring Outcome
Methods for determining which patients have the outcome of interest can vary substantially, depending on the disease being studied and the resources available to measure outcome. Accordingly, diagnostic criteria should be established before the study begins. No matter what method is selected, procedures for determining outcomes must be comparable between the groups4.
Bias is the systematic tendency to produce an outcome that differs from the underlying truth6. Human behavior is influenced by what we know or believe. An objective outcome is one that is independent of opinion or bias7. A subjective outcome is everything else. In research there is a particular risk that expectation will influence findings, most obviously when there is subjectivity in the assessment. In terms of classifying outcome measures as subjective or objective, it can be quite challenging to identify when the interpretation of an outcome could be biased and to what extent it is vulnerable to influence.
Considering that almost all determinations regarding outcome are associated with some degree of subjectivity, outcome assessment is least likely to be biased when participants (that is, patients, clinicians, data collectors, and biostatisticians) in the study are unaware of or blinded to which intervention the patient received4,8. In a cohort study, it is often feasible and of great methodological importance to blind the data collectors so that any conscious or unconscious opinion held by the evaluator about the association between the outcome and intervention does not influence the assessment of outcomes. If data collectors are aware of group membership, they could be more alert or attentive to signs of improvement or deterioration, perhaps being more diligent when looking for the outcome in patients who received the intervention or vice versa.
In addition, no matter what the study design is, when researchers are selecting instruments with which to measure outcomes, they should look for existing evidence that the instruments have good measurement properties (i.e., validity and reliability)9-11. If the objective of the research is to track change over time, the instrument selected should also have shown evidence of its ability to detect changes that are important to patients (i.e., it should possess sensitivity to change and responsiveness)9,11. The assessment of the measurement properties of an instrument should not be restricted to patient-reported outcomes; it is necessary when selecting any outcome (for example, range of motion or imaging)10.
In studies that make use of a historical control group, researchers must depend on existing records to ascertain outcomes and confounders. A considerable disadvantage of retrospective cohort studies is that the information collected by different clinicians is unlikely to be standardized across institutions or even across practices. Thus, diagnostic criteria may vary from one clinician to another, and some records will be more complete than others. All of these issues will contribute to the variability or so-called noise in the dataset, making it more difficult to detect possible differences between groups. Further, if data regarding important confounders are missing, statistical adjustment is not possible. In addition, if a high proportion of patients are excluded because their files are incomplete, the external validity or applicability of the results may be threatened and, if the completeness of files is related to the outcome, a biased estimate of treatment effect may result.
Reduce the Effect of Confounders
A confounder is an intervention-associated variable that is a risk factor for the outcome of interest; such a variable distorts the effect of the intervention on the outcome6. The result of confounding can be an apparent association between the intervention and the observed outcome when in fact no such association exists4,6,12. In a randomized controlled trial, researchers hope that randomization will balance the groups in terms of the confounder. In a cohort study, where randomization is not part of the design, researchers have several options. The most common design approach to reduce the effect of confounding is exclusion12. By restricting the eligibility criteria, researchers may achieve a more homogeneous sample. For example, if only subjects with a certain type of fracture are enrolled, the confounding effect of fracture severity may be avoided. The more restrictive the eligibility criteria, however, the more difficult study recruitment becomes and the more the external validity or applicability of the results is reduced.
Another option is for researchers to match individual patients according to the confounding variable. Matching is defined as the process of making two groups comparable with respect to extraneous factors. This is relevant for retrospective designs. Several kinds of matching have been described12. Caliper matching refers to matching subjects within a specified interval for a continuous variable (for example, body mass index within two points). At the level of the group, matching refers to selecting a control group that has certain characteristics that, when aggregated, are similar to the aggregated characteristics of the experimental group. For example, the control group may consist of patients who are drawn from the same hospital but who have a different diagnosis, patients who are drawn from the same neighborhood, or patients who work at similar jobs.
Matching at the individual level refers to creating pairs of experimental and control subjects who are as similar as possible in terms of certain key variables, such as age, sex, or diagnosis. For example, if a researcher were to perform a study to compare the effectiveness of surgical treatment of acute Achilles tendon ruptures as compared with nonsurgical treatment of the same, he or she might elect to exclude diabetic patients (because they are believed to be a unique subgroup within the population) and match eligible participants by age (within five years) and activity level. The researcher could then further adjust for age at the time of analysis. Taking these steps reduces the chance that the observed differences in outcome between groups are the result of between-group differences in age and activity level.
The purpose of matching is to eliminate the effect of the extraneous variables on the estimate of the differences between groups for the outcome of interest. If the two groups are matched by age, for example, then any difference in outcome between the groups in that study cannot be attributed to a difference in age.
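The individual-level matching described above can be sketched in code. The following is a minimal illustration only, assuming a greedy one-to-one match in which each treated patient is paired with the unused control of the same activity level who is closest in age, within a caliper of five years (as in the Achilles tendon example). The `Patient` class, the `caliper_match` function, and the greedy strategy are assumptions introduced here for illustration; other matching algorithms exist and may be preferable.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Patient:
    pid: str        # hypothetical patient identifier
    age: float      # age in years
    activity: str   # activity level category, e.g. "high" or "low"


def caliper_match(treated: List[Patient], controls: List[Patient],
                  caliper_years: float = 5.0) -> List[Tuple[Patient, Patient]]:
    """Greedy 1:1 matching on exact activity level and age within a caliper."""
    available = list(controls)  # copy so the caller's list is not modified
    pairs = []
    for t in treated:
        candidates = [c for c in available
                      if c.activity == t.activity
                      and abs(c.age - t.age) <= caliper_years]
        if not candidates:
            continue  # no acceptable control; this treated patient stays unmatched
        best = min(candidates, key=lambda c: abs(c.age - t.age))
        pairs.append((t, best))
        available.remove(best)  # each control is used at most once
    return pairs
```

Treated patients for whom no acceptable control exists are left unmatched in this sketch, which illustrates the logistical cost of matching: the more characteristics and the narrower the caliper, the more patients are likely to be dropped.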
Although the theory behind matching makes sense, it can be challenging from a logistical standpoint, especially when trying to match on more than two characteristics. Perhaps the most common option for dealing with variables that may be confounding is to try to account for the differences between groups statistically, usually with use of Mantel-Haenszel and/or regression methods.
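As an illustration of statistical adjustment, the Mantel-Haenszel approach pools stratum-specific 2 × 2 tables into a single summary estimate. The sketch below computes the standard Mantel-Haenszel pooled odds ratio, sum(a·d/n) divided by sum(b·c/n), where within each stratum a and b are the numbers of treated patients with and without the outcome, c and d are the corresponding numbers of control patients, and n is the stratum total. The function names and the counts (two strata of a hypothetical severity variable) are invented for illustration and are not data from any study.

```python
from typing import Sequence, Tuple

# (a, b, c, d) = (treated with outcome, treated without outcome,
#                 control with outcome, control without outcome)
Stratum = Tuple[int, int, int, int]


def crude_odds_ratio(strata: Sequence[Stratum]) -> float:
    """Odds ratio after collapsing all strata into a single 2 x 2 table."""
    a = sum(s[0] for s in strata)
    b = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    d = sum(s[3] for s in strata)
    if b * c == 0:
        raise ValueError("crude odds ratio is undefined for these counts")
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata: Sequence[Stratum]) -> float:
    """Mantel-Haenszel pooled odds ratio across strata of a confounder."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined for these counts")
    return numerator / denominator


# Hypothetical counts: treated patients are mostly in the mild stratum,
# where the outcome is uncommon regardless of treatment.
mild = (10, 90, 5, 45)
severe = (40, 10, 80, 20)
crude = crude_odds_ratio([mild, severe])
adjusted = mantel_haenszel_odds_ratio([mild, severe])
```

With these invented counts, the crude odds ratio is well below 1, whereas the odds ratio within each stratum, and therefore the pooled estimate, equals 1: the apparent treatment effect arises entirely from the unequal distribution of severity between the groups.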
First and foremost, however, it is important to understand that if confounding variables are not accurately measured, they cannot be adequately taken into account in the analysis; hence, all criteria for establishing the properties of instruments to measure outcome also apply to the measurement of confounding variables (i.e., the methods must possess the properties of reliability and validity). Unfortunately, this is not always straightforward. Consider smoking, for example, which is thought to be a confounding variable for the outcome of bone-healing. How do we measure smoking? When is it important to measure smoking? How often should we measure smoking status, or is it important if smoking status changes during the course of study participation?
Guard Against Co-Intervention
Co-intervention is another potential form of confounding. A co-intervention is an intervention that is not given as part of the study but that affects the outcome of interest6. For example, in a study that compares the effectiveness of two nonsteroidal anti-inflammatory drugs for the treatment of arthritis, if some patients also take over-the-counter aspirin, the investigator can no longer be certain which intervention is contributing to the outcome. If the study is prospective, investigators can try to control or specify which co-interventions are permitted. In either a prospective or a retrospective study, investigators can record the use of co-interventions and adjust for their effects in the analysis at the end of the study.
Limit the Proportion of Nonparticipants and Patients Lost to Follow-up
Nonparticipation can create another form of selection bias whereby erroneous conclusions result from systematic differences in characteristics between those who are selected for study and those who are not4. The effect of nonparticipation on measures of association depends on both the size of the group omitted from the study and the specific characteristics of that group. Bias is likely to be greatest when the proportion of nonparticipants is high and when the participants differ greatly from nonparticipants with regard to the likelihood of development of the outcome4.
Losses to follow-up tend to produce the same sorts of biases as nonparticipation. However, because loss to follow-up occurs after entrance into the study, it has a greater potential to be related to outcome than does nonparticipation, which by definition occurs before entrance into the study. One way in which the investigator may be alerted to the possibility of bias resulting from loss to follow-up is through the discovery that the groups differ in terms of the proportion of patients lost to follow-up.
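A simple check for differential loss to follow-up can be sketched as follows. This is one possible approach and is not specified in the text: it assumes that the proportion lost in each group is compared with a pooled two-proportion z statistic, which is a large-sample approximation. The function names and the counts are hypothetical.

```python
import math


def loss_proportion(lost: int, enrolled: int) -> float:
    """Proportion of enrolled patients who were lost to follow-up."""
    if enrolled <= 0:
        raise ValueError("enrolled must be positive")
    if not 0 <= lost <= enrolled:
        raise ValueError("lost must be between 0 and enrolled")
    return lost / enrolled


def two_proportion_z(lost_a: int, n_a: int, lost_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic for the difference in loss to follow-up."""
    p_a = loss_proportion(lost_a, n_a)
    p_b = loss_proportion(lost_b, n_b)
    pooled = (lost_a + lost_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        raise ValueError("z is undefined when no patients or all patients are lost")
    return (p_a - p_b) / standard_error


# Hypothetical counts: 30 of 200 patients lost in one group, 10 of 200 in the other.
z = two_proportion_z(30, 200, 10, 200)
```

A large absolute value of the statistic would only flag that the groups differ in the proportion lost; it would not show whether the losses are related to the outcome, which requires comparing the characteristics of patients who were lost with those of patients who were retained.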