Every research question has an underlying true effect size. Research efforts should aim at obtaining results as close as possible to that truth. As it is not possible to evaluate treatments in the whole target population, we use a sample to derive inferences. Sample requirements are representativeness (i.e., characteristics similar to the target population) to allow generalization and a size large enough to reliably detect or exclude the effect of interest. To conduct a trial that cannot answer the question that it poses is ethically not justifiable and is economically inefficient. Therefore, sample size estimation is critical for any proposed investigation.
The goal of this paper is to discuss the principles of sample size as relevant to clinicians. Consider a hypothetical trial that randomized 500 patients to a new surgical technique (Treatment A) and 500 patients to the standard technique (Treatment B), and assume that, in truth, Treatment A results in a relative risk reduction of 20% compared with Treatment B. At two years of follow-up, sixty patients (12%) in Group A and seventy-five patients (15%) in Group B required reoperation, which yields a p value of 0.19 (Table II). Although a 20% relative risk reduction was observed with Treatment A, the treatment will not be recommended because the null hypothesis of no difference cannot be rejected on the basis of this result. If the investigators had included 2000 patients per group, the same observed 20% relative risk reduction would have resulted in a p value of 0.006 and the new surgical intervention would have been established as favorable (Table II).
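The dependence of the p value on sample size in this hypothetical trial can be checked with a short calculation. The following is a minimal sketch, assuming a pooled two-proportion z-test with a continuity correction; the test actually used for Table II is not stated in this section, so the function name `two_proportion_p_value` and the choice of test are illustrative assumptions. The counts for the larger trial (240 and 300 events) are simply the same 12% and 15% rates applied to 2000 patients per group.

```python
import math

def two_proportion_p_value(events_a, n_a, events_b, n_b, continuity=True):
    """Two-sided p value for a difference between two proportions (pooled z-test)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    diff = abs(p_a - p_b)
    if continuity:
        # Yates-type continuity correction (assumed here); never let it go below zero
        diff = max(0.0, diff - 0.5 * (1 / n_a + 1 / n_b))
    z = diff / se
    # erfc(z / sqrt(2)) is the two-sided tail area of the standard normal distribution
    return math.erfc(z / math.sqrt(2))

# 500 per group: 60 vs 75 reoperations (12% vs 15%)
p_small = two_proportion_p_value(60, 500, 75, 500)
# 2000 per group with the same observed rates: 240 vs 300 reoperations
p_large = two_proportion_p_value(240, 2000, 300, 2000)

print(f"500 per group:  p = {p_small:.3f}")
print(f"2000 per group: p = {p_large:.3f}")
```

Under these assumptions the results land close to the reported values of 0.19 and 0.006. An uncorrected or exact test would give slightly different numbers, but the conclusion is the same: an identical observed effect is nonsignificant with 1000 patients and significant with 4000.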
To understand the sample-size estimation process, it is necessary to consider the following basic concepts:
1. Control Event Rate. As a rule, the higher the control event rate, the smaller the sample size required. A precise estimate of the control event rate through a detailed background analysis is crucial because an inaccurate assumption will result in an underpowered trial if the observed control event rate is lower than assumed.
2. Effect Size or Risk Reduction. Sample-size requirement and effect size have an inverse relation, i.e., the larger the assumed effect size, the smaller the number of patients required. In chronic conditions with multiple pathways, it is unrealistic to expect that any new treatment will have a large effect; as such, plausible relative effect size estimates will range between 10% and 30%, and the assumption of larger effects should be avoided5,10. Even if a new treatment promises only a moderate effect, it may still be clinically important if the disease is frequent. For example, if a disease that is associated with a 10% mortality rate affects 100,000 patients, 10,000 deaths are expected; a new treatment that achieves “only” a 20% relative risk reduction (i.e., a moderate effect) will save 2000 lives.
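The two relations just described (a higher control event rate and a larger assumed effect both lower the required sample size) can be made concrete with a small sweep. This is a minimal sketch, assuming equal allocation, a two-sided alpha of 5%, 80% power, and a common normal-approximation formula for comparing two proportions; the function name `n_per_group` and the grid of values are illustrative assumptions, not part of the original analysis.

```python
import math
from statistics import NormalDist

def n_per_group(cer, rrr, alpha=0.05, power=0.80):
    """Approximate patients per group needed to compare two proportions (two-sided test)."""
    p_c = cer                  # control event rate
    p_t = cer * (1 - rrr)      # treatment event rate implied by the relative risk reduction
    p_bar = (p_c + p_t) / 2    # average event rate, used for the variance under the null
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_c * (1 - p_c) + p_t * (1 - p_t))
    ) ** 2
    return math.ceil(numerator / (p_c - p_t) ** 2)

# Rows: control event rate; columns: relative risk reduction of 10%, 20%, 30%
for cer in (0.05, 0.15, 0.30):
    row = [n_per_group(cer, rrr) for rrr in (0.10, 0.20, 0.30)]
    print(f"control event rate {cer:.0%}: {row}")
```

Reading across a row shows the requirement falling as the assumed relative risk reduction grows; reading down a column shows it falling as the control event rate rises.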
3. Alpha, Beta, and Power. In the example, the new treatment showed a clinically important risk reduction (a 3% absolute risk reduction or a 20% relative risk reduction). However, in the hypothetical trial, in which there were 1000 patients (500 in each group), the estimated p value was not significant (p = 0.19). How should this result be interpreted? Study results may differ from the underlying truth due to (1) systematic error (bias) associated with methodological flaws and (2) chance (random error). The concepts of Type-I and Type-II error (Table III) presented in the following subsections refer to random error only.
3A. Type-I or Alpha Error. If the underlying truth is that Treatment A is not better than Treatment B, then the lack of a significant difference is “real” and the study reached the correct conclusion.
A discrepancy between truth and study results may occur if Treatment A has no benefit in comparison with Treatment B but the study indicates that Treatment A is better. In this situation, the study results represent a “false-positive,” called Type-I or alpha error. By convention, we accept a probability of Type-I error of 5% or less (i.e., results are declared significant when the p value is below 0.05). This means that investigators will accept a 5% chance of false-positive results, i.e., claiming that Treatment A differs from Treatment B when in reality there is no such difference. Sometimes, expensive or complex interventions or therapies with potential harm prompt the investigator to choose a more stringent significance threshold, as low as 0.01 (1%), i.e., to accept a lower probability of false-positive results.
3B. Type-II or Beta Error. The other possible discrepancy between truth and study results arises if Treatment A is better than Treatment B but the study does not show it (i.e., the study results represent a “false-negative”). This situation is called a Type-II error or beta error and it is typically associated with small samples. This was the case in our example: the 1000-patient study results were false-negative, i.e., the observed p value of 0.19 indicated that Treatment A did not differ from Treatment B beyond chance, when in reality Treatment A was beneficial.
The probability of not incurring a Type-II error (in other words, the probability of detecting a difference when a real difference exists) is called power (1 – beta) and strongly depends on the sample size. For sample-size estimation, researchers usually set power at 80% to 95%—that is, they accept a probability of “false-negative” results of 5% to 20%. This choice will depend on the resources required for the implementation of the trial (e.g., a long follow-up, a large sample size, or high cost).
The conventional thresholds for Type-I and Type-II errors differ because a Type-I error (claiming a nonexistent treatment effect) is considered more severe than a Type-II error (not detecting an existing treatment effect); thus, most researchers accept a 5% probability of false-positive conclusions and a 5% to 20% probability of false-negative conclusions. The large number of trials taking place globally and these liberal error rates should encourage researchers to design their trials with use of more stringent thresholds (e.g., a Type-I error of 1% and 90% power).
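To see how strongly power depends on sample size in the running example (15% versus 12% event rates, two-sided alpha of 5%), the power of a trial of a given size can be approximated directly. The sketch below assumes a normal approximation for the comparison of two proportions with equal group sizes; the function name `power_two_proportions` is a hypothetical helper introduced for illustration.

```python
import math
from statistics import NormalDist

def power_two_proportions(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    p_bar = (p_control + p_treatment) / 2
    sd_null = math.sqrt(2 * p_bar * (1 - p_bar))  # spread assumed under no difference
    sd_alt = math.sqrt(
        p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    )  # spread under the assumed true effect
    delta = abs(p_control - p_treatment)
    # Ignores the negligible chance of a significant result in the wrong direction
    z = (delta * math.sqrt(n_per_group) - z_alpha * sd_null) / sd_alt
    return std_normal.cdf(z)

for n in (500, 2000):
    pw = power_two_proportions(0.15, 0.12, n)
    print(f"{n} per group: power = {pw:.0%}, Type-II error = {1 - pw:.0%}")
```

Under these assumptions, the 1000-patient trial has well under a one-in-three chance of detecting the true 20% relative risk reduction, whereas 2000 patients per group brings power close to the conventional 80%.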
4. Sample-Size Requirements. To convey a sense of the sample-size requirements when addressing patient-important outcomes, we will calculate the sample size needed to demonstrate a significant effect in the example (15% control event rate, 20% relative risk reduction through Treatment A) at an alpha of 5%. If we target 80% power, the required total sample size will be 4072 patients (2036 per group). If we set the power at 90%, the sample-size requirement will increase to 5450 patients (2725 per group)11.
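The totals of 4072 and 5450 patients are cited from reference 11, and the formula behind them is not stated in this section. As a sketch, assuming equal allocation, a two-sided test, and a common normal-approximation formula for two proportions (an assumption, not necessarily the method of reference 11), the per-group sample size n can be written with p_C and p_T denoting the control and treatment event rates:

```latex
n = \frac{\left( z_{1-\alpha/2}\sqrt{2\bar{p}(1-\bar{p})}
        + z_{1-\beta}\sqrt{p_C(1-p_C) + p_T(1-p_T)} \right)^{2}}{(p_C - p_T)^{2}},
\qquad \bar{p} = \frac{p_C + p_T}{2}

% Running example: p_C = 0.15, p_T = 0.12, \bar{p} = 0.135, z_{0.975} = 1.95996
% \sqrt{2 \times 0.135 \times 0.865} = 0.48327, \quad \sqrt{0.1275 + 0.1056} = 0.48280

% 80% power (z_{0.80} = 0.84162):
n \approx \frac{(1.95996 \times 0.48327 + 0.84162 \times 0.48280)^{2}}{0.03^{2}}
  \approx 2035.6 \;\Rightarrow\; 2036 \text{ per group, } 4072 \text{ in total}

% 90% power (z_{0.90} = 1.28155):
n \approx \frac{(1.95996 \times 0.48327 + 1.28155 \times 0.48280)^{2}}{0.03^{2}}
  \approx 2724.6 \;\Rightarrow\; 2725 \text{ per group, } 5450 \text{ in total}
```

With these inputs the approximation reproduces both reported totals. Other formulas (for example, with a continuity correction) would give somewhat larger numbers.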
It is important to understand that sample-size requirements vary according to the type of primary outcome selected. In our example, we considered a binary outcome (i.e., the need for reoperation). Patient-important outcomes are typically binary (e.g., death versus survival, infection versus no infection, or returning to work versus not returning to work). However, other types of outcomes may also be of interest, such as quality of life, surrogate outcomes (e.g., bone density or range of motion), or length of hospital stay. These kinds of continuous data result in a smaller estimated sample size in the presence of the same effect size. Sample-size estimation may be challenging for clinical researchers, and a statistician should always be involved early during the trial design.
To minimize the required sample size, the described properties of sample-size estimation based on Type-I and Type-II error may tempt clinical researchers to (1) select surrogate end points rather than patient-important end points, (2) select composite outcomes, and (3) overestimate the control event rate and effect size.
First, surrogate outcomes can be misleading because a treatment may affect the surrogate end point but not result in any improvement in patient-important outcomes, as was the case with bone density and osteoporotic fractures in prior studies12. Patient-important outcomes are underreported in orthopaedic trials, according to a systematic review that included 171 RCTs of patients with fractures13. The outcomes reported involved physiological evidence (e.g., range of motion and strength) in 56% of trials and radiographic evidence (e.g., fracture union and implant placement) in 82% of trials. Patient-important outcomes tended to be assessed less frequently: quality of life in 9% of the studies, return to work in 19% of the studies, need for an additional procedure in 35% of the studies, and mortality in 27% of the studies13.
Second, the selection of a composite outcome to increase the control event rate and thus reduce the sample-size requirement can also raise methodological concerns. To be valid, a composite outcome has to fulfill some conditions (components of similar importance to patients, similar event rates, and similar relative risk reduction based on a common biological rationale for the treatment effect)3.
Third, if researchers assume an implausibly large effect size (i.e., a relative risk reduction of >30%)5, they may incur a Type-II error because they will underestimate the sample-size requirement.
Finally, the sole reliance on sample-size calculation based on Type-I and Type-II errors may be misleading because the randomization of a limited number of patients does not guarantee prognostic balance5 and because a large number of events is required for stable estimates of the effect5,14. In the presence of a limited number of events, treatment effects may undergo large random fluctuations, thus generating false results. To illustrate the number of events that is required to obtain reliable results, we propose viewing small randomized trials as “large randomized trials stopped early” for benefit. In a systematic review in which the results of RCTs stopped early for benefit were compared with the results of nontruncated RCTs, study truncation after fewer than 200 events resulted in a severe overestimation of the treatment effect when compared with completed studies. When the RCTs were stopped after more than 500 events, the overestimation was much less pronounced14. To avoid the play of chance and to obtain stable results in trials that were designed to detect moderate effect sizes, two pivotal articles5,15 were even more stringent and proposed that RCTs should aim at >650 events.
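The event thresholds discussed above can be related to the running example (15% control and 12% treatment event rates) with simple arithmetic. The sketch below assumes equal group sizes and that the observed event rates match the assumed ones; the helper names `expected_events` and `n_per_group_for_events` are hypothetical, and 2036 per group is half of the 4072-patient requirement at 80% power.

```python
import math

def expected_events(n_per_group, control_rate, treatment_rate):
    """Expected total number of events across both groups (equal group sizes)."""
    return n_per_group * (control_rate + treatment_rate)

def n_per_group_for_events(target_events, control_rate, treatment_rate):
    """Patients per group needed to expect at least `target_events` events in total."""
    return math.ceil(target_events / (control_rate + treatment_rate))

# Hypothetical 1000-patient trial: 60 + 75 = 135 events
print(expected_events(500, 0.15, 0.12))
# Design powered at 80% (4072 patients, i.e., 2036 per group under equal allocation)
print(expected_events(2036, 0.15, 0.12))
# Patients per group needed to expect 650 events at these rates
print(n_per_group_for_events(650, 0.15, 0.12))
```

Under these assumptions, even the trial sized for 80% power is expected to accrue about 550 events, short of the >650 events proposed for stable estimates; reaching that target would take roughly 2400 patients per group.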
Examples in which the results of large trials contradicted the results of small trials include the Corticosteroid Randomisation After Significant Head Injury (CRASH) trial16,17. The pooled evidence from small studies suggested that there was a risk reduction for mortality in patients with head injury who were treated with corticosteroids16. The CRASH trial, with a planned sample size of 20,000 patients, was stopped for harm after the enrollment of 10,000 patients because it was found that corticosteroid administration resulted in excess mortality17.
As a guide, it can be assumed that the recruitment of >1000 subjects per treatment group is a reasonable threshold to improve precision in results18. This estimate, however, is not intended to dismiss the use of more limited sample sizes in feasibility or explanatory RCTs that have been designed to understand a specific mechanism and that do not address clinical effectiveness.
If the implementation of large pharmacological trials is demanding, this is true to an even greater extent for surgical trials, which are challenged by difficulties in masking, in performing placebo surgery, and by differential expertise (i.e., the fact that surgeons tend to have higher proficiency in performing only one of the procedures under investigation)19. However, large RCTs in orthopaedic surgery are feasible: the Study to Prospectively Evaluate Reamed Intramedullary Nails in Patients with Tibial Fractures (SPRINT) investigators20 successfully randomized 1319 patients with tibial shaft fracture to reamed versus unreamed intramedullary nailing in twenty-nine centers. The RCT “Sutures Versus Staples for Wound Closure in Orthopaedic Surgery: A Randomized Controlled Trial” (NCT01146236), with a planned sample size of 2560 patients, is currently recruiting; and the RCT “Prospective, Comparative, Randomized Study of GVF Versus Cross-Linked Polyethylene in Total Knee Arthroplasty” (NCT00289133), with a planned sample size of 900 patients, is expected to complete follow-up by June 2012.
Even for the study of highly prevalent diseases like knee and hip osteoarthrosis, the enrollment of thousands of patients is beyond the capacity of any single institution and of some countries; as such, the implementation of large RCTs requires the development of international research networks.
In conclusion, orthopaedic RCTs are frequently affected by small sample size. Consequently, imprecision makes existing evidence unreliable. Inadequately powered trials are misleading and represent a waste of resources. The burden of orthopaedic diseases justifies the challenge to perform large RCTs intended to obtain high-quality evidence.