The equivalence trial differs from the difference trial in its goal, which is often to establish that one treatment is clinically equivalent to another treatment in terms of a particular outcome (e.g., complications, postoperative patient function, or mortality). Evidence that the interventions are equivalent might be meaningful because the new intervention might possess additional benefits such as lower cost, improved safety, or greater ease of use. Often, the one-sided version of this design (the noninferiority trial) is used to assess whether the new intervention is “at least as good as” the old intervention.
The distinctions between the difference trial and the equivalence trial are summarized in Table I. For the equivalence trial, the null hypothesis is that |Mean1 – Mean2| ≥ d, where d is a prespecified threshold equal to the largest difference that is still considered to be “clinically meaningless.” Deciding on this threshold (d) is a difficult and critically important aspect of the equivalence design. It serves as the operational definition of “the same as.” The alternative hypothesis is that |Mean1 – Mean2| < d, or that the confidence interval for the difference between the means falls within the prespecified threshold.
The power analysis for the equivalence trial is based on d, the prespecified threshold for what is considered to be “clinically meaningless.” The smaller the value of d, the larger the sample size will need to be. In the simplest case, patients are randomized to the two interventions and the outcomes can be compared with use of a confidence interval method or an equivalence test with a prespecified value for α (e.g., 0.05).
There are two basic methods for evaluating the equivalence hypothesis. The first and more complicated method is a “two one-sided tests” (TOST) procedure in which the two null hypotheses are H01: Mean1 – Mean2 ≤ –d and H02: Mean1 – Mean2 ≥ d, and the alternative hypotheses are HA1: Mean1 – Mean2 > –d and HA2: Mean1 – Mean2 < d. If each of the one-sided null hypotheses is rejected with use of a one-sided t test at the prespecified α, then we have support for the equivalence hypothesis that –d < Mean1 – Mean2 < d. Procedures for testing equivalence with use of a TOST procedure have been implemented in some statistical software packages such as SAS (SAS Institute, Cary, North Carolina) and R (R Foundation, Vienna, Austria).
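As a sketch of what the two one-sided t tests compute, assuming independent groups: the symbols t_1, t_2, SE (the standard error of the difference), and ν (the degrees of freedom) are notation introduced here for illustration and do not come from the procedures in any particular software package.

```latex
t_1 = \frac{(\bar{x}_1 - \bar{x}_2) + d}{SE}, \qquad
t_2 = \frac{(\bar{x}_1 - \bar{x}_2) - d}{SE}

\text{Reject } H_{01} \text{ if } t_1 > t_{1-\alpha,\,\nu}, \qquad
\text{reject } H_{02} \text{ if } t_2 < -t_{1-\alpha,\,\nu}
```

Equivalence is supported only when both rejections occur, so the overall p value of the procedure is usually reported as the larger of the two one-sided p values.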
A substantially simpler, and therefore more commonly used, method⁸ involves constructing the confidence interval for the difference between the means and checking whether it falls completely within the interval from –d to d. A properly constructed confidence interval for the parameter of interest allows the researcher to perform the hypothesis testing merely by inspection and gives much richer information regarding the range of plausible values for the difference. However, one must understand how to choose the level for the confidence interval, and misinformation abounds. Here is the simple rule, which is outlined in more detail elsewhere⁹: When the underlying hypotheses involve one-sided tests, as is true for equivalence and noninferiority designs, then the corresponding confidence intervals should be at the (1 – 2α) level. Therefore, if you want to be 95% sure not to commit a type-I error (i.e., α = 0.05), then a 90% confidence interval should be used in the analyses. If you want to be 97.5% sure not to commit a type-I error (i.e., α = 0.025), then a 95% confidence interval should be used. When the underlying hypotheses involve two-sided tests, as is true for most difference designs (e.g., H0: Mean1 – Mean2 = 0) but not for equivalence designs, then the corresponding confidence intervals should be at the (1 – α) level.
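The rule for choosing the confidence level can be written as a small helper. This is a minimal sketch; the function name `confidence_level` and its `one_sided` flag are hypothetical and chosen only for illustration.

```python
def confidence_level(alpha: float, one_sided: bool) -> float:
    """Confidence level that matches a test run at significance level alpha.

    one_sided=True  -> equivalence or noninferiority design: 1 - 2*alpha
    one_sided=False -> conventional two-sided difference design: 1 - alpha
    """
    if not 0 < alpha < 0.5:
        raise ValueError("alpha must lie strictly between 0 and 0.5")
    return 1 - 2 * alpha if one_sided else 1 - alpha


# The three cases mentioned in the text
print(confidence_level(0.05, one_sided=True))    # equivalence at alpha = 0.05
print(confidence_level(0.025, one_sided=True))   # equivalence at alpha = 0.025
print(confidence_level(0.05, one_sided=False))   # difference at alpha = 0.05
```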
Regardless of the α level that is chosen, if the relevant confidence interval for the difference between the means falls entirely within the zone of equivalence (–d to d), the null hypothesis is rejected and the researcher claims this as evidence in support of the alternative hypothesis, which states that the means are equivalent within the threshold d. If the null hypothesis is not rejected, however, the researcher has failed to find evidence of equivalence. This is different from finding that the interventions are significantly different.
Figure 1 helps to illustrate these distinctions. It shows four 95% confidence intervals for the difference between the means of two interventions.
The top 95% confidence interval represents a comparison in which a researcher running a difference trial (with α = 0.05) and analysis would conclude that there is no evidence of a difference between the means because the interval includes zero. However, a researcher running an equivalence trial (with α = 0.025) and analysis with an equivalence threshold of d = 10 would conclude that there is no evidence of equivalence because the interval extends beyond –10 and 10. This scenario exemplifies the problem with failing to find a statistically significant difference with use of a t test and claiming equivalence; an interval that contains both zero (i.e., no difference) and values that represent a clinically meaningful difference (i.e., no equivalence) provides evidence for neither equivalence nor difference. Many small-sample studies in the orthopaedics literature produce similar results.
The second 95% confidence interval represents a comparison in which a researcher running a difference trial and analysis would reject the null hypothesis and claim evidence of a difference between the means because the interval does not include zero. However, a researcher running an equivalence trial and analysis with an equivalence threshold of d = 10 would conclude that there is no evidence of equivalence because the interval extends beyond 10.
For the third 95% confidence interval, a researcher running a difference trial and analysis would fail to reject the null hypothesis because the interval includes zero. A researcher running an equivalence trial and analysis with an equivalence threshold of d = 10 would reject the null hypothesis and claim evidence of equivalence because the interval lies entirely between –10 and 10. Under this scenario, researchers who have used a difference trial analysis, found no evidence of a difference, and erroneously claimed equivalence have just happened to reach the correct conclusions by using the wrong analysis, including the unintentional application of a more stringent α of 0.025 rather than 0.05. If their confidence interval had a larger range that extended above 10 (e.g., the first interval in the figure), their conclusion would have been wrong.
The bottom 95% confidence interval does not contain zero and lies between –10 and 10. The null hypotheses for both the difference and equivalence trials would therefore be rejected. This scenario shows that an effect can be statistically different from zero even though the interventions are clinically equivalent.
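The four scenarios can be mimicked with a short sketch that reads both conclusions off a single confidence interval. The interval limits below are hypothetical values chosen to reproduce the four patterns described; they are not the limits plotted in Figure 1, and the function name `interpret_interval` is likewise an assumption made for illustration.

```python
def interpret_interval(lower: float, upper: float, d: float) -> dict:
    """Read difference and equivalence conclusions from one confidence interval.

    For a 95% interval this corresponds to alpha = 0.05 for the difference
    test and alpha = 0.025 for the equivalence test.
    """
    excludes_zero = not (lower <= 0 <= upper)      # evidence of a difference
    inside_zone = (-d < lower) and (upper < d)     # evidence of equivalence
    return {"difference": excludes_zero, "equivalence": inside_zone}


# Hypothetical intervals, one per scenario, with d = 10
scenarios = {
    "top": (-14, 12),     # neither difference nor equivalence
    "second": (4, 16),    # difference, no equivalence
    "third": (-6, 5),     # equivalence, no difference
    "bottom": (2, 8),     # both difference and equivalence
}
for name, (lo, hi) in scenarios.items():
    print(name, interpret_interval(lo, hi, d=10))
```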
The goal of a noninferiority study is to evaluate whether the result of a new intervention (Mean1) is at least as good as the result of another intervention (Mean2); assuming that larger values are clinically better (e.g., survival, quality of life), the null hypothesis is H0: Mean1 – Mean2 ≤ –d, and the alternative hypothesis is HA: Mean1 – Mean2 > –d. If α is set to 0.05, this analysis can be done by constructing a 90% confidence interval and checking that its lower limit is greater than –d.
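A minimal sketch of this noninferiority check follows, assuming a normal approximation for the confidence limit and that larger outcome values are better. The function name and the numbers in the usage line are hypothetical and serve only to show the mechanics.

```python
from statistics import NormalDist


def noninferiority_check(mean_new, mean_old, se_diff, d, alpha=0.05):
    """Return (lower_limit, noninferior) for the difference mean_new - mean_old.

    lower_limit is the lower end of the (1 - 2*alpha) confidence interval,
    which equals the one-sided (1 - alpha) lower confidence bound.
    """
    z = NormalDist().inv_cdf(1 - alpha)            # about 1.645 for alpha = 0.05
    lower_limit = (mean_new - mean_old) - z * se_diff
    return lower_limit, lower_limit > -d


# Hypothetical data: new mean 70, old mean 72, SE of the difference 3, d = 10
lower, ok = noninferiority_check(70, 72, se_diff=3, d=10)
print(f"lower limit = {lower:.2f}, noninferior: {ok}")
```

Only the lower limit matters here; the upper limit of the interval plays no role in a noninferiority conclusion.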
In the following example of an equivalence trial, researchers compare a novel approach for treating a medical condition with a traditional approach. Since the novel approach has advantages over the traditional approach in terms of cost savings, the goal of the investigators is to determine whether the novel approach is equivalent to the traditional approach in terms of clinical efficacy. The outcome in this example is the patient score on a self-reported health-related quality of life measure with a scale of 0 to 100. Patients with the particular condition being studied are randomized to receive either (1) the novel treatment or (2) the traditional treatment. Clinical outcomes are assessed at a prescribed time point. The researchers decide that the largest difference that would still be clinically meaningless is 10 points on the 100-point quality of life measure (i.e., d = 10). The researchers also use published estimates of population-level scores on the outcome measure to estimate that the standard deviation of scores on the quality of life measure for this patient group should be 10.
The sample size for each group is determined so that the power, or 1 – β, of the study is 0.90; i.e., the probability that the two treatments will be deemed equivalent if they are, in fact, equivalent is 0.90. The sample size calculations can be made with use of a Z-table and a calculator, as shown in Figure 2, or with use of statistical software, such as PASS 2008 (NCSS, Kaysville, Utah) or SAS, that contains routines for power analysis of equivalence tests. The required sample size is determined to be twenty-one per group or forty-two total.
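For readers who want to see the shape of such a calculation, here is a sketch using one common normal-approximation formula for an equivalence test with equal group sizes and a true difference of zero. It is an assumption that this convention resembles the one in Figure 2; the function name is hypothetical, and other conventions (or t-based software routines) can give slightly different answers.

```python
from statistics import NormalDist


def equivalence_n_per_group(sd, d, alpha=0.05, power=0.90):
    """Approximate per-group sample size for a two-group equivalence test.

    Assumes equal group sizes, a common standard deviation sd, and a true
    difference of zero: n = 2 * sd^2 * (z_{1-alpha} + z_{1-beta/2})^2 / d^2.
    """
    z = NormalDist().inv_cdf
    beta = 1 - power
    return 2 * sd**2 * (z(1 - alpha) + z(1 - beta / 2)) ** 2 / d**2


n = equivalence_n_per_group(sd=10, d=10)
print(f"unrounded n per group: {n:.2f}")
```

With the example's inputs (standard deviation 10, d = 10, α = 0.05, power 0.90) this approximation gives an unrounded value between 21 and 22 per group, in the neighborhood of the twenty-one reported. Halving d in this formula quadruples the required sample size, which illustrates why the choice of threshold is so consequential.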
The null hypothesis for the trial is H0: |Meannovel – Meantraditional| ≥ 10. The alternative hypothesis is HA: |Meannovel – Meantraditional| < 10. The trial is conducted and yields findings of Meannovel = 53.8 (standard error = 2.5) and Meantraditional = 56.3 (standard error = 2.5). The difference between the means is therefore –2.5 (i.e., 53.8 – 56.3), and the standard error of the difference is 3.54 (i.e., the square root of [2.5² + 2.5²]). The 90% confidence interval of the difference between the means equals the difference plus and minus 1.65 times the standard error of the difference. In this example, the confidence interval is –2.5 ± 5.83, or –8.33 to 3.33. The null hypothesis is rejected and the alternative equivalence hypothesis is supported because the interval is contained within –10 to 10 and therefore satisfies the prespecified definition of “same.”
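The arithmetic of this example can be retraced in a few lines. This is a sketch that uses the summary statistics quoted in the text and the same rounded multiplier of 1.65; the function name `equivalence_interval` is hypothetical.

```python
import math


def equivalence_interval(mean_a, se_a, mean_b, se_b, d, z=1.65):
    """Return (lower, upper, equivalent) for the difference mean_a - mean_b.

    z = 1.65 is the rounded multiplier for a 90% interval used in the text.
    """
    diff = mean_a - mean_b
    se_diff = math.sqrt(se_a**2 + se_b**2)   # SE of a difference of independent means
    margin = z * se_diff
    lower, upper = diff - margin, diff + margin
    return lower, upper, (-d < lower and upper < d)


# Novel: 53.8 (SE 2.5); traditional: 56.3 (SE 2.5); threshold d = 10
lower, upper, equivalent = equivalence_interval(53.8, 2.5, 56.3, 2.5, d=10)
print(f"90% CI: {lower:.2f} to {upper:.2f}; inside the zone of equivalence: {equivalent}")
```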
Some statistical programs have built-in TOST procedures that test the equivalence of means and calculate a p value. In our example, the associated p value is 0.02. (Data for this example and syntax for the TOST procedure in the R software package are available from the authors.)
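The authors' R syntax is not reproduced here. As a rough cross-check, the sketch below runs the two one-sided tests on the summary statistics from the example, using a normal approximation in place of the t distribution (an assumption made to keep the code free of third-party libraries). The function name `tost_p_value` is hypothetical.

```python
from statistics import NormalDist


def tost_p_value(diff, se_diff, d):
    """Normal-approximation p value for the two one-sided tests (TOST).

    diff is Mean1 - Mean2, se_diff its standard error, d the equivalence
    threshold. The overall p value is the larger of the two one-sided p values.
    """
    norm = NormalDist()
    p_lower = 1 - norm.cdf((diff + d) / se_diff)   # tests H01: diff <= -d
    p_upper = norm.cdf((diff - d) / se_diff)       # tests H02: diff >= d
    return max(p_lower, p_upper)


# Summary statistics from the example: difference -2.5, SE 3.54, d = 10
p = tost_p_value(diff=-2.5, se_diff=3.54, d=10)
print(f"TOST p value (normal approximation): {p:.3f}")
```

Under this approximation the p value comes out just under 0.02, in line with the reported value after rounding; a t-based procedure applied to the raw data would be expected to give a slightly larger figure.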
Disclosure: None of the authors received payments or services, either directly or indirectly (i.e., via his or her institution), from a third party in support of any aspect of this work. One or more of the authors, or his or her institution, has had a financial relationship, in the thirty-six months prior to submission of this work, with an entity in the biomedical arena that could be perceived to influence or have the potential to influence what is written in this work. No author has had any other relationships, or has engaged in any other activities, that could be perceived to influence or have the potential to influence what is written in this work. The complete Disclosures of Potential Conflicts of Interest submitted by authors are always provided with the online version of the article.