Abstract
This article focuses on items to consider when selecting outcome measures for a clinical study. The choice of outcome measures depends largely on the research question and the study design. Sample-size requirements can vary greatly, depending on the type and the number of outcome measures selected. In this paper, we review the differences between categorical and continuous outcomes as well as the differences between primary and secondary outcomes and we discuss the concept of minimally important differences and the problems associated with composite outcomes. We also provide instruction on how to conduct and present a sample-size calculation.
Different types of outcome measures can be used in orthopaedic studies. These measures include radiographic parameters (nonunion and malunion), findings on clinical examination (strength and range of motion), laboratory values (infection), clinical scores, and patient satisfaction. The choice of outcome measure in a study depends largely on the research question and the study design. Sample-size requirements can vary greatly, depending on the type and the number of outcome measures. The purpose of this article is to provide education about outcome parameters and sample-size calculations to readers who are planning to design studies.
Until the 1990s, outcome measures in orthopaedic studies included mainly clinical examination and radiographic findings, such as fracture union, range of motion, strength, and nonvalidated scores. These measures do not necessarily express how the patient perceives the burden of an injury or the results of treatment of that injury, and often they do not provide definitive answers about the effectiveness of a certain treatment. In recent years, these measures have increasingly been complemented by the use of health-related quality-of-life measures. The inclusion of health-related quality-of-life instruments in clinical studies aids in the understanding of patient perspective with regard to treatment results. Health-related quality-of-life instruments typically include multiple domains that are related to the patient's physical, mental, and social well-being. These instruments are questionnaires that include questions regarding the patient's activities of daily living, social interactions, and the patient's satisfaction derived from those activities. Health-related quality-of-life instruments are especially important to include in a study when the goal of the treatment is to improve the patient's ability to function, which is one of the goals of most orthopaedic interventions.
Two main types of health-related quality-of-life instruments exist: (1) generic instruments, and (2) disease-specific instruments. Generic health-related quality-of-life instruments measure the patient's general health status, including physical, mental, and social dimensions. These broad-scope generic questionnaires, such as the Short Form-36 (SF-36)1 and the EuroQol-5 Dimensions2, are particularly useful in comparing health status between patients with different diseases and/or interventions. However, generic instruments are often not sensitive enough to detect small and potentially clinically important differences. On the other hand, disease-specific instruments, such as the Disabilities of the Arm, Shoulder and Hand instrument3, focus on the effect of a disease on a specific aspect of physical, mental, or social well-being and are more likely to detect small differences. Disease-specific instruments are often too focused to allow for the comparison of results between studies of different diseases or often even between different populations with the same disease. Therefore it is advised, and often required by granting agencies, to include both a generic and a disease-specific outcome instrument in the design of a clinical study.
Health-related quality-of-life instruments have to be valid, reliable, and responsive in order to be useful for clinical research. A valid instrument measures what it is supposed to measure (for example, the patient's health status). Validity assessment is a complex issue and requires evidence that shows the extent to which the instrument measures what it was intended to measure in specific patient populations. The types of validity that should be assessed include face, content, and construct validity. A health-related quality-of-life instrument has face validity simply if it appears to measure what it is intended to measure. Investigators should describe the items of an instrument so that readers of an article can get an idea of its face validity. Content validity systematically compares a new instrument to existing established definitions, opinions, and other instruments. Construct validity measures health (i.e., a construct) by measuring aspects that are associated with good health, for example, the patient's ability to work or to interact with family and friends, or the absence of pain.
A reliable instrument provides a consistent measure when it is applied repeatedly, provided the health of the population is stable (test-retest reliability). It is important to distinguish between reliability and validity because an instrument can be reliable but not valid, or it can be valid but not reliable. Lastly, responsiveness refers to the ability of an instrument to detect health-status changes that are important to patients and to reflect those changes in score differences.
A combination of multiple outcome measures is likely to be more informative and able to serve as a basis for future decision-making than a single outcome measure is. However, there is a substantial disadvantage associated with measuring multiple outcome parameters. Comparing multiple outcomes between treatment cohorts results in multiple statistical comparisons. The likelihood of a positive finding with a p value of <0.05 increases with the number of statistical tests performed (i.e., one of ten statistical comparisons is more likely than one of two comparisons to result in a p value of <0.05 just by chance alone). In statistics, the likelihood of a false-positive result that we are willing to accept is expressed as α, the significance level against which the p value is compared. By convention, α is set to 0.05 (5%) if only one outcome parameter is being measured. However, when multiple outcome measures are being assessed, the cutoff p value for what we consider to be a significant difference should be decreased in order to offset the increased likelihood that at least one outcome measure shows a significant difference by chance alone. Typically, a Bonferroni correction4 is applied in order to adjust for multiple testing. The standard cutoff p value of 0.05 is divided by the number of outcome parameters analyzed; for example, when five outcome parameters are being assessed, the cutoff p value for what we consider to be a significant difference becomes 0.01. Unfortunately, however, if we want to detect a difference at a 0.01 significance level, we need a substantially larger sample size compared with that needed for the assessment of an outcome parameter at the standard significance level of 0.05.
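The arithmetic behind this argument can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the family-wise error formula assumes that the statistical tests are independent of one another, which the text does not state and which rarely holds exactly for outcomes measured in the same patients.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false-positive result across n_tests
    comparisons, ASSUMING the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Bonferroni-adjusted cutoff p value: alpha divided by the number of tests."""
    return alpha / n_tests


if __name__ == "__main__":
    for n_tests in (1, 2, 5, 10):
        fwer = familywise_error_rate(0.05, n_tests)
        cutoff = bonferroni_threshold(0.05, n_tests)
        print(f"{n_tests:2d} tests: chance of >=1 false positive = {fwer:.3f}, "
              f"Bonferroni cutoff = {cutoff:.4f}")
```

Under the independence assumption, ten comparisons at 0.05 carry roughly a 40% chance of at least one false-positive finding, which is why the cutoff is lowered.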
In order to avoid increased sample-size requirements, it is advised that investigators define one primary outcome parameter a priori (before the start of the study). For the primary outcome parameter, which is used for the sample-size calculation, a standard significance level of 0.05 is chosen; for all remaining secondary outcome parameters, the Bonferroni correction is applied. Practically speaking, in this example, this means that differences between cohorts in all secondary outcome parameters will only be considered significant if the p value is <0.0125 (when four secondary outcome parameters are investigated).
The reader of an article cannot know for sure whether a certain outcome parameter was determined to be the primary outcome parameter in a truly a priori manner. It is theoretically possible that investigators of a study evaluated multiple outcome parameters and that, after analyzing them, they claimed post hoc that whatever outcome parameter resulted in the lowest p value was the a priori defined outcome parameter. In order to assure readers that outcome parameters were defined before the start of the study (a priori), investigators can register their study protocols before the start of their study in a publicly available trial registry that meets the criteria of the International Committee of Medical Journal Editors (ICMJE) and is sponsored by the United States National Library of Medicine.
Mathematically, most outcome parameters are measured on either a continuous scale or a categorical scale. Continuous outcome parameters are essentially numbers associated with the individual study subjects. On the other hand, as implied by the name, categorical outcomes have multiple categories, one of which is associated with the patient (e.g., excellent, good, fair, or bad outcome). A special case of a categorical outcome is a dichotomous outcome. Dichotomous outcomes have two categories (e.g., union or nonunion, infection or no infection, or revision or no revision). Dichotomous outcomes are expressed for the entire study cohort as a percentage (e.g., a 10% nonunion rate) and are therefore sometimes confused with continuous outcomes. However, each single patient is in one of two categories (e.g., union or nonunion). Sample-size calculations as well as the analysis of continuous and categorical outcomes are entirely different. As demonstrated in the following paragraphs, sample-size requirements for dichotomous outcomes are much higher than for continuous outcomes.
Journal readers and reviewers often focus on statistical significance without questioning whether those differences are clinically relevant. Ultimately, statistically, any difference can be significant if the sample size is large enough, but does that mean that the difference is clinically important? Unfortunately, the answer to this question is very subjective and can often be very arbitrary. When designing a study, after determining the primary outcome parameter that will be used to calculate the sample size, it is essential to postulate the clinically minimally important difference in the chosen outcome parameter. An example would be the question: What differences in failure rates would convince you as the surgeon to choose one treatment over the other? The answer is even more subjective when looking at a continuous outcome measure, such as clinical score (e.g., a health-related quality-of-life instrument), as it is often unknown what score differences are clinically relevant to the patient.
Jaeschke et al. defined a minimally clinically important difference as "the smallest difference in a score of a domain of interest that patients perceive to be beneficial and that would mandate, in the absence of troublesome side effects and excessive costs, a change in the patient's management."5 Minimally important differences in health-related quality-of-life scores have been estimated with use of two methods: (1) an anchor-based approach, and/or (2) a distribution-based approach6-8. When using an anchor-based approach to determine the minimally important difference of an instrument or an instrument domain, the score of interest is correlated with a measure of clinical change, the so-called anchor, in the patient population of interest. The anchor might consist of one or more questions the patient is asked, such as: Has your health become "a lot better," "a little better," "stayed the same," "a little worse," or "a lot worse?"6 Differences in the score of interest for patients that fall into each category can then be used to calculate the minimally important difference. This approach has two drawbacks, however: first, estimating the minimally important difference requires agreements about what constitutes a minimal change in the anchor6; and second, to our knowledge, ultimately established so-called "hard" numbers for the minimally important difference based on this approach currently do not exist for commonly used orthopaedic health-related quality-of-life measures such as the Short Form-36, the Short Musculoskeletal Function Assessment, or the EuroQol-5 Dimensions9-11.
In the alternative "distribution-based" approach to calculating the minimally important difference, differences in scores between treatment groups are expressed as multiples of a measure of distribution, such as a standard deviation; the resulting quantity is known as the standardized mean difference or effect size, as devised by Cohen12. Alternatively, effect sizes can be thought of as the percentile standing of the average experimental participant (e.g., a femoral neck fracture fixed with three screws) relative to the average control participant (e.g., a femoral neck fracture fixed with two screws). Based on a normal distribution of values, which is typically displayed as a Gaussian (bell-shaped) curve, an effect size of 0 indicates that the mean of the experimental group is at the 50th percentile of the control group. Conversely, if the statistical means of both groups are equal, the effect size is 0. An effect size of 0.5 indicates that the difference between two compared groups is 0.5 standard deviations. If the standard deviations are not the same in the compared groups, a pooled standard deviation across both groups is used. Cohen defined effect sizes as small (0.2), medium (0.5), and large (0.8)12. Practically speaking, this means, as a rule of thumb, that for any given clinical score, one-half of its standard deviation represents a medium effect and 0.8 of its standard deviation represents a large effect. Critics of this approach argue that this is a purely mathematical way of looking at the importance of differences, with little clinical meaning. However, there is convincing psychological and empirical evidence to suggest that, in general, one-half of a standard deviation of a continuous outcome constitutes a clinically meaningful difference.
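The effect-size calculation described above can be sketched as follows. The function names and the example numbers (a control mean of 60, an experimental mean of 72, a standard deviation of 24, and fifty patients per group) are assumptions chosen for illustration, and the pooled standard deviation uses the common degrees-of-freedom weighting, which the text does not specify.

```python
import math
from statistics import NormalDist


def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Pooled standard deviation of two groups, weighted by degrees of freedom."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def effect_size(mean_experimental: float, mean_control: float, sd: float) -> float:
    """Difference in group means expressed in standard-deviation units."""
    return (mean_experimental - mean_control) / sd


def percentile_in_control_group(d: float) -> float:
    """Percentile of the control distribution at which the experimental mean
    falls, assuming normally distributed scores."""
    return 100.0 * NormalDist().cdf(d)


if __name__ == "__main__":
    sd = pooled_sd(24.0, 50, 24.0, 50)   # hypothetical group sizes
    d = effect_size(72.0, 60.0, sd)      # hypothetical group means
    print(f"effect size = {d:.2f}")
    print(f"percentile standing = {percentile_in_control_group(d):.1f}")
```

An effect size of 0 maps to the 50th percentile, as stated in the text; larger effect sizes move the average experimental participant further up the control distribution.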
In a systematic review, Norman et al. identified thirty-eight studies with various types of health-related quality-of-life measures as outcome parameters13. An inclusion criterion for the studies was the use of a minimally important difference value for interpretation of the results (or clinical significance, meaningful change, relevant change, important difference, or relevant difference). Norman et al. calculated the effect sizes of thirty-eight studies and compared them with the clinically significant differences that the authors of the respective studies were using. The minimally important difference was close to 0.5 standard deviations in thirty-two of the thirty-eight studies (0.495 ± 0.155)13. In order to explain this striking finding, Norman et al. drew an interesting analogy to Miller's classic 1956 work ("The magic number seven plus or minus two"), which evaluated the ability of individuals to discriminate between multiple categories of various stimuli, such as loudness or saltiness13,14. Subjects were capable of distinguishing between different categories of a particular stimulus until the number of categories reached approximately seven (plus or minus two)14. The "minimally detectable difference" is therefore one unit in seven. So what is the difference between seven and six (or between six and five) measured in standard deviations as a unit? Interestingly, when converting "one part in seven" to standard deviation units, it turns out that it almost equals one-half of a standard deviation, as a distribution seven units wide has a standard deviation of 2.16. A difference can only be clinically important if the patient is able to "feel" the difference. Therefore, the minimally important difference has to be at least as big as the minimally detectable difference. Based on this logic, it appears plausible to use half of a standard deviation of a continuous outcome score as a minimally clinically important difference for a sample-size calculation.
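The conversion of "one part in seven" into standard deviation units can be checked numerically. The sketch below rests on an assumption about how the figure of 2.16 arises: it treats the seven distinguishable categories as the equally spaced values 1 through 7 and takes their sample standard deviation. The text does not spell out this derivation.

```python
import statistics

# ASSUMPTION: seven equally spaced categories, coded 1..7.
categories = list(range(1, 8))

# Sample standard deviation (n - 1 in the denominator).
sd_of_seven = statistics.stdev(categories)

# One category step expressed in standard-deviation units.
one_step_in_sd_units = 1.0 / sd_of_seven

print(f"standard deviation of 1..7 = {sd_of_seven:.2f}")
print(f"one step = {one_step_in_sd_units:.2f} standard deviations")
```

Under this reading, one step corresponds to a little less than one-half of a standard deviation, in line with the argument above.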
In an effort to quantify sample sizes and magnitude of treatment effects, Sung et al. recently completed a review of orthopaedic randomized controlled trials that reported significant findings15. This study found that the mean effect size across studies with continuous outcome variables was 1.7 (95% confidence interval, 1.43 to 1.97)15. Smaller numbers of total outcome events in these studies correlated strongly with a larger magnitude of the treatment effect. Because most orthopaedic studies have relatively small sample sizes, their conclusions are often based on an overestimation of the treatment effect and might therefore not be valid.
Before embarking on a sample-size calculation, the investigator has to determine what difference he or she considers to be clinically relevant. Imagine the following scenario: You typically treat displaced femoral neck fractures with use of two cancellous screws. Your failure rate is approximately 30%, which is similar to failure rates reported in the literature. Your outcome parameter is dichotomous: the fixation either fails or it is successful. You wonder whether or not use of a third screw would decrease the failure rate, which you define as revision surgery. The key question to consider is what decrease in failure rate would convince you to use a third screw? In this example, let us consider that a 5% decrease from 30% to 25% would be clinically relevant to you and would convince you to use a third screw. Assuming that revision surgery is your primary outcome parameter, you are willing to accept a standard false-positive rate of 0.05. This means that you are willing to accept a 5% probability that, in your study, you will find a significant difference in failure rates between two screws and three screws, although in actuality there is none. You are also willing to accept a 20% false-negative rate that corresponds to a study power of 80%. This means that you are accepting a probability of 20% that, in actuality, there is a difference in failure rates between two screws and three screws, but that you will not find a difference in your study. In reverse, this means that you have an 80% probability of detecting a significant difference in your study when such a difference is truly present in the patient population that you are evaluating. Since we really cannot know whether a third screw has a positive or a negative effect on the failure rate, the sample-size calculation is two-tailed. 
In our example, the required study sample size is being calculated in the following way:

$$n_1 = n_2 = \frac{\left[\sqrt{2\,p_m q_m}\; z_{1-\alpha/2} + \sqrt{p_1 q_1 + p_2 q_2}\; z_{1-\beta}\right]^2}{\Delta^2}$$

with
$n_1$ = number of patients treated with two screws
$n_2$ = number of patients treated with three screws
$p_1$, $p_2$ = sample probabilities (25% and 30%)
$q_1$, $q_2$ = $1 - p_1$, $1 - p_2$ (75% and 70%)
$p_m$ = $(p_1 + p_2)/2$ (27.5%)
$q_m$ = $1 - p_m$ (72.5%)
$\Delta$ = difference = $p_2 - p_1$ (5%)
$z_{1-\alpha/2}$ = $z_{0.975}$ = 1.96 (for $\alpha$ = 0.05)
$z_{1-\beta}$ = $z_{0.80}$ = 0.84 (for $\beta$ = 0.2)
The z-scores that correspond to the desired study power and p value can be looked up in readily available statistical literature16 or on the Internet (keyword: "z-table"). From the above example, the z-score for a p value of 0.05 is 1.96 and the z-score for a study power of 80% is 0.84. In this example, we would need to treat 1250 patients with two screws and 1250 patients with three screws in order to detect a 5% decrease in failure rates from 30% to 25% with a probability of 80% to detect this difference if it truly exists (1 - false-negatives = power), accepting a 5% probability that we might find a difference although there truly is none (false-positive). A total of 2500 patients is quite a high number, especially for an orthopaedic study. This high sample-size requirement is typical for a dichotomous outcome parameter and often makes a study not feasible by a single surgeon or even a single center. An example of how to present a sample size calculation for a dichotomous outcome is shown in Table I.
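The calculation for the femoral neck example can be reproduced with a short Python sketch. The function name is hypothetical. The z-scores are the rounded table values quoted in the text (1.96 and 0.84), and using more decimal places would shift the result by about one patient per group.

```python
import math


def n_per_group_dichotomous(p1: float, p2: float,
                            z_alpha: float = 1.96,
                            z_beta: float = 0.84) -> int:
    """Patients needed per group to compare two proportions (two-tailed),
    using the formula given in the text with rounded z-scores."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    pm = (p1 + p2) / 2.0          # mean of the two proportions
    qm = 1.0 - pm
    delta = p2 - p1               # difference to be detected
    numerator = (math.sqrt(2.0 * pm * qm) * z_alpha
                 + math.sqrt(p1 * q1 + p2 * q2) * z_beta) ** 2
    return math.ceil(numerator / delta ** 2)   # round up to whole patients


if __name__ == "__main__":
    # Failure rate of 25% with three screws versus 30% with two screws.
    print(n_per_group_dichotomous(0.25, 0.30))  # 1250
```

The formula is symmetric in the two proportions, so the order in which the two failure rates are passed does not change the result.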
Analogous to calculating the sample size for dichotomous variables, when calculating a sample size for a continuous outcome variable, the key issue is to determine the level of clinically relevant difference that would convince an investigator to switch from one treatment to the other, or, even more importantly, the level of minimally important difference that would matter to the patient. Let us assume in this example that the primary outcome variable is the Short Form-36 physical functioning score. On the basis of values from the literature, we hypothesize that, in our control group (femoral neck fracture treated with two screws), the average Short Form-36 physical functioning score is 60 points (on a scale of zero to 100, with 100 representing the best possible score) and the standard deviation of the Short Form-36 physical functioning score is 24 points1. Let us assume an effect size of 0.5, which represents 0.5 of the standard deviation (or 12 points), which is the minimally important difference that would make us change clinical practice and use the alternative treatment (three screws). How many patients do we need to detect this difference? As in the previous example for dichotomous variables, we are willing to accept a standard 5% (p = 0.05) chance that we have a false-positive result (meaning there truly is no difference between both treatment options, but our study shows a significant difference) and a standard 20% chance (which equates to 80% study power) that our study will not show a significant difference, although there is one in the patient population from which we took our sample. In this example the required study sample size is being calculated in the following way (two-tailed):

$$n_1 = n_2 = \frac{2\,s^2 \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\Delta^2}$$

with
$n_1$ = number of patients treated with two screws
$n_2$ = number of patients treated with three screws
$\Delta$ = difference of outcome parameter between groups (12 points)
$s$ = sample standard deviation (24)
$z_{1-\alpha/2}$ = $z_{0.975}$ = 1.96 (for $\alpha$ = 0.05)
$z_{1-\beta}$ = $z_{0.80}$ = 0.84 (for $\beta$ = 0.2)
We would require enrollment of sixty-four patients in each treatment group (a two-screw group and a three-screw group) in order to detect a difference of 12 points in the Short Form-36 physical functioning score with a probability of 80% to detect this difference if it truly exists (1 - false-negative = power), accepting a 5% probability that we might find a difference although there truly is none (false-positive). One can immediately see that the sample-size requirement in this example is far smaller than in the previous example with a dichotomous variable (failure rate). An example of how to present a sample size calculation for a continuous outcome is shown in Table I.
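The continuous-outcome formula can be sketched in the same way. The function name is hypothetical and the z-scores are the rounded table values from the text. With these inputs the normal-approximation formula evaluates to 62.72, which rounds up to 63 per group, one patient fewer than the sixty-four reported above. The gap may reflect a small-sample correction applied by the software the authors used, but that is an assumption and is not stated in the text.

```python
import math


def n_per_group_continuous(sd: float, delta: float,
                           z_alpha: float = 1.96,
                           z_beta: float = 0.84) -> int:
    """Patients needed per group to detect a difference `delta` in a
    continuous outcome with standard deviation `sd` (two-tailed,
    normal approximation, rounded z-scores)."""
    n = 2.0 * sd ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    return math.ceil(n)   # round up to whole patients


if __name__ == "__main__":
    # Short Form-36 physical functioning: SD 24 points, difference 12 points.
    print(n_per_group_continuous(sd=24.0, delta=12.0))  # 63
```

Because only the ratio of the difference to the standard deviation enters the formula, any outcome score evaluated at an effect size of 0.5 yields the same requirement.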
Study investigators should also consider that some patients will drop out of the study after enrollment or will not be available for follow-up. Therefore, the actual number of patients that has to be enrolled in the study is higher than that determined by the sample-size calculation. Study dropout and loss to follow-up typically become more frequent as the follow-up period becomes longer. Investigators should typically add 5% to 10% to their sample size to account for losses to follow-up, depending on the characteristics of their sample and the study design.
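A minimal sketch of this adjustment follows. The function name is hypothetical, and it takes the text literally by adding a percentage to the calculated sample size and rounding up. Integer arithmetic is used so that the rounding is not affected by floating-point error.

```python
def enrollment_target(n_required: int, extra_percent: int) -> int:
    """Calculated sample size plus `extra_percent` percent, rounded up
    to the next whole patient."""
    return (n_required * (100 + extra_percent) + 99) // 100


if __name__ == "__main__":
    for n_required in (64, 1250):
        low = enrollment_target(n_required, 5)
        high = enrollment_target(n_required, 10)
        print(f"{n_required} per group -> enroll {low} to {high} per group")
```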
Studies in which a dichotomous outcome of interest is very rare can be challenging because the sample-size requirements are higher. Let us assume we are interested in exchange nailing (primary outcome) in patients who underwent reamed or unreamed nailing of a tibial shaft fracture. We hypothesize that the use of a reamed nail (experimental group) decreases the relative risk of necessitating an exchange nailing by 50% compared with the risk that would be associated with use of an unreamed nail (control group). The sample size needed in this study greatly depends on the baseline prevalence of the problem. Using the above equation for a sample-size calculation for dichotomous variables in order to detect a 50% decrease in relative risk of necessitating an exchange nailing after initial reamed nailing, we would need eighty-two patients in each group, if the prevalence of exchange nailing for unreamed nails is 40% (from 40% to 20%); 200 patients in each group, if it is 20% (from 20% to 10%); and 437 patients in each group, if it is 10% (from 10% to 5%) (Table I). To overcome the need for greater sample size if the outcome of interest is rare, study investigators can combine multiple outcomes of interest to one composite outcome. For example, instead of just looking at the rate of exchange nailing, consideration should be given to revision surgeries in general, including plating, nail dynamization procedures, and nail removal. The resulting higher number of outcome events increases statistical precision (narrows the confidence intervals), and consequently decreases the required sample size. However, composite outcomes can be problematic, because patients invariably do not consider all events included in the composite outcome to be of similar importance to them17. 
In our example, it should be obvious to patients and surgeons that, compared with nail dynamization or nail removal, exchange nailing or plating as a type of revision surgery has greater impact on the patients' function and is potentially associated with greater intraoperative risks. A second problem can be large differences in the prevalences of the single outcomes that make up the composite outcome. Let us assume that, in our example, the prevalence of nail dynamization is 20%, but the prevalence of exchange nailing is only 5%. When reporting a composite outcome that incorporates both events, the impact of nail dynamization would be exaggerated relative to the impact of exchange nailing. Readers could falsely assume that the treatment effect of reaming is a decrease in exchange nailing, whereas, in reality, it is mostly a decrease in nail dynamization. Therefore, it is justified to use composite outcomes to decrease sample-size requirements if patients attach similar importance to each outcome included in the composite and if the frequency of each outcome included in the composite is similar18. If those conditions are not met, the use of composite outcomes can create problems.
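The dependence of the sample size on baseline prevalence in the exchange-nailing example can be sketched with the dichotomous formula given earlier. The function name is hypothetical and the z-scores are the rounded values 1.96 and 0.84. With these inputs the sketch gives 82 per group for 40% versus 20%, and values within a few patients of the 200 and 437 quoted above for the rarer baselines. The small differences presumably come from rounding or from the software used, which is an assumption.

```python
import math


def n_per_group(p_control: float, p_experimental: float,
                z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Per-group sample size for comparing two proportions (two-tailed)."""
    pm = (p_control + p_experimental) / 2.0
    pooled_term = math.sqrt(2.0 * pm * (1.0 - pm)) * z_alpha
    separate_term = math.sqrt(p_control * (1.0 - p_control)
                              + p_experimental * (1.0 - p_experimental)) * z_beta
    delta = p_control - p_experimental
    return math.ceil((pooled_term + separate_term) ** 2 / delta ** 2)


if __name__ == "__main__":
    # A 50% relative risk reduction at three baseline prevalences.
    for baseline in (0.40, 0.20, 0.10):
        reduced = baseline / 2.0
        print(f"{baseline:.0%} -> {reduced:.0%}: "
              f"{n_per_group(baseline, reduced)} patients per group")
```

The same relative risk reduction requires several times as many patients when the baseline event is rare, which is the motivation for composite outcomes discussed above.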
Compared with the use of only a single outcome measure, the use of multiple outcome measures, including generic and disease-specific health-related quality-of-life instruments, can improve the clinical relevance of a study and increase the value of a study for future clinical decision-making. When doing so, it is important to designate a primary outcome measure a priori in order to avoid increased sample-size requirements that are related to the p-value adjustment that is necessary to offset the increased risk of false-positive results associated with multiple outcome measures. Although the choice of the primary outcome measure should be guided by its clinical relevance, the feasibility of conducting a study with a certain outcome measure needs to be considered. From a practical point of view, it is important to realize that continuous outcome measures, such as health-related quality-of-life scores, are associated with lower sample-size requirements than are categorical outcome measures such as mortality, revision surgery rates, or union rates. The use of composite outcomes can overcome increased sample-size requirements that are associated with a low event rate of a dichotomous outcome of interest; however, doing so is only adequate if patients attach similar importance to each outcome included in the composite outcome measure and if the frequency of each outcome included in the composite is similar.
Despite criticisms, there is sound empirical evidence to suggest that half of a standard deviation of a continuous outcome measure, such as a health-related quality-of-life instrument, represents a minimally clinically important difference. This information can be used for sample-size calculations, especially in the absence of a clear consensus on what constitutes a minimally clinically important difference for an outcome measure of choice. 
1. Ware JE Jr, Snow KK, Kosinski M, Gandek B. SF-36 health survey: manual and interpretation guide. Boston: The Health Institute, New England Medical Center; 1996.
2. EuroQol - a new facility for the measurement of health-related quality of life. The EuroQol Group. Health Policy. 1990;16:199-208.
3. Hudak PL, Amadio PC, Bombardier C. Development of an upper extremity outcome measure: the DASH (disabilities of the arm, shoulder and hand) [corrected]. The Upper Extremity Collaborative Group (UECG). Am J Ind Med. 1996;29:602-8. Erratum in: Am J Ind Med. 1996;30:372.
4. Bonferroni CE. Il calcolo delle assicurazioni su gruppi di teste. In: Studi in onore del Professore Salvatore Ortu Carboni. Rome; 1935. p 13-60.
5. Jaeschke R, Singer J, Guyatt GH. Measurement of health status. Ascertaining the minimal clinically important difference. Control Clin Trials. 1989;10:407-15.
6. Hays RD, Farivar SS, Liu H. Approaches and recommendations for estimating minimally important differences for health-related quality of life measures. COPD. 2005;2:63-7.
7. Wyrwich KW, Metz SM, Kroenke K, Tierney WM, Babu AN, Wolinsky FD. Measuring patient and clinician perspectives to evaluate change in health-related quality of life among patients with chronic obstructive pulmonary disease. J Gen Intern Med. 2007;22:161-70.
8. Wyrwich KW, Nienaber NA, Tierney WM, Wolinsky FD. Linking clinical relevance and statistical significance in evaluating intra-individual changes in health-related quality of life. Med Care. 1999;37:469-78.
9. Johnson JA, Coons SJ, Ergo A, Szava-Kovats G. Valuation of EuroQOL (EQ-5D) health states in an adult US sample. Pharmacoeconomics. 1998;13:421-33.
10. Swiontkowski MF, Engelberg R, Martin DP, Agel J. Short musculoskeletal function assessment questionnaire: validity, reliability, and responsiveness. J Bone Joint Surg Am. 1999;81:1245-60.
11. Ware JE Jr, Sherbourne CD. The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection. Med Care. 1992;30:473-83.
12. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
13. Norman GR, Sloan JA, Wyrwich KW. Interpretation of changes in health-related quality of life: the remarkable universality of half a standard deviation. Med Care. 2003;41:582-92.
14. Miller GA. The magic number seven plus or minus two: some limits on our capacity for processing information. Psychol Rev. 1956;63:81-97.
15. Sung J, Siegel J, Tornetta P, Bhandari M. The orthopaedic trauma literature: an evaluation of statistically significant findings in orthopaedic trauma randomized trials. BMC Musculoskelet Disord. 2008;9:14.
16. Motulsky H. Intuitive biostatistics. Oxford: Oxford University Press; 1995. p 374.
17. Freemantle N, Calvert M, Wood J, Eastaugh J, Griffin C. Composite outcomes in randomized trials: greater precision but with greater uncertainty? JAMA. 2003;289:2554-9.
18. Montori VM, Permanyer-Miralda G, Ferreira-González I, Busse JW, Pacheco-Huergo V, Bryant D, Alonso J, Akl EA, Domingo-Salvany A, Mills E, Wu P, Schünemann HJ, Jaeschke R, Guyatt GH. Validity of composite end points in clinical trials. BMJ. 2005;330:594-6.