Extract
To interpret the quality of the information presented in an orthopaedic study, it is necessary to have a basic knowledge of study design principles and statistics. Similarly, to conduct a study, surgeons need to understand the basic research concepts to consider during the design phase to ensure the results add as much evidence as possible. This article presents the basics of study design and statistical concepts relevant to an orthopaedic surgical researcher.
In designing a study protocol, researchers should use the strongest study design that adds the most to the evidence base in a practical and ethical way. The study question, the rarity of the condition to be investigated, and the resources available determine which design is most appropriate for a given study1-3. In general, clinical studies fall into two categories: experimental and observational4,5.
In experimental studies, a patient's exposure to an intervention is manipulated by the researcher and the patient is observed for a period of time to detect the effects of this intervention on predetermined outcome measures. The manipulation involves the random allocation of patients to groups, and hence studies of this type are termed randomized trials. The effect of randomization is to control as many factors as possible so the groups are similar except for the variable of interest6. In this way, a causative link can be determined.
Observational studies allow for associations between variables to be investigated, but the evidence is not strong enough to establish a causal link5. Because the study is not controlled, there are variables that may influence the results and, therefore, there is inadequate confidence to say that x caused y. In observational studies, a patient's exposure and outcome status are observed with no influence from the researcher. The timing of the exposure and outcome defines each observational study design.
In cohort studies, patients are allocated to groups according to their exposure to a treatment or condition of interest and are followed for a period of time to determine if they develop a specified outcome (Fig. 1). At the end of the study period, the outcome in the exposed group is compared with that in the unexposed group. Cohort studies are not feasible when the condition of interest is rare because the follow-up period required to get a large enough sample size to show an effect is prohibitive. The major problem with the cohort study design is that it is not possible to be certain the cohorts were well matched and/or to know what factors influenced the study results (these factors are known as confounders, described in more detail below).
In a case-control study, a group of patients with a defined condition (cases) and a group of patients without the condition (controls) are compared with regard to their previous exposure to a risk factor (Fig. 1). This is a relatively quick and inexpensive study design that is particularly useful for rare conditions or when the time between exposure and outcome is long. While this study design is also subject to confounding, matching cases and controls on a number of variables at the time of analysis can help to decrease the effect that confounding variables can have on the result. For example, a case patient can be matched with a control patient by sex and age within five years (as well as other study-specific variables) to ensure that the two groups are as similar as possible.
Cross-sectional studies analyze patients at a single point in time (Fig. 1) and are useful to determine what proportion of patients have a condition at that time point (known as prevalence) and what other factors may be associated with the condition. This study design relies on accurate recall by the patient with regard to whether he or she had been exposed to a certain risk factor or had a condition of interest.
Case series and case reports are useful to describe novel conditions or treatments in a small number of patients (generally from a single institution or surgeon). They are useful for conveying information promptly to other surgeons, and the information presented in such reports may be used to generate hypotheses to design studies with use of a more robust study design.
Whatever the chosen design, surgeons should be aware of the overall validity of the study in relation to the known hierarchy of evidence.
Keeping up to date with the latest research findings while attending to a busy clinical practice is time-consuming and difficult for the surgeon. In assessing the literature to determine best evidence-based practice, a number of factors can be taken into account. One of the simplest and most commonly used criteria is the level of evidence. A hierarchy exists that places each study along a continuum based on the likelihood of bias within a given study design. The levels of evidence differ depending on the primary focus of the study (i.e., therapy, prognosis, harm, or economic analysis)5,7. Many different hierarchies are available5,8,9, but a general overview is given in the present report.
The gold standard by which clinical research is judged is the randomized controlled trial, which is designated the highest level of evidence (level I). Similarly, meta-analyses or systematic reviews that collate the results of high-quality randomized controlled trials are also designated as level-I evidence. Level-II evidence includes lesser-quality trials (e.g., those with <80% patient follow-up or improper randomization), systematic reviews or meta-analyses of these second-tier studies, and prospective cohort studies. Level-III evidence includes case-control studies, retrospective cohort studies, and systematic reviews of these third-tier studies. The lower levels of evidence (levels IV and V) include case series and expert opinion, respectively.
Level-I and II reports should be thoroughly assessed to ensure that the methodology meets all of the criteria for the level assigned7,10,11. For example, the authors may have performed a cohort study, but is there a suspicion of bias in the design of the study or are there confounding variables that the authors have not noted and/or accounted for? This additional step of reviewing study methodology is the key to evidence-based practice. A randomized trial with a poor methodology and a relatively small sample size will not necessarily provide more evidence than a well-conducted cohort study. All research has value, and it is up to the clinician to determine the value of each piece of information in a study12.
It is generally not logistically possible for a researcher to obtain data from the entire population that he or she wants to study. The role of statistics is to allow a researcher to sample a portion of the population and to use probability to decide whether the findings from the sample are likely to apply to the entire population13. In this process of research, a number of jargon terms with specific meanings are used.
Bias is a systematic error in the design or conduct of a study that produces an outcome that is different from the underlying truth14. There are several types of bias that need to be considered, depending on the study design. Attrition bias refers to systematic differences between the groups with regard to the number of participants who are lost to follow-up, withdraw consent, or die. Expertise bias exists when a surgeon has a higher competence or familiarity with one study procedure over another, meaning that there is a chance of a better outcome in one group. Recall bias refers to a phenomenon whereby a patient who has an adverse outcome is more likely to recall an exposure than a patient who has a better outcome, independent of the true exposure. Selection bias occurs when the allocation of groups leads to a difference in the baseline characteristics of one study group compared with another.
A confounding variable (or confounder) acts to distort the association between two variables of interest because of its strong relationship with both variables. For example, an association between infection rates and open fractures may be confounded by the severity of the initial soft-tissue injury of the open fracture wound, since it would be expected that the larger wounds, which require longer and possibly more operations, would be more susceptible to infection. Frequent confounders include sex, age, socioeconomic status, and comorbidities.
The null hypothesis is the default hypothesis that assumes that there are no differences between the groups. The alternative hypothesis suggests that there is a difference between the study groups. A statistical test evaluates how likely the observed data (or data more extreme) would be if the null hypothesis were true; it does not give the probability that the null hypothesis itself is true.
A type-I error, or alpha error, occurs when the null hypothesis is rejected when in fact it is true, leading to a false-positive result. That is, the researcher concludes that there is a difference between the groups when there is not. The probability of committing such an error, the alpha level, is generally set at the 0.05 level, which means that, when there is truly no difference between the groups, the statistical test will find a difference purely by chance 5% of the time.
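The meaning of the alpha level can be illustrated with a small simulation. The sketch below is not taken from the article: it assumes two groups drawn from the same normal distribution (so the null hypothesis is true by construction) and, for simplicity, a two-sample z test with a known standard deviation. The function names, group size, and number of simulated studies are illustrative choices only.

```python
import random
from statistics import NormalDist, mean


def two_sample_z_p_value(a, b, sigma):
    """Two-sided p value for a difference in means, assuming a known common SD."""
    standard_error = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / standard_error
    return 2 * NormalDist().cdf(-abs(z))


def false_positive_rate(n_trials=2000, n_per_group=30, alpha=0.05, seed=1):
    """Fraction of simulated studies that reject a null hypothesis that is true."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    rejections = 0
    for _ in range(n_trials):
        # Both groups come from the same population: any "difference" is chance.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sample_z_p_value(group_a, group_b, sigma=1.0) < alpha:
            rejections += 1
    return rejections / n_trials


if __name__ == "__main__":
    print(f"Observed false-positive rate: {false_positive_rate():.3f}")
```

With alpha set at 0.05, the observed rate should settle close to 5% as the number of simulated studies grows.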
A type-II error, or beta error, occurs when the researcher fails to reject the null hypothesis when in fact it is false, leading to a false-negative result. The probability of committing such an error, the beta level, is generally set at the 0.20 level (i.e., 20%). This means that the researcher is willing to accept a 20% chance of concluding that there are no differences between the groups when there actually is one.
The effect size is the difference between the groups that needs to be detected to establish a true difference and is based on the results of pilot data or values in the literature. It is used in the calculation of sample size.
Study power refers to the probability of concluding that there is a difference between the groups when there actually is one and is equal to: 1 – (the value of beta error). By convention, beta is set at 0.20 and the power of the study is set at ≥80%, which means that there is a ≤20% chance that the study will demonstrate no significant difference when there is one. Studies are said to be “underpowered” when the power of the study is <80%.
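In symbols, writing H_0 for the null hypothesis, the two error rates and the study power can be restated as follows, using the conventional values quoted in the text:

```latex
\begin{aligned}
\alpha &= P(\text{reject } H_0 \mid H_0 \text{ true}) = 0.05 \\
\beta &= P(\text{fail to reject } H_0 \mid H_0 \text{ false}) = 0.20 \\
\text{power} &= 1 - \beta = 1 - 0.20 = 0.80
\end{aligned}
```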
The statistical p value is a measure of the strength of the evidence provided by the data against the null hypothesis: it is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. When the p value is below alpha, the evidence is considered strong enough for the null hypothesis to be rejected, and the result is described as significant.
Normality refers to whether the data approximate the shape of a bell curve when plotted as a histogram. Most statistical tests assume that the data fit the bell curve. When this assumption is broken and the data are skewed, other types of analyses (known as nonparametric statistics) should be undertaken; otherwise, the conclusions drawn may not be valid. In the small samples that are frequently analyzed in orthopaedic studies, it is rare to obtain normal data.
Categorical data (also known as qualitative data) assign each patient to a single category or type. The simplest example of categorical data is a binary variable consisting of two possible groups (e.g., male/female and fracture united/not united). Categorical variables that have three or more groups that have no natural order (e.g., blunt/penetrating/burn) are called nominal data, whereas the variables that have a natural progression through the categories that is not to scale (e.g., Gustilo open fracture classification) are known as ordinal data. Categorical data are often presented in papers as percentages in tables or bar charts.
In the case of quantitative data, differences between numbers have meaning across a scale (e.g., age, height, and weight). When the variable can only have values that are integers, the data are known as discrete. Data that can be recorded to n decimal places are called continuous data. Quantitative data can be condensed to ordinal categorical data; for example, the age of patients participating in a hip fracture study could be broken into categories of fifty to sixty years, sixty to seventy years, and greater than seventy years of age. However, in designing a study and planning the data collection methods, when possible, it is preferable to obtain the data as a continuous variable as this can be changed to discrete or categorical data, but the reverse is not possible15.
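Condensing a continuous variable into ordinal categories is a one-way operation, as the following sketch illustrates. The ages are hypothetical, and the category boundaries are an assumption: the ranges given in the text overlap at sixty and seventy years, so half-open intervals are used here, with an extra category for patients younger than fifty.

```python
def age_category(age_years):
    """Map a continuous age to an ordinal category (assumed boundaries)."""
    if age_years < 50:
        return "<50"
    if age_years < 60:
        return "50-59"
    if age_years < 70:
        return "60-69"
    return ">=70"


# Hypothetical ages recorded as a continuous variable.
ages = [52.4, 59.9, 60.0, 68.3, 71.5, 84.0]
categories = [age_category(age) for age in ages]

if __name__ == "__main__":
    for age, category in zip(ages, categories):
        print(f"{age:5.1f} years -> {category}")
```

Different ages (for example, 52.4 and 59.9 years) fall into the same category, so the original values cannot be recovered once only the category has been recorded.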
Before the statistical tests are performed, it is a good idea to conduct an exploratory data analysis. This simply involves obtaining such values as the mean, median, and range to ensure that all values entered into the database are appropriate values. This helps to detect data entry errors or gross outliers that need to be checked for accuracy. In addition, a histogram of the data should be plotted to examine the assumption of normality.
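A minimal sketch of such a check is shown below, using only the Python standard library. The data, the helper names `summarize` and `implausible`, and the plausible age limits are hypothetical and chosen purely for illustration.

```python
from statistics import mean, median


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


def implausible(values, low, high):
    """Return the values outside a plausible range, for checking against source records."""
    return [v for v in values if v < low or v > high]


# Hypothetical patient ages in years; 760 is probably a data entry error.
ages = [67, 72, 58, 81, 760, 69]

if __name__ == "__main__":
    print(summarize(ages))
    print("Values to check:", implausible(ages, low=18, high=110))
```

A single mistyped value pulls the mean far away from the median, which is often the first sign that an entry needs to be checked.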
In deciding which statistical test to use, the researcher must first define what type of variables he or she wants to analyze. Table I provides a summary of the tests that are appropriate for specified data types. In basic terms, for comparisons of groups, the test to use is dictated by the type of data and whether the data are normally distributed. For determining associations or creating models, the type of data determines which test to use. While this information allows for data analysis with use of the appropriate test, it is very important that the assumptions for each test are examined to ensure that the results and conclusions are valid. For this reason, although modern-day computer programs allow anyone to obtain a p value for his or her data and to present the results to his or her peers, it is always best to consult a statistician to ensure that the correct test is being used and that all assumptions are being met. However, this does not detract from the importance of the clinical researcher having a basic understanding of statistical principles.
As an example, if a researcher wanted to compare hospital lengths of stay (continuous data) for patients having a Gustilo type-IIIB open tibial shaft fracture treated with either the principles of early total care or the principles of damage control orthopaedics, a Student t test would be appropriate when the length-of-stay data are normally distributed. If the data were skewed (as is common for length of stay), the assumption of normality underlying the t test would not be met and the nonparametric Mann-Whitney U test would be more appropriate. Note that when the Student t test is used, the results should be summarized with use of mean and standard deviation values, and when the Mann-Whitney U test is used, data should be summarized with use of median and range values. Another alternative is to create a so-called dummy variable by breaking the continuous variable up into categorical groups. In this example, one could arbitrarily split the continuous length-of-stay data into weeks (e.g., less than seven days, seven to thirteen days, fourteen to twenty days, and greater than twenty days) and analyze the data with use of the chi-square test. Of course, the choice of analysis should always be defined in the protocol prior to collecting the data, as so-called data-dredging (performing multiple analyses in search of a significant result) can be expected to produce a false-positive significant result by chance, on average, in one of every twenty comparisons (when alpha equals 0.05).
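For skewed length-of-stay data, the Mann-Whitney U comparison might be sketched as follows. This is an illustration under stated assumptions, not a validated implementation: the lengths of stay are invented, the p value uses the large-sample normal approximation with a tie correction and no continuity correction, and the function names are hypothetical. A real analysis would use an established statistical package, ideally with a statistician's input.

```python
from statistics import NormalDist, median


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, two-sided p value) using the normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    # Tie correction for the variance of U.
    counts = {}
    for value in combined:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    sigma = (n1 * n2 / 12 * ((n + 1) - tie_term)) ** 0.5
    if sigma == 0:  # every observation identical: no evidence of a difference
        return u, 1.0
    z = (u - n1 * n2 / 2) / sigma
    return u, 2 * NormalDist().cdf(-abs(z))


# Hypothetical, right-skewed lengths of stay in days.
early_total_care = [9, 11, 12, 14, 15, 18, 21, 45]
damage_control = [12, 16, 17, 20, 22, 25, 31, 60]

if __name__ == "__main__":
    for name, days in (("ETC", early_total_care), ("DCO", damage_control)):
        print(f"{name}: median {median(days)} days, range {min(days)}-{max(days)}")
    u, p = mann_whitney_u(early_total_care, damage_control)
    print(f"U = {u}, p = {p:.3f}")
```

Each group is summarized with the median and range, in keeping with the recommendation in the text for data analyzed with the Mann-Whitney U test.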
Ideally, a researcher would enroll as many patients as his or her resources, time constraints, and ethics committee would permit. The reality is that a researcher may have only enough funding to hire a research assistant to collect data for one year. As such, a sample size calculation early in the process of designing a study is important for a number of reasons. First, it provides information about the feasibility of the study. If a researcher wanted to conduct a study to investigate the role of pelvic binders, angiographic embolization, pelvic packing, and early internal fixation on reducing mortality following pelvic fracture and the sample size calculation finds that sixty patients are needed to find a significant difference, the study is not feasible if the researcher generally treats twelve pelvic fractures in a year. Second, a sample size calculation can provide information on whether the study has enough power to detect a clinically relevant difference. Third, sample size is important from an ethical standpoint. An undersized study subjects patients to the added burden and the potentially harmful experimental treatment of a study that is unlikely to advance the evidence base. Fourth, a sample size calculation helps as the researcher finalizes the study protocol and looks at writing grant applications because it provides a guide as to what resources are needed.
Sample size calculation is more of an art than a science, as educated guesses are used to indicate how many patients are required to detect a researcher-defined clinically relevant difference. The particular method used to perform a sample size calculation differs, depending on the type of study being performed. Importantly, while many calculation tools are available in textbooks and on the Internet, they tend to use more advanced statistical jargon terms that can be confusing and some may require knowledge of how to read statistical tables of distributions that are more at home in the appendices of statistical textbooks. As such, it is advisable for a researcher, especially one who is new to research methodology, to seek the expertise of a statistician. However, with the following key elements, there are some sample size calculators that may be used with confidence:
Type-I error (alpha): It is generally set at the 0.05 level.
Type-II error (beta): It is generally set at the 0.20 level.
Power: It is generally set at 80%.
Effect size: This is where a thorough literature review, results from a pilot study, and clinical experience all meld together to come up with a figure that defines how much of a difference one can expect to observe between the study groups. For example, a researcher may find a previous study done in another country that observed differences between the groups of 15%; however, in the researcher's experience at his or her institution, the difference is more likely to be 25%. As such, the researcher may select an effect size of 20% to use in the calculation, remembering that the smaller the effect size, the larger the sample size needs to be to detect a clinically relevant difference.
Variability: Again, a review of the literature or the results of a pilot study will tell a researcher the expected standard deviation within each group.
It is important to remember that the number produced by the sample size calculator is for the minimum number of participants required, given the values for effect size and/or variability that were entered. Extra participants are often added to account for those who drop out or withdraw.
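The way these elements combine can be sketched for one common case. The example below assumes a comparison of the means of two equal-sized groups and uses the standard normal-approximation formula, n per group = 2 × ((z for alpha + z for power) × SD / effect size)², rounded up. The article does not prescribe this particular formula, the input values are hypothetical, and other study designs (for example, comparisons of proportions) need different calculations.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, sd, alpha=0.05, power=0.80):
    """Minimum participants per group to detect a difference in means.

    effect_size: smallest difference between group means worth detecting
    sd: expected standard deviation within each group (same units)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type-I error
    z_power = NormalDist().inv_cdf(power)          # power = 1 - beta
    return ceil(2 * ((z_alpha + z_power) * sd / effect_size) ** 2)


def inflate_for_dropout(n, dropout_rate):
    """Increase enrolment so that n participants remain after expected dropout."""
    return ceil(n / (1 - dropout_rate))


if __name__ == "__main__":
    # Hypothetical inputs: a 5-point difference in an outcome score with SD 10.
    minimum = n_per_group(effect_size=5, sd=10)
    print("Minimum per group:", minimum)
    print("Enrol per group (20% dropout):", inflate_for_dropout(minimum, 0.20))
```

Halving the effect size roughly quadruples the required sample size, which is why the choice of a clinically relevant difference has such a large influence on feasibility.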
While it is tempting to scan a paper for significant p values (p < 0.05) as the basis for determining the importance of the results, this approach is often quite misleading. It equates to thinking of the p value as a dichotomous variable rather than a continuous one16 and fails to recognize the foundation of inferential statistics, whereby, at an alpha level of 0.05, a significant finding can be expected purely by chance in about one of twenty analyses in which no true difference exists. In addition, significant findings may not reflect clinically important outcomes.
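The arithmetic behind the one-in-twenty figure can be made explicit. The sketch below assumes that the comparisons are independent and that no true difference exists in any of them; the function names are illustrative.

```python
def expected_false_positives(n_comparisons, alpha=0.05):
    """Average number of significant results produced by chance alone."""
    return n_comparisons * alpha


def prob_at_least_one_false_positive(n_comparisons, alpha=0.05):
    """Chance that at least one of several independent comparisons is falsely significant."""
    return 1 - (1 - alpha) ** n_comparisons


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(
            f"{k:2d} comparisons: expected false positives "
            f"{expected_false_positives(k):.2f}, P(at least one) "
            f"{prob_at_least_one_false_positive(k):.2f}"
        )
```

Under these assumptions, twenty comparisons yield one false-positive result on average and about a 64% chance of at least one, which is why a lone significant p value in a paper reporting many analyses deserves caution.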
The methods section could be considered to be the most important section of a paper because if the methodology is not sound, then the results of the research will not be valid17. One of the simplest things to look at is the number of patients enrolled. Small sample sizes result in underpowered studies that have little chance of finding a clinically relevant effect size. Therefore, studies with small samples that find a larger effect size than one would expect should be suspected of having biased methodology, which should be examined thoroughly.
In conclusion, the summary of research methodology in the present report should be considered by orthopaedic surgeons as they read journal articles or sit in conference presentations. The basics of study design and statistics are provided to direct the clinical researcher to pay particular attention to the methodology of orthopaedic investigations. Such knowledge is invaluable for surgeons who want to be able to judge the value of new clinical evidence described in studies and for those who want to develop their own investigations.
1. Ward RC, Hruby RJ, Jerome JA, Jones JM, Kappler RE. Foundations for osteopathic medicine. 2nd ed. Philadelphia: Lippincott Williams & Wilkins; 2002.
2. Souba WW, Wilmore DW. Surgical research. 1st ed. San Diego: Academic Press; 2001.
3. Gallin JI, Ognibene FP. Principles and practice of clinical research. 2nd ed. Burlington: Elsevier; 2007.
4. Grimes DA, Schulz KF. An overview of clinical research: the lay of the land. Lancet. 2002;359:57-61.
5. Bhandari M, Joensson A. Clinical research for surgeons. New York: Thieme; 2009.
6. Schulz KF, Grimes DA. Generation of allocation sequences in randomised trials: chance, not choice. Lancet. 2002;359:515-9.
7. Guyatt GH, Haynes RB, Jaeschke RZ, Cook DJ, Green L, Naylor CD, Wilson MC, Richardson WS. Users’ Guides to the Medical Literature: XXV. Evidence-based medicine: principles for applying the Users’ Guides to patient care. Evidence-Based Medicine Working Group. JAMA. 2000;284:1290-6.
8. Atkins D, Eccles M, Flottorp S, Guyatt GH, Henry D, Hill S, Liberati A, O'Connell D, Oxman AD, Phillips B, Schünemann H, Edejer TT, Vist GE, Williams JW Jr; GRADE Working Group. Systems for grading the quality of evidence and the strength of recommendations I: critical appraisal of existing approaches The GRADE Working Group. BMC Health Serv Res. 2004;4:38.
9. Schünemann HJ, Bone L. Evidence-based orthopaedics: a primer. Clin Orthop Relat Res. 2003;413:117-32.
10. Bhandari M, Guyatt GH, Swiontkowski MF. User's guide to the orthopaedic literature: how to use an article about prognosis. J Bone Joint Surg Am. 2001;83:1555-64.
11. Bhandari M, Guyatt GH, Swiontkowski MF. User's guide to the orthopaedic literature: how to use an article about a surgical therapy. J Bone Joint Surg Am. 2001;83:916-26.
12. Poolman RW, Petrisor BA, Marti RK, Kerkhoffs GM, Zlowodzki M, Bhandari M. Misconceptions about practicing evidence-based orthopedic surgery. Acta Orthop. 2007;78:2-11.
13. Dowdy S, Weardon S, Chilko D. Statistics for research. 3rd ed. Hoboken: John Wiley & Sons, Inc.; 2004.
14. Bhandari M, Tornetta P 3rd, Guyatt GH. Glossary of evidence-based orthopaedic terminology. Clin Orthop Relat Res. 2003;413:158-63.
15. Riffenburgh RH. Statistics in medicine. 2nd ed. San Diego: Elsevier Academic Press; 2006.
16. Rosnow RL, Rosenthal R. Statistical procedures and the justification of knowledge in psychological science. Am Psychol. 1989;44:1276-84.
17. Urschel JD. How to analyze an article. World J Surg. 2005;29:557-60.