The eleven AAOS instruments are designed to assess the degree
to which a patient’s condition or conditions affect his or
her physical and emotional functioning, self-image, and symptom
status. These measures are self-reported and cover five general
areas of musculoskeletal function: the lower extremity, the upper
extremity, pediatric musculoskeletal function, the spine, and general
musculoskeletal function. Each of the instruments also includes
a three-response-option comorbidity checklist of fourteen
conditions or disorders16-18.
In addition, seven of the eleven instruments include the Short Form-36
(SF-36) Health Survey questionnaire19.
In the current study, normative data were collected for the lower
limb, sports knee, foot and ankle, hip and knee, upper extremity
(Disabilities of the Arm, Shoulder and Hand [DASH] questionnaire),
cervical spine, lumbar spine, and general musculoskeletal function
(Short Musculoskeletal Function Assessment [SMFA])
measures. In addition to the adult surveys, three different pediatric
surveys were administered. These consisted of a survey for children
(completed by parents of children who were between two and ten years
old) and two surveys for adolescents (one completed by adolescents
between the ages of eleven and eighteen and one, by a parent of
that adolescent). Parents who were surveyed were instructed to respond
by proxy about a specified child or adolescent. Adolescents who
were surveyed were matched to parents receiving proxy surveys and
were instructed to respond for themselves. As an inducement to complete
the survey, adolescents received a $5.00 payment.
The study was completed in four major stages: (1) the development
of a sampling design and sampling strategy, (2) the design and reconfiguration
of the clinical measures for the general population, (3) sample
selection and data collection, and (4) data cleaning and data analyses,
including application of scoring algorithms provided by the AAOS
to generate scaled scores, reliability and validity tests of the
data, and application of data reduction techniques to produce standardized
scaled scores within specified age, gender, ethnicity, and comorbidity
strata.
Sampling Design
The sampling methodology for this project was designed to garner
data representative of the general population of the United States
stratified by the following demographic markers: gender, comorbid
conditions, ethnicity, and specific age-groups. To meet this requirement,
a panel methodology was selected20,21.
The panel was a group of households recruited by National Family
Opinion.
Specifically, the National Family Opinion’s household
panel is a reliable sample of more than 475,000 households of individuals
and the families with whom they reside that are representative of
the United States population. Respondents are matched to the United
States Census data with respect to geographical region, age, income,
and household size. The panel is managed to maintain additional
demographic information such as the gender of household members.
Each year, approximately one-third of the National Family Opinion
panel is rotated out of participation and new panel participants
are recruited to replace those members. Panel households are balanced
not only demographically in the four Census regions and nine Census
divisions but also in correct proportion by state within each division.
Panel members are not compensated for participation. No inducement
was offered in this study with the exception of the $5.00
given to the adolescent cohort. Households are identified by frequently
used geographic classifications to provide complete sampling and identification
flexibility. This approach to sampling, with use of a single-wave
mail questionnaire, was deemed appropriate for a number of reasons.
First, it was assumed that the information provided by the respondents
would be valid and reliable. Each of the musculoskeletal measures
required that respondents provide information about their
physical, emotional, and social functioning capabilities as well
as symptom status. No items that were intended to elicit highly
sensitive or personal information, such as self-disclosure about
alcohol or drug misuse, feelings or beliefs about other people,
or attitudes about controversial social issues, were included. However,
the DASH measure does include one item on sexual function.
Previous research has shown that including items that ask subjects
to disclose information that they would not typically discuss in
casual conversation not only reduces response rates dramatically
but also calls into question the validity of the responses22. Requiring respondents to reveal
this type of information can trigger a "social desirability
effect," whereby individuals tend to respond to personal
questions with answers that they believe are socially acceptable22. Because no such items are included
in the AAOS measures, it was assumed that responses to questions
included in those measures would be both candid and truthful.
Second, the size and scope of the study required a cost-efficient
and expeditious methodology. Compared with random samples generated
with use of census tract data or other methods, which typically
produce low response rates in the 20% to 25% range
with use of a multiple-wave mailing strategy, panel studies have
been shown to yield response rates of 60% or higher with
a single-wave mailing20,23. Additionally,
with the exception of comorbid conditions, the demographic markers
(age, gender, etc.) required for post-stratification were known
for the panel before selection. This information, along with the
high response rate associated with panel studies, facilitated sampling,
permitted careful targeting of respondents to increase the likelihood
that the margin of error set a priori for each measure (±3
points on a 100-point metric) would be met, and ensured acceptable
sample representation within strata.
Finally, by monitoring response rates within strata, decisions regarding
additional sampling could be made and executed promptly. This shortened
the time required to complete the data collection phase of the study.
Data were collected over a six-week period in the spring of 1999.
Survey Design
Of primary concern in the survey design process was the reconfiguration
of the eleven condition-specific clinical questionnaires into measures
appropriate for the general population. Questionnaires were modified
to increase the validity, reliability, and utility of the data gathered
for this project. For example, the condition-specific sections of
the clinical questionnaires were all prefaced with comments that
asked patients to answer items in reference to a particular orthopaedic
condition for which they were currently being treated or for which
they were receiving follow-up care (e.g., "Please answer
questions about your foot/ankle which is being treated.").
These types of instructions are clearly inappropriate for the general
population. Therefore, a number of simple, yet necessary, changes
were made in survey layout and wording.
First, instructions preceding each of the subsections of the clinical
measures were amended to elicit general evaluations of the respondent’s
hip, knee, ankle, etc. For example, for the lower limb measure,
the banner was edited to read: "Please answer the following
questions about your lower limb, which includes the hip, leg, foot,
and toes. These questions are about how you have felt on average.
It is very important that you fill out each item." To increase
the likelihood that respondents would respond to all of the items,
a motivational sentence stressing the importance of completing each
item was included.
Second, a series of screening and skip questions was added to allow
clinicians to identify and select into subcategories those individuals
within the sample who (1) had never had a problem with, e.g., a
lower limb, (2) had had a problem with a lower limb but had never sought
medical treatment, or (3) had received medical treatment for a lower
limb. All scores reported in the present study were derived from
the total samples and were not broken out on the basis of the responses
to the screening questions.
A number of skip pattern instructions were included in the screening
question described above. The first skip pattern instructed respondents
who had indicated that they had received medical treatment to indicate
whether they had had surgery for the problem. The respondents who
answered that they had a problem but had not sought medical treatment
were instructed to mark "All That Apply" to a
list of reasons why they had not sought treatment (e.g., "My
lower limb problem/injury was not serious enough to seek
treatment."). Respondents who had never had a problem were
instructed to skip both of those items.
Taken together, the changes in survey format were relatively minor,
and all items included in the clinical measures were retained. The
reconfigured questionnaires, formatted for the general population,
are essentially isomorphic to the clinical measures.
Sample Selection and Data Collection
Selection of the samples for the eleven measures was completed
with use of random-sample-selection algorithms. To ensure that the
requirements of the study would be met post-stratification, demographic
frequency counts were conducted on each sample drawn to determine
that adequate representation criteria within strata were met.
Personalized survey questionnaires were sent by direct mail to sampled
panel participants. For each of the eleven measurement instruments,
an initial 2920 surveys were mailed. On the basis of previous experiences
with the National Family Opinion panel, this mailing was expected
to yield a 65% response rate that would render approximately
1898 usable surveys for each of the eleven conditions.
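The arithmetic behind the mailing size can be sketched as follows. This is only an illustration: the worst-case margin-of-error formula for a proportion (p = 0.5, 95% confidence) is an assumption made here, not a formula reported by the authors, and the function names are hypothetical.

```python
import math


def expected_returns(mailed: int, response_rate: float) -> int:
    """Number of usable surveys expected from a single-wave mailing."""
    return round(mailed * response_rate)


def margin_of_error(n: int, z: float = 1.96, p: float = 0.5) -> float:
    """Worst-case margin of error, in percentage points.

    Assumes simple random sampling and a proportion-style estimate;
    the study's own precision calculation may have differed.
    """
    return 100 * z * math.sqrt(p * (1 - p) / n)


n = expected_returns(2920, 0.65)   # 2920 mailed at a 65% response rate
moe = margin_of_error(n)           # compare against the a priori target of 3
print(n, round(moe, 2))
```

Under these assumptions, the expected number of returns gives a margin of error below the ±3-point target set a priori.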
The decision to set response rate expectations at 65% was based
on a number of factors. First, a review of the survey research literature
suggested that response bias effects are greatly minimized as response
rates exceed 50%20,23.
Second, the confidence intervals around survey values are very small
given the raw number of returns that were anticipated. Last, previous
experience with the National Family Opinion panel suggested that
additional surveying of nonresponders would not yield a sufficient
number of returns to justify the additional cost.
Data Analyses
The first phase of data manipulation involved running frequencies
on all data points. This was done in order to identify the percentages
of missing or out-of-range values. All out-of-range values were
assigned to a missing-value category. Scoring algorithms and validity
tests on all scaled items required item completion criteria to be
met before scoring could be completed. Each of the AAOS instruments
contains items that are scored individually. However, the majority
of the questionnaire items are aggregated into conceptually distinct scales
designed to measure physical and mental functioning of the patient
and symptom status. With the exception of the comorbidities scale
(described below) and single-item measures, each scale is composed
of the summated mean scores from related items.
Summative scale scores were calculated only for individuals who
answered at least half of the items of a scale (or half plus one
for scales with an odd number of items). With the exception of the
hip and knee function and limitation scale, the global foot and
ankle scale, and the DASH module, items of a scale that were not
completed by respondents were calculated into the mean score for
that scale. For the DASH instrument, if 10% or more of the items
of any scale were missing, that individual’s scale scores
were treated as missing values. If <10% of the
items were missing, the rest of the items were scored and averaged,
that mean score was imputed to the missing items and rounded to
the nearest integer, and the scale score was then calculated. Each
scale was calibrated to a 100-point metric scored from 0 to 100.
The DASH and SMFA were scored so that 0 represented the least disability
or best health and 100, the most disability or worst health. All
other scales were scored so that 0 represented the worst health
and 100, the best health. Calibrating scores to this metric allowed
for direct comparison with SF-36 scores and was generally easier
to interpret for diverse audiences19.
Tables I through V present the scoring for each of the eleven measures,
including sample sizes, mean scores, and standard deviations.
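The scoring rules described above can be sketched in code. This is a minimal illustration and not the AAOS scoring algorithm: the item coding (1 to 5), the linear rescaling to 0 to 100, the handling of unanswered items by leaving them out of the mean, and the function names are all assumptions made for the example. The completion rule is read here as "at least half of the items, rounded up for scales with an odd number of items."

```python
import math
from typing import Optional, Sequence

Responses = Sequence[Optional[float]]  # None marks a missing or out-of-range item


def rescale(mean: float, low: float, high: float) -> float:
    """Linearly map an item mean from [low, high] onto a 0-to-100 metric."""
    return (mean - low) / (high - low) * 100


def score_scale(responses: Responses, low: float = 1, high: float = 5,
                reverse: bool = False) -> Optional[float]:
    """Score a summated scale only if enough items were answered."""
    answered = [r for r in responses if r is not None]
    required = math.ceil(len(responses) / 2)  # half, or half plus one if odd
    if len(answered) < required:
        return None  # completion criterion not met
    score = rescale(sum(answered) / len(answered), low, high)
    # reverse=True flips the direction, e.g. so that 100 means best health
    return 100 - score if reverse else score


def score_dash_like(responses: Responses, low: float = 1,
                    high: float = 5) -> Optional[float]:
    """DASH-style rule: impute the rounded item mean if <10% are missing."""
    answered = [r for r in responses if r is not None]
    missing = len(responses) - len(answered)
    if missing * 10 >= len(responses):  # 10% or more missing
        return None
    # Round half up to the nearest integer (assumed rounding convention)
    fill = int(sum(answered) / len(answered) + 0.5)
    completed = [fill if r is None else r for r in responses]
    return rescale(sum(completed) / len(completed), low, high)
```

For example, `score_scale([5, 4, 3, None, None])` meets the completion rule with three of five items answered, whereas `score_scale([3, None, None, None])` returns `None`.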
Comorbidity Checklist Scoring
The comorbidity checklist18 component
of the surveys required respondents to provide, for each comorbid
condition listed, a yes-or-no response to three questions: (1) "Do
you have the problem?", (2) "Do you receive treatment
for it?", and (3) "Does it limit your activity?" Each
of these responses is then used to calculate a general comorbidity
index and three subscales composed of scores from related items.
The comorbidity index is calculated as the sum of "yes" responses
(x) across all response options divided by the total number of possible "yes" responses,
or comorbidity index = (x/42) × 100.
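As a short worked example, the index can be computed from the checklist as follows. The denominator of 42 follows from fourteen conditions with three yes-or-no questions each; the data layout and function name are hypothetical.

```python
from typing import Sequence, Tuple

N_CONDITIONS = 14   # conditions on the checklist
N_QUESTIONS = 3     # have it / treated for it / limits activity

# One (has_problem, receives_treatment, limits_activity) tuple per condition
Checklist = Sequence[Tuple[bool, bool, bool]]


def comorbidity_index(checklist: Checklist) -> float:
    """Percentage of possible "yes" responses that were actually "yes"."""
    yes_count = sum(1 for condition in checklist for answer in condition if answer)
    return yes_count / (N_CONDITIONS * N_QUESTIONS) * 100


# A respondent with one treated, activity-limiting condition and nothing else
example = [(True, True, True)] + [(False, False, False)] * 13
print(round(comorbidity_index(example), 1))
```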
Reliability and Validity Analyses
Multitrait-scaling techniques were used to assess the reliability
and validity of the eleven reconfigured AAOS measures. The Multitrait/Multi-Item
Analysis Program (MAP) is a straightforward methodology for scale
analysis24,25. In multitrait scaling,
scale items are evaluated in terms of four scaling criteria: (1)
convergent validity expressed in terms of internal consistency,
(2) item discriminant validity, (3) tests for equal item-total correlations,
and (4) equal variance test of scale items.
Multitrait scaling involves examination of item frequencies, item
and scale descriptive statistics (e.g., mean, standard deviation,
and variance), scale internal consistency estimates, item-scale
correlations (corrected for overlap), and correlations among scales.
Multitrait scaling goes beyond traditional tests of internal consistency
primarily because it tests item discrimination across scales. Thus,
items are evaluated with respect to how well they represent a particular
construct relative to other constructs.
In multitrait scaling analysis, related scale items within a measurement
instrument are summated. These summated rating scales are then statistically
compared with each other in order to test assumptions of validity
and reliability within the instrument. Questions are grouped into
conceptually related scales on the basis of the underlying concept
that they are theoretically intended to measure. In order to preserve
as much of the sample for analysis as possible, mean replacement
of missing data was performed on a case-by-case basis. If an individual
respondent was missing data for less than half of the items within
a given scale, that person’s mean score for the items to
which he or she responded was substituted for all missing data points
within the scale. For individuals for whom more than half of the
scale items were missing, the items that were not missing were assigned
to a missing-value category and thus were excluded from the analysis.
This mean replacement approach was used solely for reliability and
validity testing.
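The mean replacement rule used for reliability and validity testing can be sketched as below. The text does not say how a respondent with exactly half of the items missing was handled; this sketch treats that case as missing, which is an assumption, and the function name is hypothetical.

```python
from typing import List, Optional, Sequence


def mean_replace(responses: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Fill missing items with the respondent's own mean for that scale.

    Applied only when fewer than half of the items are missing;
    otherwise every item in the scale is set to missing.
    """
    answered = [r for r in responses if r is not None]
    missing = len(responses) - len(answered)
    if missing * 2 < len(responses):          # less than half missing
        person_mean = sum(answered) / len(answered)
        return [person_mean if r is None else r for r in responses]
    return [None] * len(responses)            # excluded from the analysis


print(mean_replace([4, None, 2, 3]))
print(mean_replace([4, None, None, None]))
```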
Multitrait scaling analyses were then performed on the eleven survey
instruments on the basis of three conceptual models in order to
assess (1) item internal consistency validity, (2) item discriminant
validity, and (3) internal consistency reliability of the AAOS measures.
Response Rates
The confidence intervals for the eleven instruments ranged
from ±1.6% to ±2.3%,
which met the required confidence interval criterion of ±3% established a
priori. The response rate overall for the eleven measures
was 67.4%, which exceeded the 65% rate expected.
The response rates for adult stand-alone surveys were similarly
higher than the 65% response rate expected. The parent-adolescent
and adolescent survey outgo was larger than that of the stand-alone
surveys, with the mailings reflecting the matching of surveys for
these two groups. However, the anticipated 65% response
rate was not obtained for the parent-child surveys (61.4%)
or the parent-adolescent-matched surveys (62.9%). Examination
of the response frequencies for these surveys to identify possible response
bias effects due to differences in the characteristics of responders
and nonresponders did not reveal evidence of systematic biasing
effects. Therefore, the decision not to mail additional surveys
to nonresponders was based on an assumption that an additional mailing
to parents would not yield meaningful gains in terms of statistical
power or precision in that the reduction of the margin of error
for the parent studies would be <1% (~0.007).
Scale Scores
Tables I through V present the sample sizes, mean converted scaled
scores, and standard deviations for each of the eleven AAOS measures.
Norm-based scores within age, gender, ethnicity, and other relevant
demographic markers are available through the AAOS.
Reliability and Validity Tests
Alpha Reliability and Item-to-Scale Pearson’s Product-Moment Correlation Coefficients
It would be very difficult, at best, to report all of the statistics calculated
to test the reliability and validity of eleven separate multiple-scale
instruments in a traditional results section format. Summary descriptions
and statistics for the analyses conducted are presented below. Table VI displays the number of scales within each measure, the range
of Cronbach’s alpha coefficients for each summated scale
within each measure, and the range of Pearson’s product-moment
item-to-scale correlation coefficients corrected for overlap.
Cronbach’s alpha, expressed as a coefficient between
0 and 1, is a measure of the degree to which a set of items (e.g.,
items in a scale) measure a single unidimensional latent construct, such
as stress or pain26. When data
have a multidimensional structure, Cronbach’s alpha usually
is low. Cronbach’s alpha for scale items that together
are tapping a unidimensional construct is high, reflecting high
internal consistency among scale items. As is reported below, all
of the AAOS instruments exhibited high alpha reliability in that
their alpha coefficients all exceeded 0.80 with the exception of
one scale in the parent-child cluster.
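For readers who want to see the calculation, Cronbach's alpha can be computed from the item variances and the variance of the summed scale. The sketch below uses the standard formula with sample variances; the toy data are invented for illustration and are not study data.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for complete data.

    rows: one list of item scores per respondent (no missing values).
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented three-item scale answered by five respondents
toy_rows = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 4]]
alpha = cronbach_alpha(toy_rows)
print(round(alpha, 2))
```

Items that rise and fall together across respondents, as in the toy data, yield an alpha close to 1.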
The correlation between two variables reflects the degree to which
variables are related, that is, the extent to which two variables
covary23. The most common measure
of correlation is the Pearson product-moment correlation usually
designated by the letter "r" and sometimes called "Pearson’s
r." Pearson’s correlation values, ranging from +1
to −1, reflect the degree of the linear relationship between
two variables. A correlation of +1 means that there is
a perfect positive linear relationship between variables.
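An item-to-scale correlation "corrected for overlap" is commonly computed by correlating an item with the sum of the other items in its scale, so that the item does not contribute to both sides of the correlation. The sketch below assumes that definition; the function names and toy data are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corrected_item_scale_r(rows: Sequence[Sequence[float]], item: int) -> float:
    """Correlate one item with the sum of the remaining items in its scale."""
    item_scores = [row[item] for row in rows]
    rest_scores = [sum(row) - row[item] for row in rows]
    return pearson_r(item_scores, rest_scores)


# Invented three-item scale answered by five respondents
toy_rows = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 4]]
r_item0 = corrected_item_scale_r(toy_rows, 0)
print(round(r_item0, 2))
```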
Item Discriminant Validity Tests
In MAP (Multitrait/Multi-Item Analysis Program) scaling, discriminant
validity assesses the extent to which correlations of items to their
own scales are higher than their correlations to other scales24. In MAP scaling, item internal consistency
of 90% is scored as satisfactory. Scaling success is achieved
when 80% of the item-to-scale correlations in the total
data set and within each individual scale are greater than two standard
errors (Table VI).
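One way to read the discriminant test is as a series of pairwise comparisons: an item "succeeds" against a competing scale when its correlation with its own scale exceeds its correlation with the competing scale by more than two standard errors. The sketch below follows that reading. The approximation of the standard error as 1/sqrt(n), the function name, and the example correlations are assumptions for illustration, not details taken from the MAP documentation or the study data.

```python
import math
from typing import Sequence


def scaling_success_rate(own_r: Sequence[float],
                         other_r: Sequence[Sequence[float]],
                         n: int) -> float:
    """Proportion of item-versus-competing-scale comparisons that succeed.

    own_r[i]   : item i's correlation with its own scale (overlap-corrected)
    other_r[i] : item i's correlations with each competing scale
    n          : number of respondents, used for the assumed standard error
    """
    standard_error = 1 / math.sqrt(n)  # assumed approximation
    results = [
        (own_r[i] - r) > 2 * standard_error
        for i, competing in enumerate(other_r)
        for r in competing
    ]
    return sum(results) / len(results)


# Invented correlations for three items compared against two competing scales
rate = scaling_success_rate(
    own_r=[0.70, 0.65, 0.30],
    other_r=[[0.40, 0.35], [0.30, 0.20], [0.28, 0.45]],
    n=1898,
)
print(rate >= 0.80)  # compare against the 80% scaling-success threshold
```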
For this normative data-collection project, eleven musculoskeletal
functional outcomes assessment instruments developed through the
AAOS were modified for use in the general population of the United
States. As anticipated, the AAOS scale scores were uniformly
high and skewed toward values representing better health. Given
that the intent of this project was to collect data from the general
population, this outcome is wholly in line with project objectives.
However, an examination of the mean scores reveals meaningful variability,
increasing their utility as comparative measures. When the scores
are examined by quartiles and as percentages of the scores at the
floor and ceiling ranges, it is evident that the normative scores
will be most useful for the assessment of populations whose baseline
scores are meaningfully lower (i.e., representing poorer health).
In other words, important changes will be more difficult to detect
for individuals undergoing treatment or being followed who report
scores toward the end of a given scale representing better health.
A review of the tests reported in the present study indicated that,
without exception, the scales in the surveys used in the normative
data project met assumptions for reliability and validity8. Additionally, standard deviations
of the scores for scale items were roughly equivalent within the
scales (i.e., approximately 1.0 for items with five response options),
precluding the need to standardize individual items before scaling.
Mean scores showed greater variability, which is to be expected
for items measuring physical activities (from self-care to strenuous
activities) because most populations vary in their underlying ability
to perform these activities.
The item-to-scale scores across all eleven surveys revealed that
all but four of the scaled items had higher correlations with their
hypothesized scales than they did with competing scales—that
is, they demonstrated acceptable discriminant validity scores. The
four items that did not have higher correlations with their own
scales were on the parent-child survey. Three items asked the parents
to rate the degree of ease or difficulty for their child to (1)
use a fork and spoon, (2) put on socks, and (3) turn a doorknob.
The fourth asked how often the child used an assistive device for
walking or climbing.
All but four items met the 0.40 standard for internal consistency.
The four items, also on the parent-child survey, yielded lower-than-desired
scores on tests of item internal consistency. Two of the items listed
above were again identified as exhibiting less-than-desirable scaling
scores, along with two others. One of the other items required parents
to estimate how frequently in the past week their child got together
with friends to do things, while the other asked them to estimate how
often their child participated in gym or recess in the past week.
All of the items that failed to meet minimum scaling standards are
found on one instrument, the pediatric parent-child survey. This
is noteworthy because a parent completed this survey for a child
two to ten years of age. One might anticipate greater variability
for responses that estimate another person’s functional
status. Parents are more likely to overestimate or underestimate
their child’s physical functioning capabilities for any number
of reasons. Additionally, it is reasonable to expect that young
children will vary greatly in their ability to perform even very
simple tasks. However, despite these scaling failures, none of the
items substantially reduced the reliability or validity scores for
the scales in which they were embedded. Each scale in the parent-child
survey still met minimum standards for reliability and validity
despite the scaling failure of these few items.
Although a number of differences were found among the various
categories of response rates examined, there was no clear-cut evidence
of systematic response bias between responders and nonresponders.
In summary, the present study describes valid normative data for
a series of questionnaires that address issues of musculoskeletal
health. Such data provide a valuable baseline and permit age and
gender adjustments of data collected with use of these questionnaires.
The results of studies with use of these questionnaires can then
be placed in a firmer context, which should, in turn, allow more
clinically relevant conclusions to be drawn.