Abstract
Background: The reproducibility and repeatability of modern systems
for classification of thoracolumbar injuries have not been sufficiently
studied. We assessed the interobserver and intraobserver reproducibility of
the AO (Arbeitsgemeinschaft für Osteosynthesefragen) classification and
compared it with that of the Denis classification. Our purpose was to
determine whether the newer, AO system had better reproducibility than the
older, Denis classification.
Methods: Anteroposterior and lateral radiographs and computerized
tomography scans (axial images and sagittal reconstructions) of thirty-one
acute traumatic fractures of the thoracolumbar spine were presented to
nineteen observers, all trained spine surgeons, who classified the fractures
according to both the AO and the Denis classification systems. Three months
later, the images of the thirty-one fractures were scrambled into a different
order, and the observers repeated the classification. The Cohen kappa
(κ) test was used to determine interobserver and intraobserver
agreement, which was measured with regard to the three basic classifications
in the AO system (types A, B, and C) as well as the nine subtypes of that
system. We also measured the agreement with regard to the four basic types in
the Denis classification (compression, burst, seat-belt, and
fracture-dislocation) and with regard to the sixteen subtypes of that
system.
Results: The AO classification was fairly reproducible, with an
average kappa of 0.475 (range, 0.389 to 0.598) for the agreement regarding the
assignment of the three types and an average kappa of 0.537 for the agreement
regarding the nine subtypes. The average kappa for the agreement regarding the
assignment of the four Denis fracture types was 0.606 (range, 0.395 to 0.702),
and it was 0.173 for agreement regarding the sixteen subtypes. The
intraobserver agreement (repeatability) was 82% and 79% for the AO and Denis
types, respectively, and 67% and 56% for the AO and Denis subtypes,
respectively.
Conclusions: Both the Denis and the AO system for the classification
of spine fractures had only moderate reliability and repeatability. The
tendency for well-trained spine surgeons to classify the same fracture
differently on repeat testing is a matter of some concern.
The ideal system for spine fracture classification should serve as a
method for accurate communication between treating physicians. It should
provide information about the severity of the injury and the pathogenesis or
mechanism of the injury, and it should guide the choice of treatment. The
system should be easy to recall, with consistent characteristics based on
precise and descriptive terminology [1]. Categories or types should be
distinct enough to describe different natural histories and treatment
implications [2].
Most spine injury classification systems are based primarily on the
presumed mechanisms of injury and the radiographic patterns of disruption. One
of the most frequently employed systems is that of Denis [3], in which the
mode of failure of the so-called middle column [4] of the
vertebrae is used to classify the fracture type and then to predict any
subsequent risk of instability and/or neurological injury. In the Denis
scheme, spinal injuries are classified into four different types: compression
fractures, burst fractures, seat-belt-type injuries, and
fracture-dislocations. Each of these types is then divided into three, four,
or five groups, with sixteen subtypes in total.
In 1994, a comprehensive classification scheme known as the AO
(Arbeitsgemeinschaft für Osteosynthesefragen) system was introduced by
Magerl et al. [1], on
the basis of a ten-year review of 1445 thoracolumbar injuries. It is based on
a progressive scale of increasing morphological damage and morbidity. It
consists of three main fracture types—A (compression), B (distraction),
and C (fracture-dislocation) injuries—each of which is divided into
three subtypes. Each of those three subtypes is divided into up to three
subgroups, and each of those subgroups is divided into up to three
subdivisions. Type-A injuries affect only the anterior column, and type-B and
type-C injuries affect both the anterior and the posterior column. Thus,
severity increases from type A to B to C as well as within the subtypes,
subgroups, and subdivisions. It is a very inclusive, albeit complex,
system.
There have been few studies evaluating the reproducibility and reliability
of systems for the classification of spine injury [5,6].
We sought to measure the interobserver reliability and intraobserver
repeatability of the AO and Denis classification systems to determine their
usefulness in fracture management. Our hypothesis was that use of the more
comprehensive AO classification would result in better agreement among
reviewers of radiographs and have more consistent repeatability than the Denis
classification.
Nineteen fellowship-trained spine surgeons evaluated data on
thirty-one separate injuries of the thoracolumbar spine. Thirteen of the
surgeons were orthopaedic surgeons, and six were neurosurgeons. All
participants received an anonymously labeled computerized disk containing the
histories and results of the physical examinations of all thirty-one patients.
Each case file also contained standard anteroposterior and lateral radiographs
of the spine made with the patient supine as well as computerized tomography
scans that included axial images made at 2-mm intervals as well as standard
sagittal plane reconstructions at 3-mm sections.
Each participant was well versed in the two classification schemes and was
provided with the original articles by Denis [3] and Magerl et al. [1]. The
participants classified the fractures according to each system (see Appendix)
on separate forms. The fractures ranged from T2 superiorly to L5 inferiorly.
Ten of the thirty-one fractures were in the thoracic spine (T2 to T10),
thirteen were within the thoracolumbar junction (T11 to L2), and eight were
within the lower lumbar spine (L3, L4, and L5). The forms were then mailed to
the senior author (K.B.W.), who analyzed the data to determine the
interobserver agreement.
Approximately three months later, the same participants received a second
computerized disk with the original files scrambled randomly into a different
order, but with the same presentation. Each surgeon, again blinded to the
patients' identities, once more classified the fractures according to each
system. These data were used to determine the intraobserver agreement.
The interobserver agreement for the AO system was first measured at the
type level (A, B, or C), then at the subtype level (nine subtypes: A1, A2, A3,
B1, B2, B3, C1, C2, and C3), and finally at the subdivision level (fifty-five
subdivisions: for example, A1.2.3, B3.2.1, and so on). The interobserver
agreement for the Denis classification was first measured at the type level
(compression, burst, seat-belt, and fracture-dislocation) and then at the
subtype level (sixteen subtypes: for example, compression A, burst B, and so
on). There is no third or fourth level or subdivision in the Denis system.
Statistical Methods
Statistical analysis was performed with use of the Cohen kappa (κ)
test to determine both interobserver and intraobserver agreement for the two
systems. As in previous studies [6], we used the guidelines of Landis and
Koch [7] to categorize
the kappa values, with 0.00 to 0.20 indicating slight reliability; 0.21 to
0.40, fair reliability; 0.41 to 0.60, moderate reliability; 0.61 to 0.80,
substantial agreement; and 0.81 to 1.00, almost perfect agreement. It has been
suggested that fracture classification systems should ideally have a κ
value of >0.55 for interobserver reliability [6,8].
Intraobserver error was measured with use of κ values as well as the
percent agreement between the two readings of the nineteen observers.
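The statistical approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names and the toy ratings are hypothetical, the "poor" label for negative values comes from Landis and Koch rather than from the text above, and averaging the kappa over all pairs of observers is only an assumption about how a single average and range might be obtained for nineteen observers.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who classified the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: share of cases given the same label by both raters.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement, computed from each rater's marginal label counts.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined (0/0). Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch categories quoted in the text."""
    if kappa < 0:
        return "poor"  # assumption: category for negative values, not quoted above
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


def pairwise_kappa_summary(ratings_by_observer):
    """Mean, minimum, and maximum kappa over all pairs of observers.

    Assumption: pairwise averaging is one common way to summarize agreement
    among more than two raters; the article does not state its exact procedure.
    """
    names = sorted(ratings_by_observer)
    if len(names) < 2:
        raise ValueError("need at least two observers")
    kappas = [cohen_kappa(ratings_by_observer[a], ratings_by_observer[b])
              for a, b in combinations(names, 2)]
    return sum(kappas) / len(kappas), min(kappas), max(kappas)


# Hypothetical toy data: three observers, six fractures, AO types only.
toy = {
    "observer_1": ["A", "A", "B", "C", "A", "B"],
    "observer_2": ["A", "A", "B", "C", "B", "B"],
    "observer_3": ["A", "B", "B", "C", "A", "A"],
}
mean_k, low_k, high_k = pairwise_kappa_summary(toy)
print(f"mean kappa {mean_k:.3f} (range, {low_k:.3f} to {high_k:.3f}): "
      f"{landis_koch(mean_k)}")
```

Note that a strict threshold lookup of this kind places a value such as 0.606 in the "substantial" band, although it lies only just above the boundary with "moderate".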
When the AO classification scheme was used, forty-eight different
fracture patterns were recorded by at least one surgeon. Sixty-six percent
(388) of the 589 classifications were described as type A (compression); 24%
(142), as type B (distraction); and 10% (fifty-nine), as type C (rotational
fracture-dislocation). Two hundred and eighty-nine type-A classifications (49%
of the total of 589 classifications) were considered to be of the simple
compression-fracture subtype (A1 or A2) and 100 (17% of 589) were of the burst
subtype (A3). The most common fracture subdivision was A1.2.1 (superior wedge
compression fracture), used 15% of the time, followed by A3.1.1 (superior
incomplete burst fracture), used 11% of the time.
Forty-one percent (242) of the 589 Denis classifications were designated as
the burst type, and 31% (183), as the compression type. The two together total
72%, which is similar to the 66% rate of type-A classifications with the AO
system. However, the observers labeled a compression-type injury as a burst
fracture more frequently with the Denis system than they did with the AO
system (41% compared with 17%) (p = 0.001).
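The tallies reported in the two preceding paragraphs follow from nineteen observers each classifying thirty-one fractures. The short sketch below, in which the variable and function names are hypothetical, recomputes the total and the rounded percentages from the counts given in the text; it does not attempt to reproduce the reported p value.

```python
OBSERVERS = 19
FRACTURES = 31
TOTAL = OBSERVERS * FRACTURES  # one classification per observer per fracture

# Counts as reported in the text.
ao_type_counts = {"A": 388, "B": 142, "C": 59}
denis_type_counts = {"burst": 242, "compression": 183}


def percent(count, total=TOTAL):
    """Share of all classifications, rounded to the nearest whole percent."""
    return round(100 * count / total)


ao_percent = {name: percent(count) for name, count in ao_type_counts.items()}
denis_percent = {name: percent(count) for name, count in denis_type_counts.items()}
# Burst and compression combined, as compared with type A in the text.
denis_combined = percent(sum(denis_type_counts.values()))

print(TOTAL, ao_percent, denis_percent, denis_combined)
```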
When the interobserver agreement was evaluated for the AO system at its
simplest level, i.e., according to type (A, B, or C), the κ
value was 0.475 (range, 0.389 to 0.598). This is considered moderate
reliability (Figs. 1-A through 1-C and 2-A through 2-E). When the observers used
the Denis scheme at its simplest level, also according to type
(compression, burst, seat-belt, or fracture-dislocation), the κ
value was 0.606 (range, 0.395 to 0.702). This is considered to border between
moderate and substantial agreement.
When the level of agreement for the AO system was analyzed according to the
first subtype level (A1, A2, A3, B1, and so on), the κ value was 0.537
(range, 0.331 to 0.685), which indicated moderate reliability. When the level
of agreement was analyzed for the subtypes within the Denis system
(compression A through D, burst A through E, and so on), the κ value was
0.173 (range, 0 to 0.485), which indicated slight reliability.
Three months after their first assessments, the nineteen surgeons returned
their classifications of the randomly reordered images of the fractures so
that the test-retest intraobserver repeatability of the two schemes could be
assessed. There was 82% intraobserver agreement, with a κ value of 0.63
(substantial reliability), regarding the AO classification of the fractures as
type A, B, or C. The percent agreement regarding the AO subtypes (A1, A2, and
so on) was 67%. The Denis classification scheme demonstrated a similar
repeatability of 79%, with a κ value of 0.71 (substantial reliability),
when the observers used it to rate the fractures according to type
(compression, burst, and so on). At the subtype level (compression A through
D, burst A through E, and so on) the percent agreement was 56%. At the subtype
level, neither the AO nor the Denis classification allowed sufficient
comparison to render a meaningful kappa statistic.
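Test-retest percent agreement of the kind reported above can be illustrated as follows. This is a sketch under stated assumptions: the two readings are invented data for a single hypothetical observer, the helper names are not from the article, and the article does not specify whether its percentages were pooled across all observers or averaged per observer. Collapsing an AO subtype code such as "A3" to its type "A" shows why agreement at the type level can never be lower than agreement at the subtype level.

```python
def percent_agreement(first_reading, second_reading):
    """Percentage of cases given the same label on both readings."""
    if len(first_reading) != len(second_reading) or not first_reading:
        raise ValueError("need two non-empty readings of equal length")
    matches = sum(a == b for a, b in zip(first_reading, second_reading))
    return 100 * matches / len(first_reading)


def ao_type(subtype_code):
    """Collapse an AO subtype code (for example, 'A3') to its type ('A')."""
    return subtype_code[0]


# Hypothetical readings by one observer, three months apart, in matching order.
first = ["A1", "A3", "B2", "C1", "A3", "B1", "A1", "A2", "C2", "A3"]
second = ["A1", "A2", "B2", "C1", "A3", "B2", "A1", "A3", "B2", "A3"]

subtype_agreement = percent_agreement(first, second)
type_agreement = percent_agreement([ao_type(c) for c in first],
                                   [ao_type(c) for c in second])
print(f"subtype level: {subtype_agreement:.0f}%, type level: {type_agreement:.0f}%")
```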
Classification schemes should be useful to clinicians who are
interested in not only providing treatment, but also understanding the basic
pathomechanisms involved in an injury. In fractures of the thoracolumbar
spine, however, so many variables are involved in describing the anatomy and
pathomechanisms, ranging from individual soft tissue and osseous disruption to
the severity of the impact force at the time of injury, that most
classification systems have problems. In general, if the scheme is exceedingly
simple, there is loss of information. If the system is all-inclusive, it
becomes difficult to use and poorly reproducible between observers.
In our study, well-trained spine surgeons demonstrated only moderate to
borderline substantial reliability (κ = 0.475 for the AO types and 0.606 for
the Denis types) when they used either scheme at its simplest
level to classify thoracic and lumbar fractures. When they used the schemes to
classify the fractures with more sophistication, the AO system was found to be
only moderately reliable and the Denis system was only slightly reliable.
The balance between simplicity and inclusiveness is a delicate one. We
believe that, despite its inclusiveness, the AO system is somewhat complicated
to use as it forces the observer to analyze many more variables than were
required by prior systematic, mechanistic systems. On the other hand, with the
Denis classification system, many of the surgeons involved in this study
thought that, because of its simplicity, they had no accurate choice for many
of the given fractures. Because posterior element disruption can be visualized
more readily with modern technology (computerized tomography) than it could
when the Denis system was first proposed, it may now be difficult to classify
a complex fracture with such a simple scheme.
The results of the intraobserver comparison are of particular interest.
Three months after their first assessments, the nineteen spine surgeons, using
the classification schemes at their simplest level, graded the same fractures
differently from the way they had graded them initially 18% of the time with
use of the AO types and 21% of the time with use of the Denis types.
Our results are similar to those of other studies. Blauth et al. [5] sent fourteen
case files to twenty-two clinics specializing in the treatment of spinal
injury and found relatively high agreement with regard to the classification
of simple compression injuries; however, as the injuries became more complex,
there was more disagreement. Oner et al. [6] studied the
interobserver and intraobserver reliability of both the AO and the Denis
systems by assessing the classifications of fifty-three fractures by five
observers. Use of the AO scheme to assign fracture type was found to have a
relatively low κ value (0.35) for interobserver reliability (fair
reliability) and a κ value of 0.41 for intraobserver reliability
(borderline moderate reliability), whereas use of the Denis system to
determine the fracture type had a κ value of 0.60 for interobserver
reliability (borderline substantial reliability) and 0.45 for repeatability.
Our study of a much larger group of participants, including both orthopaedic
surgeons and neurosurgeons, demonstrated fairly similar results and emphasizes
the difficulties involved with the use of even the most modern classification
systems.
We believe that, in its current form, the AO system is certainly inclusive
but is not sufficiently reliable for comparison of spinal fractures; in
addition, its complexity may be somewhat impractical for day-to-day clinical
use. Furthermore, we believe that the Denis system is simple, but quite
incomplete, and that the importance of imaging of the posterior ligamentous
complex with modern technology is not adequately taken into account.
The relatively low intraobserver reliability is noteworthy. It raises
questions about the accuracy and relevance of clinical studies in which these
classifications were used to assess and treat thoracolumbar injuries.
Additional scrutiny and testing of modern classification schemes should aim to
identify the primary parameters that have been shown, through clinical use and
biomechanical testing, to most accurately predict natural history and guide
treatment options while maintaining reliability.
Figures presenting the details of both fracture classifications are
available with the electronic versions of this article, on our web site (go to
the article citation and click on "Supplementary Material") and on
our quarterly CD-ROM (call our subscription department, at 781-449-9780, to
order the CD-ROM).
The authors did not receive grants or outside funding in support of their
research or preparation of this manuscript. They did not receive payments or
other benefits or a commitment or agreement to provide such benefits from a
commercial entity. No commercial entity paid or directed, or agreed to pay or
direct, any benefits to any research fund, foundation, educational
institution, or other charitable or nonprofit organization with which the
authors are affiliated or associated.
1. Magerl F, Aebi M, Gertzbein SD, Harms J, Nazarian S. A comprehensive classification of thoracic and lumbar injuries. Eur Spine J. 1994;3:184-201.
2. Mirza SK, Mirza AJ, Chapman JR, Anderson PA. Classifications of thoracic and lumbar fractures: rationale and supporting data. J Am Acad Orthop Surg. 2002;10:364-77.
3. Denis F. The three column spine and its significance in the classification of acute thoracolumbar spinal injuries. Spine. 1983;8:817-31.
4. Louis R. [Unstable fractures of the spine. III. Instability. Theories concerning instability]. Rev Chir Orthop Reparatrice Appar Mot. 1977;63:423-5. French.
5. Blauth M, Bastian L, Knop C, Lange U, Tusch G. [Inter-observer reliability in the classification of thoraco-lumbar spinal injuries]. Orthopade. 1999;28:662-81. German.
6. Oner FC, Ramos LM, Simmermacher RK, Kingma PT, Diekerhof CH, Dhert WJ, Verbout AJ. Classification of thoracic and lumbar spine fractures: problems of reproducibility. A study of 53 patients using CT and MRI. Eur Spine J. 2002;11:235-45.
7. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-74.
8. Sanders R. Editorial. The problem with apples and oranges. J Orthop Trauma. 1997;11:465-6.
9. Oner FC. Thoracolumbar spine fractures: diagnostic and prognostic parameters. Nederlands: Universiteit Utrecht; 1956.