Eight members of the Scoliosis Research Society participated in this study. Four surgeons represented two centers, whereas the other four surgeons, one of whom was Dr. H. A. King, were from different scoliosis centers. At the time of the study, all of the participants were active scoliosis surgeons but all had different backgrounds with regard to their training in the operative treatment of scoliosis. Five of us (L. G. L., R. R. B., D. H. C., J. H., and T. G. L.) are members of the Harms International Scoliosis Study Group.
The lead author (L. G. L.) chose twenty-seven full-length, good-quality radiographs that had been made before operative treatment of adolescent idiopathic scoliosis. The images included posteroanterior and lateral radiographs made with the patient standing as well as right and left forced-side-bending radiographs made with the patient supine. All curves had been treated during a two-year time-span during which the lead author personally operated on approximately sixty patients who had adolescent idiopathic scoliosis.
The mean coronal Cobb angle for the twenty-seven curves was 64 degrees (range, 45 to 105 degrees). Six of the twenty-seven thoracic curves had a Cobb angle of more than 75 degrees. Each radiograph was labeled with the Cobb angle for each coronal curve, with the value for the thoracic kyphosis as measured from the cephalad end plate of the fifth thoracic vertebra to the caudad end plate of the twelfth thoracic vertebra, and with the value for the lumbar lordosis as measured from the cephalad end plate of the twelfth thoracic vertebra to the cephalad end plate of the sacrum. All measurements were performed by an experienced scoliosis nurse-clinician and were repeated by the lead author to document the consistency of the measured values.
Each radiograph was then photographed in order to convert it into a standard-size projection-type slide. The slides were projected onto a screen so that the axial skeleton was at least thirty-six inches (91.4 centimeters) long, mimicking the actual size of a long-cassette radiograph. None of the slides were deemed uninterpretable because of poor quality.
Each reviewer was provided with a diagrammatic summary of the five types of curves according to the classification system of King et al. taken directly from the original article9. Also included was a description of the flexibility index, which was defined as the percentage of correction of the thoracic curve subtracted from the percentage of correction of the lumbar curve, as analyzed on side-bending radiographs. In addition, a table that described the five types of curves was included as a reference, and a note describing King's revised criteria for type-II curves was provided as well8. To distinguish between type-II and double-major curves, the reviewers also used comparative analysis of the ratios of the Cobb angles, apical vertebral translation, and rotation between the thoracic and lumbar regions, as well as the flexibility seen on side-bending radiographs12.
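The flexibility index described above can be illustrated with a short sketch. It assumes that the percentage of correction of a curve is the reduction in Cobb angle on the side-bending radiograph expressed as a percentage of the Cobb angle on the standing radiograph; that formula, the function names, and the example angles are illustrative assumptions and are not taken from the original article.

```python
def percent_correction(standing_cobb, bending_cobb):
    """Percentage of correction on side-bending (assumed definition).

    Assumption: correction = (standing - bending) / standing * 100.
    """
    if standing_cobb <= 0:
        raise ValueError("standing Cobb angle must be positive")
    return 100.0 * (standing_cobb - bending_cobb) / standing_cobb


def flexibility_index(thoracic_standing, thoracic_bending,
                      lumbar_standing, lumbar_bending):
    """Lumbar percentage of correction minus thoracic percentage of correction."""
    lumbar = percent_correction(lumbar_standing, lumbar_bending)
    thoracic = percent_correction(thoracic_standing, thoracic_bending)
    return lumbar - thoracic


if __name__ == "__main__":
    # Hypothetical curve: thoracic 50 -> 25 degrees, lumbar 40 -> 10 degrees.
    index = flexibility_index(50, 25, 40, 10)
    print(f"flexibility index: {index:.1f}")
```

Under this assumed definition, a positive index means that the lumbar curve corrected proportionally more than the thoracic curve on side-bending.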
Each reviewer had a preprinted form on which to record his specific classification for each of the curves presented. The reviewers determined whether the curve was type I, II, III, IV, or V; thoracolumbar; lumbar; double major; or other (Trial 1). Five of the reviewers participated in a group setting, during meetings of the Harms International Scoliosis Study Group, and three (K. H. B., H. L. S., and Dr. King) participated on an individual basis at their home institution. All of the reviewers had ample time to review the radiographs; no time-limit was imposed on the review. The five reviewers who participated in the group setting worked independently throughout the review process. No discussion was allowed during the time between the presentations of the curves, and the reviewers sat far enough away from each other so that there was no opportunity to view each other's responses.
The same twenty-seven curves were reviewed again in a different order in a group setting at a second viewing (Trial 2) by the same five reviewers who had participated in a group setting in Trial 1. The reviewers were asked again to classify each curve. Therefore, Trial 2 was used to test the intraobserver reliability (reproducibility) of the results from Trial 1. None of the curves were reviewed by the group as a whole until both trials had been performed.
Kappa statistics were used to analyze the data with both simple and weighted components and were compared with use of 95 per cent confidence intervals established with SAS software (Statistical Analysis System, Cary, North Carolina). The kappa coefficient is the observed proportion of agreement minus the proportion of agreement expected by random chance, divided by the maximum possible agreement beyond chance. Kappa coefficients range in value from +1.0 (perfect agreement) to 0.0 (chance agreement) to -1.0 (less agreement than expected by chance). Kappa statistics were used to quantify both interobserver reliability (the results of the evaluations performed by seven of the reviewers compared with those of Dr. King in Trial 1) and intraobserver reliability (the results in Trial 1 compared with those in Trial 2 for each of the five reviewers who participated in both trials). In addition, kappa coefficients were generated to compare each reviewer's results with those of the other reviewers within the group. According to Svanholm et al., a kappa coefficient of more than 0.75 indicates excellent reliability; a value of 0.50 to 0.75, fair reliability; and a value of less than 0.50, poor reliability.
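As an illustration of the simple (unweighted) kappa coefficient for a pair of raters, a minimal sketch follows. It is not the SAS procedure used in the study and does not cover the weighted component or the confidence intervals; the function name and the example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Simple (unweighted) Cohen's kappa for two equal-length lists of labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed proportion of agreement between the two raters.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1.0 - p_chance)


if __name__ == "__main__":
    # Hypothetical classifications of six curves by two reviewers.
    reviewer_1 = ["II", "II", "III", "III", "V", "I"]
    reviewer_2 = ["II", "III", "III", "III", "V", "II"]
    print(f"kappa = {cohen_kappa(reviewer_1, reviewer_2):.2f}")
```

In this made-up example the two reviewers agree on four of six curves (67 per cent), but the kappa coefficient is lower because part of that agreement would be expected by chance alone.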
The mean interobserver reliability was only 64 per cent (range, 54 to 77 per cent) when the responses of Dr. King were compared with those of the other seven reviewers (a total of 189 comparisons). The mean kappa coefficient for interobserver reliability was 0.49 (range, 0.27 to 0.73). All eight reviewers agreed on the classification of two (7 per cent) of the twenty-seven curves. Two classifications were listed for four curves (15 per cent), three were listed for eighteen curves (67 per cent), four were listed for one curve (4 per cent), and five were listed for two curves (7 per cent).
When the responses of each reviewer were compared with those of the rest of the group (a total of 189 comparisons), the mean interobserver reliability was 55 per cent (range, 33 to 81 per cent) and the mean kappa coefficient was 0.40 (range, 0.21 to 0.63).
The mean intraobserver reliability (a total of 135 comparisons) was 69 per cent (range, 56 to 85 per cent), and the mean kappa coefficient was 0.62 (range, 0.34 to 0.95).
According to the criteria proposed by Svanholm et al., the mean interobserver reliability was poor when the responses of seven of the reviewers were compared with those of Dr. King (κ = 0.49) as well as when the responses of each reviewer were compared with those of the other reviewers (κ = 0.40). The mean intraobserver reliability was fair (κ = 0.62) when the results in Trial 1 were compared with those in Trial 2.
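The thresholds attributed to Svanholm et al. can be written as a small lookup, shown here as a sketch applied to the three mean kappa coefficients reported in this study. The function name is hypothetical; the cut-off values are those quoted in the text.

```python
def svanholm_category(kappa):
    """Map a kappa coefficient to the reliability category quoted in the text."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1.0 and +1.0")
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.50:
        return "fair"
    return "poor"


if __name__ == "__main__":
    reported = {
        "interobserver (vs. Dr. King)": 0.49,
        "interobserver (within group)": 0.40,
        "intraobserver (Trial 1 vs. Trial 2)": 0.62,
    }
    for label, kappa in reported.items():
        print(f"{label}: kappa = {kappa:.2f} -> {svanholm_category(kappa)}")
```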
The most common difficulties encountered by the reviewers included distinction of a type-II curve from a double major curve (seven cases of disagreement); identification of a primary thoracic curve when the fourth lumbar vertebra was tilted into the curve or when the lumbar curve had structural characteristics of its own, as evidenced by rotation of the lumbar curve opposite to that of the thoracic spine (two cases of disagreement); distinction of a type-II curve from a type-III curve on the basis of the amount of rotation of the lumbar curve and its deviation from the midline (seven cases of disagreement); identification of structural cephalad thoracic curves that were not classic type-V curves (seven cases of disagreement); and distinction of a type-I curve from a true double major curve (four cases of disagreement).
Classification schemes help clinicians to organize their thoughts with regard to the type of problem that is being treated and to design appropriate methods of treatment. Thus, classification systems not only organize an approach to a problem and suggest a method of treatment but also may provide an estimate of the outcome of a particular treatment7. It is becoming increasingly clear, however, that various classification schemes in all areas of medicine, including orthopaedic surgery, may not fulfill the basic requirements that are necessary in order for them to be considered valid. Specifically, for a classification system to be useful, intraobserver and interobserver reliability must be proved. Different practitioners must agree on the classification of the data on a particular patient (interobserver reliability), and a practitioner must assign the same classification every time the data on that patient are reviewed (intraobserver reliability or repeatability). Only after intraobserver and interobserver reliability has been confirmed can an attempt be made to prove that a classification scheme is useful for guiding treatment and testing clinical outcomes5. If the validity of widely utilized classification schemes is in doubt, the role of multicenter analyses and the results of studies in which these classification schemes have been used become suspect.
During reviews of radiographs of scoliotic curves at meetings of the Harms International Scoliosis Study Group, it became evident that there was a great deal of disagreement among the surgeons regarding the appropriate classification, according to the system of King et al.9, of most of the curves. This disagreement prompted the current study. The main purpose of the present study was to determine the reliability of the classification of thoracic adolescent idiopathic scoliosis described by King et al. Although this system has been the primary method for the classification of thoracic adolescent idiopathic curves, to our knowledge its reliability has never been determined. In 1991, at the annual meeting of the Scoliosis Research Society, Lonstein reported the results of a multicenter study of twenty-nine scoliotic curves that had had coronal decompensation after arthrodesis with Cotrel-Dubousset instrumentation. In that study, twelve spine surgeons were asked to classify the curves and to select the appropriate levels of arthrodesis on the preoperative radiographs. There was general agreement regarding the classification of eleven (38 per cent) of the curves; however, the agreement varied for the remaining eighteen (62 per cent). No specific interobserver reliability data were provided in that study of obviously difficult curves.
The problem with the reliability of the classification of thoracic adolescent idiopathic scoliosis in the current study is similar to that seen in other recent evaluations of classification schemes used in orthopaedic surgery5,10. Specifically, the interobserver reliability in the present study was only 64 per cent when the responses of seven of the reviewers were compared with those of Dr. King. When the possibility that agreement with regard to the classification was due to statistical chance alone was eliminated, the kappa coefficient was 0.49, which is considered borderline poor reliability21. The findings were similar when the responses of each reviewer were compared with those of the other reviewers: the interobserver reliability was 55 per cent and the kappa coefficient was 0.40, which also indicates poor reliability.
The intraobserver reliability was somewhat better: 69 per cent with a kappa coefficient of 0.62, which is considered fair reliability21.
There were several common problems that appeared to produce the inconsistent responses and suboptimum reliability in the present study. The reviewers found it difficult to distinguish between a type-I curve and a true double major curve (four cases of disagreement). With a type-I scoliosis, the lumbar curve has a greater Cobb angle and is less flexible than the thoracic curve9. Classically, with a double major curve, the thoracic and lumbar curves have nearly equal Cobb angles, apical rotation, and deviation from the midline. However, there are no strict criteria with regard to how different the structural characteristics of the thoracic and lumbar curves must be in order to differentiate a double major curve from a type-I curve. The precise amount by which the Cobb angle of the lumbar curve must exceed that of the thoracic curve in order for the curve to be classified as type I is unclear (Figs. 1-A, 1-B, and 1-C). Often, when the lumbar curve has a greater Cobb angle, the thoracic curve is less flexible because of interposition of the thoracic rib cage and the sternum. The distinction between type-I and double major curves may not be necessary if the curves are to be treated in a similar manner. However, if the treatment modalities differ—for example, if an anterior release and arthrodesis is performed for a larger and stiffer lumbar curve—then this distinction becomes more important, especially when the results of such treatment are analyzed on the basis of the preoperative classification of the curves.
Another frequent problem experienced by the reviewers in the present study involved distinguishing between a type-II curve and a type-III curve on the basis of specific structural characteristics of the lumbar curve when the patient had a larger, more structural thoracic curve (seven cases of disagreement). The difficulty in determining exactly how far the lumbar curve was situated from the midline and how much rotation was present led to disagreement regarding whether these curves should be classified as type II or type III (Figs. 2-A, 2-B, and 2-C). Although the degree of deviation from the midline that distinguishes a type-II curve from a type-III curve was not strictly quantified in the original classification system of King et al.9, in a subsequent article8 King stated that the lumbar curve crosses the midline in type II and does not cross the midline in type III. The clinical examination may also provide important clues in that a lumbar hump is present in a type-II curve and is virtually absent in a type-III curve8.
However, some curves cannot be classified according to this strict definition because the position of the apex of the lumbar curve is somewhere between those of type-II and type-III curves. An example is a 55-degree right thoracic and 40-degree left lumbar curve with the thoracic curve having more deviation and more rotation than the lumbar curve but the medial apex of the lumbar curve lying directly on the center sacral line with contralateral rotation of the lumbar spine compared with that of the thoracic spine (Fig. 3). It is difficult to pinpoint whether that curve should be classified as type II or type III because of the intermediate position of the lumbar portion. This is a challenging curve to evaluate because almost invariably the thoracic curve is more structural than the lumbar curve, even if the curves have similar Cobb angles, because of the rigidity contributed by the thoracic rib cage. However, this type of curve is fairly common and thus must be classified appropriately.
Similarly, our reviewers had difficulty distinguishing a type-II curve from a double major curve (seven cases of disagreement). This difficulty appeared to be due to variability in the comparison of the extent of structural characteristics of the thoracic spine with those of the lumbar spine; in addition, some reviewers placed added importance on the alignment of the thoracolumbar junction in the sagittal plane when deciding whether both the thoracic and the lumbar curve should be considered structural curves. Certainly, there is still controversy regarding the strict definition of a type-II curve that can be successfully treated with selective thoracic arthrodesis4,8,12,15,17,18,22. Thus, the type-II curves that were appropriate for selective thoracic arthrodesis in the coronal plane were often classified as double major if a reviewer believed that both curves needed arthrodesis with instrumentation because of a thoracolumbar kyphosis in the sagittal plane (Figs. 4-A through 4-D). Such a kyphosis was not accounted for in the classification system of King et al.9.
Use of the flexibility index alone can be somewhat misleading because most thoracic curves with a Cobb angle that is nearly equal to that of the lumbar curve are inherently less flexible because of the interposition of the thoracic rib cage and the sternum. Thus, comparative analysis of the ratio of the thoracic Cobb angle to the lumbar Cobb angle, apical vertebral translation, and rotation, as well as the flexibility seen on side-bending radiographs, has also been used to distinguish a type-II curve from a double major curve12,18. However, other authors still classify a curve as type II even if they perform instrumentation and arthrodesis of both the thoracic and the lumbar curve1. Since one of the objectives of any classification system should be to direct decisions regarding treatment7, we recommend classification of a curve as type II when selective thoracic arthrodesis can be performed successfully, as recommended by King et al.9.
Another situation that caused disagreement with regard to the classification was the presence of a primary thoracic curve with the fourth lumbar vertebra tilted into it as well as a lumbar curve that had structural characteristics of its own as evidenced by rotation of the lumbar curve opposite to the rotation of the thoracic spine and inflexibility on side-bending (two cases of disagreement). These curves were often classified as type II, III, or IV by the reviewers (Figs. 5-A, 5-B, and 5-C). Although not very common, this pattern of reversed rotation with little or no contralateral deviation of the lumbar curve from the midline was also not well defined in the classification system of King et al.9.
The reviewers also had trouble identifying structural cephalad thoracic curves that were not classic type-V curves (seven cases of disagreement). According to the classic definition, a patient is considered to have a type-V curve if positive tilt of the first thoracic vertebra is seen on a posteroanterior radiograph, made with the patient standing, and the left shoulder is higher than the right as seen clinically. This dictates treatment of two structural thoracic curves and the inclusion of the cephalad thoracic curve in the instrumentation and arthrodesis of the main thoracic curve. However, it has been shown that the cephalad thoracic curve may be structural even when there is neutral or negative tilt of the first thoracic vertebra and the shoulders are level or the right shoulder is elevated. The diagnosis of a structural cephalad thoracic curve depends on various radiographic and clinical parameters11,14, including a Cobb angle of more than 30 degrees as measured on the posteroanterior radiograph, side-bending flexibility (a residual cephalad thoracic curve of more than 20 degrees), rotation (at least grade 1 according to the system of Nash and Moe), and at least one centimeter of deviation of the apex of the cephalad thoracic curve from the midline. Thus, some structural cephalad thoracic curves are not classic type-V curves. These curves fit more than one classification in the system of King et al.9. In fact, any type of curve may be accompanied by a structural cephalad thoracic curve that needs instrumentation and arthrodesis11,14 (Figs. 6-A, 6-B, and 6-C). It is unclear whether all such curves should be classified as type V or if two classifications (for example, type IV and type V) should be listed because the curve satisfies the criteria for both.
King et al.9 developed their classification system during the era of Harrington instrumentation and arthrodesis. During that time, unidimensional (coronal) assessment was the principal manner in which curves were classified and appropriate treatment was recommended. With the advent of segmental spinal fixation, three-dimensional analysis of scoliosis has become routine. Thus, it appears logical that a classification scheme that promotes three-dimensional analysis of the scoliotic deformity would provide additional information with which to classify these curves and would help physicians to determine appropriate three-dimensional treatment. All of the patients in this study had posteroanterior and lateral radiographs made, while they were standing, with use of a thirty-six by sixteen-inch (91.4-centimeter by 40.6-centimeter) cassette, and all had side-bending flexibility radiographs made, as part of the standard preoperative analysis. Lateral radiographs were not used by King et al. to develop their classification; however, the classification of scoliotic curves and decisions regarding treatment could potentially be based on various features (such as thoracolumbar kyphosis) noted on these radiographs. In addition, there will always be some overlap between the characteristics of thoracic curves and other types of curves (such as primary thoracolumbar or lumbar curves) in adolescent idiopathic scoliosis. Thus, the classification system of King et al. is not inclusive of all types of adolescent idiopathic scoliosis and therefore does not allow comprehensive evaluation of the various patterns that are seen.
The present study has some shortcomings. First, there may have been an overall reviewer bias as five of the eight reviewers were members of a scoliosis study group. However, the reviewers represented a wide geographical distribution, were all active members of the Scoliosis Research Society, and had different training and backgrounds. Second, the curves were preselected by the lead author and many were chosen because he had had difficulty classifying them himself. Thus, it is certainly possible that the reliability data would have been better if more straightforward types of curves had been analyzed. However, all twenty-seven of the curves were treated in a two-year time-span and reflect the type of idiopathic curves that were treated operatively during that time.
One of the strongest criticisms of this study could be that the reviewers lacked an accurate understanding of the proper use of the classification system of King et al.9. In other words, a problem with the reliability of the classification may stem from the education of the reviewers rather than from an inherent weakness of this or other methods of classification. We tried to minimize this possibility by familiarizing the reviewers with the classification both in writing and schematically. The reviewers had used this system to classify all of their patients prospectively for data analysis during the four years preceding the study and thus had used it routinely in their practices. However, it was not determined if the reviewers used the classification in a manner that reflected complete understanding of the system. Incomplete understanding may have contributed to the poor reliability data. Additionally, for practical purposes, slides of radiographs instead of actual long-cassette radiographs were used. Although these slides were of sufficiently good quality, in reality preoperative decisions are usually made on the basis of long-cassette radiographs, not slides of radiographs. In addition, clinical examination, which was not a part of this analysis, can be extremely helpful in determining the type of curve and the ultimate decisions regarding treatment.
The goal of operative treatment of thoracic adolescent idiopathic scoliosis is safe correction of the deformity with spinal instrumentation and a solid fusion. However, the definition of a successful result is quite controversial and depends on both clinical and radiographic parameters. Currently, there is no appropriate radiographic scoring system for the objective comparison of the results of operative treatment of idiopathic scoliosis performed by one surgeon with those of the operations performed by another. If a reliable classification system that is able to direct appropriate treatment is developed, then it will be possible to objectively compare the results of operative treatment of similar curves among surgeons in order to determine the ideal treatment and to direct future outcome studies in this field.
In the present study, we found poor-to-fair interobserver and intraobserver reliability with use of the system of King et al.9 for the classification of thoracic adolescent idiopathic scoliosis. Variable interpretations of the classification were noted. In addition, we identified several recurring problems that led to less-than-desirable reliability, and these inconsistencies may interfere with a thorough evaluation of the spinal deformity. Thus, we recommend caution when the results of operations for thoracic adolescent idiopathic scoliosis are compared on the basis of the classification system of King et al.9. Appropriate analysis of operative outcome data is possible only if the classification of the curves is reliable.
NOTE: The authors thank Bradley Wilson, M.A., Division of Biostatistics, Washington University, for statistical analysis; Howard A. King, M.D., for his participation; and Lutz Biederman, for his support of the study group.