Interobserver reliability for the classification of proximal humeral fractures is limited. The aim of this study was to test the null hypothesis that interobserver reliability of the AO classification of proximal humeral fractures, the preferred treatment, and fracture characteristics is the same for two-dimensional (2-D) and three-dimensional (3-D) computed tomography (CT).
Methods:
Members of the Science of Variation Group, fully trained practicing orthopaedic and trauma surgeons from around the world, were randomized to evaluate radiographs and either 2-D CT or 3-D CT images of fifteen proximal humeral fractures via a web-based survey and respond to the following four questions: (1) Is the greater tuberosity displaced? (2) Is the humeral head split? (3) Is the arterial supply compromised? (4) Is the glenohumeral joint dislocated? They also classified the fracture according to the AO system and indicated their preferred treatment of the fracture (operative or nonoperative). Agreement among observers was assessed with use of the multirater kappa (κ) measure.
Results:
Interobserver reliability of the AO classification, fracture characteristics, and preferred treatment generally ranged from “slight” to “fair.” A few small but statistically significant differences were found. Observers randomized to the 2-D CT group had slightly but significantly better agreement on displacement of the greater tuberosity (κ = 0.35 compared with 0.30, p < 0.001) and on the AO classification (κ = 0.18 compared with 0.17, p = 0.018). A subgroup analysis of the AO classification results revealed that shoulder and elbow surgeons, orthopaedic trauma surgeons, and surgeons in the United States had slightly greater reliability on 2-D CT, whereas surgeons in practice for ten years or less and surgeons from other subspecialties had slightly greater reliability on 3-D CT.
Conclusions:
Proximal humeral fracture classifications may be helpful conceptually, but they have poor interobserver reliability even when 3-D rather than 2-D CT is utilized. This may contribute to the similarly poor interobserver reliability that was observed for selection of the treatment for proximal humeral fractures. The lack of a reliable classification confounds efforts to compare the outcomes of treatment methods among different clinical trials and reports.
Level of Evidence:
Diagnostic Level III. See Instructions for Authors for a complete description of levels of evidence.
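
For readers unfamiliar with the agreement statistic named in the Methods, the following is a minimal sketch of one common multirater kappa, Fleiss' kappa, together with the Landis and Koch descriptors that the terms "slight" and "fair" in the Results correspond to. The abstract does not state which multirater variant the authors used, so the choice of Fleiss' formulation is an assumption, and the function names `fleiss_kappa` and `landis_koch` are hypothetical. The only values taken from the study are the four reported κ values; the example ratings are invented purely for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of subjects, each rated by the same number of raters.

    ratings    -- list of lists; ratings[i] holds one category label per rater
    categories -- list of all possible category labels
    Assumption: Fleiss' formulation; the study may have used another variant.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who put subject i in category j
    counts = []
    for subject in ratings:
        tally = Counter(subject)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean proportion of agreeing rater pairs per subject
    per_subject = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Chance agreement: sum of squared overall category proportions
    total = n_subjects * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(len(categories))
    ]
    p_chance = sum(p * p for p in proportions)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_chance) / (1.0 - p_chance)


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch descriptor."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Invented example: 2 fractures, 3 raters each, question answered yes/no
    demo = [["yes", "yes", "no"], ["yes", "no", "no"]]
    print(fleiss_kappa(demo, ["yes", "no"]))

    # Kappa values reported in the Results
    for k in (0.35, 0.30, 0.18, 0.17):
        print(k, landis_koch(k))
```

Under this mapping, the reported values for displacement of the greater tuberosity (0.35 and 0.30) fall in the "fair" band and those for the AO classification (0.18 and 0.17) fall in the "slight" band, which matches the wording of the Results. The sketch does not reproduce the significance tests comparing the 2-D and 3-D groups.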