The substantial incidence of missed and misdiagnosed femoral neck fractures (FNF) highlights the need for improved diagnostic support. This work evaluates whether artificial intelligence (AI) can reliably identify FNF and compares its diagnostic performance with that of physicians. It also examines how clinicians’ performance changes when they are assisted by AI. A total of 4477 hip radiographs (2884 showing FNF and 1593 appearing normal) were gathered from eight leading tertiary hospitals in China (Union Hospital, Tongji Medical College, Huazhong University of Science and Technology; Wuhan Union Hospital; Wuhan Pu’ai Hospital; Tianyou Hospital, Wuhan University of Science and Technology; Hanyang Hospital, Wuhan University of Science and Technology; Northern Jiangsu People’s Hospital; Xiangya Changde Hospital; People’s Hospital of Tibet Autonomous Region; Second Affiliated Hospital of Soochow University) to form a multicenter dataset. After annotation, the images were divided into 4029 for training and 448 for testing. Faster R-CNN detectors with three backbone networks (VGG16, VGG16-nottop, and ResNet-50) were built and trained. Their performance on the test set (accuracy, sensitivity, specificity, missed-diagnosis rate, misdiagnosis rate, PPV, NPV, and diagnostic time) was benchmarked against that of five clinicians. The model with the top-performing backbone was subsequently provided to the physicians as an aid when they reassessed the test images, to determine the additive value of AI.
Among the models, ResNet-50 yielded the strongest overall results, with VGG16 the weakest and VGG16-nottop intermediate. Reported as ResNet-50 vs VGG16 and VGG16-nottop, the values were: accuracy (0.82 vs 0.58 and 0.76), sensitivity (0.93 vs 0.83 and 0.94), specificity (0.62 vs 0.12 and 0.43), missed-diagnosis rate (0.07 vs 0.17 and 0.06), misdiagnosis rate (0.38 vs 0.88 and 0.57), PPV (0.82 vs 0.63 and 0.75), NPV (0.82 vs 0.28 and 0.81), and diagnostic time (0.02 h vs 0.04 h and 0.03 h). VGG16-nottop was marginally ahead of ResNet-50 only on sensitivity and missed-diagnosis rate. Relative to clinicians, the ResNet-50 model showed higher accuracy and sensitivity, a lower missed-diagnosis rate, faster interpretation, and superior NPV, though it lagged in specificity and misdiagnosis rate; PPV differences were minimal. With AI support, clinicians improved on every evaluated metric. These results suggest that AI is a promising tool for detecting FNF and an effective aid for physicians that improves diagnostic reliability.
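The reported metrics are related through the confusion matrix, and a minimal Python sketch can make those relationships explicit. It assumes the standard definitions, with missed-diagnosis rate taken as the false-negative rate and misdiagnosis rate as the false-positive rate. That assumption is consistent with the reported values, where each pair sums to 1 (for example, sensitivity 0.93 with missed-diagnosis rate 0.07), but the abstract does not state the definitions. The names `ConfusionCounts` and `diagnostic_metrics` are hypothetical, and the counts are illustrative rather than taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a binary FNF / normal classification (hypothetical helper)."""
    tp: int  # fractures correctly flagged
    fp: int  # normal radiographs flagged as fractures
    tn: int  # normal radiographs correctly cleared
    fn: int  # fractures that were missed


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the metrics named in the abstract under standard definitions.

    Assumes at least one positive and one negative case, and at least one
    positive and one negative prediction, so no denominator is zero.
    """
    positives = c.tp + c.fn
    negatives = c.tn + c.fp
    return {
        "accuracy": (c.tp + c.tn) / (positives + negatives),
        "sensitivity": c.tp / positives,
        "specificity": c.tn / negatives,
        # Assumed definitions: complements of sensitivity and specificity.
        "missed_diagnosis_rate": c.fn / positives,
        "misdiagnosis_rate": c.fp / negatives,
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


# Illustrative counts only; these are NOT the study's confusion matrix.
example = ConfusionCounts(tp=90, fp=20, tn=30, fn=10)
metrics = diagnostic_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under these definitions, sensitivity and missed-diagnosis rate always sum to 1, as do specificity and misdiagnosis rate, so each pair carries the same information.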