Paper Title
Robustness of an Artificial Intelligence Solution for Diagnosis of Normal Chest X-Rays
Paper Authors
Paper Abstract
Purpose: Artificial intelligence (AI) solutions for medical diagnosis require thorough evaluation to demonstrate that performance is maintained across all patient subgroups and to ensure that proposed improvements in care will be delivered equitably. This study evaluates the robustness of an AI solution for the diagnosis of normal chest X-rays (CXRs) by comparing performance across multiple patient and environmental subgroups, and by comparing AI errors with those made by human experts.

Methods: A total of 4,060 CXRs were sampled to represent a diverse dataset of NHS patients and care settings. Ground-truth labels were assigned by a panel of three radiologists. AI performance was evaluated against the assigned labels, and subgroup analysis was conducted by patient age and sex, as well as by CXR view, modality, device manufacturer and hospital site.

Results: The AI solution was able to remove 18.5% of the dataset by classifying those scans as High Confidence Normal (HCN). This was associated with a negative predictive value (NPV) of 96.0%, compared to 89.1% for diagnosis of normal scans by radiologists. In all AI false negative (FN) cases, a radiologist was found to have made the same error when compared to the final ground-truth labels. Subgroup analysis showed no statistically significant variations in AI performance, although reduced normal classification was observed in data from some hospital sites.

Conclusion: We show that the AI solution could provide meaningful workload savings by diagnosing 18.5% of scans as HCN, with an NPV superior to that of human readers. The AI solution is shown to perform well across patient subgroups, and error cases were shown to be subjective or subtle in nature.
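The two headline quantities in the abstract, the share of scans removed as HCN and the NPV of those HCN calls, follow from three counts. The sketch below shows how they relate, assuming the usual definition of NPV as true negatives divided by all negative calls, with "negative" meaning a normal (HCN) call. The class name `HcnTriage`, its field names and the toy counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HcnTriage:
    """Counts needed to summarise a High Confidence Normal (HCN) triage step."""

    total_scans: int        # all scans in the evaluated dataset
    hcn_scans: int          # scans the AI labelled HCN (negative calls)
    hcn_truly_normal: int   # HCN scans whose ground-truth label is normal

    def removed_fraction(self) -> float:
        """Share of the dataset removed from the worklist as HCN."""
        return self.hcn_scans / self.total_scans

    def npv(self) -> float:
        """Negative predictive value: true normals among all HCN calls."""
        return self.hcn_truly_normal / self.hcn_scans

    def false_negatives(self) -> int:
        """HCN calls whose ground-truth label is abnormal."""
        return self.hcn_scans - self.hcn_truly_normal


# Toy counts for illustration only; these are NOT the study's data.
toy = HcnTriage(total_scans=1000, hcn_scans=200, hcn_truly_normal=190)
print(f"removed as HCN: {toy.removed_fraction():.1%}")
print(f"NPV of HCN calls: {toy.npv():.1%}")
print(f"false negatives: {toy.false_negatives()}")
```

Under this reading, a higher NPV means fewer abnormal scans among those removed as HCN, which is why the abstract reports the removed share and the NPV together.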