Robustness of an Artificial Intelligence Solution for Diagnosis of Normal Chest X-Rays

Paper title

Robustness of an Artificial Intelligence Solution for Diagnosis of Normal Chest X-Rays

Authors

Dyer, Tom; Smith, Jordan; Dissez, Gaetan; Tay, Nicole; Malik, Qaiser; Morgan, Tom Naunton; Williams, Paul; Garcia-Mondragon, Liliana; Pearse, George; Rasalingham, Simon

Abstract

Purpose: Artificial intelligence (AI) solutions for medical diagnosis require thorough evaluation to demonstrate that performance is maintained for all patient subgroups and to ensure that proposed improvements in care will be delivered equitably. This study evaluates the robustness of an AI solution for the diagnosis of normal chest X-rays (CXRs) by comparing performance across multiple patient and environmental subgroups, as well as comparing AI errors with those made by human experts. Methods: A total of 4,060 CXRs were sampled to represent a diverse dataset of NHS patients and care settings. Ground-truth labels were assigned by a 3-radiologist panel. AI performance was evaluated against the assigned labels, and subgroup analysis was conducted by patient age and sex, as well as by CXR view, modality, device manufacturer and hospital site. Results: The AI solution was able to remove 18.5% of the dataset by classification as High Confidence Normal (HCN). This was associated with a negative predictive value (NPV) of 96.0%, compared to 89.1% for diagnosis of normal scans by radiologists. In all AI false negative (FN) cases, a radiologist was found to have also made the same error when compared to final ground-truth labels. Subgroup analysis showed no statistically significant variations in AI performance, whilst reduced normal classification was observed in data from some hospital sites. Conclusion: We show the AI solution could provide meaningful workload savings by diagnosis of 18.5% of scans as HCN with a superior NPV to human readers. The AI solution is shown to perform well across patient subgroups, and error cases were shown to be subjective or subtle in nature.
