Improving Medical Annotation Quality to Decrease Labeling Burden Using Stratified Noisy Cross-Validation

Paper title

Improving Medical Annotation Quality to Decrease Labeling Burden Using Stratified Noisy Cross-Validation

Authors

Joy Hsu, Sonia Phene, Akinori Mitani, Jieying Luo, Naama Hammel, Jonathan Krause, Rory Sayres

Abstract

As machine learning has become increasingly applied to medical imaging data, noise in training labels has emerged as an important challenge. Variability in diagnosis of medical images is well established; in addition, variability in training and attention to task among medical labelers may exacerbate this issue. Methods for identifying and mitigating the impact of low-quality labels have been studied, but are not well characterized in medical imaging tasks. For instance, Noisy Cross-Validation splits the training data into halves, and has been shown to identify low-quality labels in computer vision tasks, but it has not been applied to medical imaging tasks specifically. In this work we introduce Stratified Noisy Cross-Validation (SNCV), an extension of Noisy Cross-Validation. SNCV can provide estimates of confidence in model predictions by assigning a quality score to each example; stratify labels to handle class imbalance; and identify likely low-quality labels to analyze the causes. We assess performance of SNCV on diagnosis of glaucoma suspect risk from retinal fundus photographs, a clinically important yet nuanced labeling task. Using training data from a previously published deep learning model, we compute a continuous quality score (QS) for each training example. We relabel 1,277 low-QS examples using a trained glaucoma specialist; the new labels agree with the SNCV prediction over the initial label >85% of the time, indicating that low-QS examples mostly reflect labeler errors. We then quantify the impact of training with only high-QS labels, showing that strong model performance may be obtained with many fewer examples. By applying the method to randomly sub-sampled training datasets, we show that our method can reduce labeling burden by approximately 50% while achieving model performance non-inferior to using the full dataset on multiple held-out test sets.
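
The abstract describes the procedure only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a quality score defined as the probability that a model trained on one half assigns to the given label of an example in the other half; the abstract does not state how QS is actually computed. The nearest-centroid "model", the 1-D features, and every function name here are hypothetical stand-ins for the deep learning model and retinal images used in the paper.

```python
import math
import random
from collections import defaultdict

def stratified_halves(labels, seed=0):
    """Split example indices into two halves, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    half_a, half_b = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        mid = len(idxs) // 2
        half_a.extend(idxs[:mid])
        half_b.extend(idxs[mid:])
    return half_a, half_b

def fit_centroids(features, labels, idxs):
    """Toy stand-in model (assumption): the per-class mean of a 1-D feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for i in idxs:
        sums[labels[i]] += features[i]
        counts[labels[i]] += 1
    return {c: sums[c] / counts[c] for c in sums}

def predict_proba(centroids, x):
    """Numerically stable softmax over negative squared distance to each centroid."""
    logits = {c: -(x - m) ** 2 for c, m in centroids.items()}
    top = max(logits.values())
    exps = {c: math.exp(v - top) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def quality_scores(features, labels, seed=0):
    """Score each example with a model trained only on the other half."""
    half_a, half_b = stratified_halves(labels, seed)
    qs = [0.0] * len(labels)
    for train, held_out in ((half_a, half_b), (half_b, half_a)):
        model = fit_centroids(features, labels, train)
        for i in held_out:
            # Assumed QS: probability the held-out model gives to the assigned label.
            qs[i] = predict_proba(model, features[i]).get(labels[i], 0.0)
    return qs

# Synthetic data: class 0 clusters near 0, class 1 near 3.
# The last example (index 12) is deliberately mislabeled as class 0.
features = [0.0, 0.2, -0.1, 0.1, 0.3, -0.2, 3.0, 2.9, 3.2, 3.1, 2.8, 3.3, 3.1]
labels = [0] * 6 + [1] * 6 + [0]
qs = quality_scores(features, labels)
lowest = min(range(len(qs)), key=qs.__getitem__)
print(f"lowest-QS example: index {lowest}, label {labels[lowest]}")
```

On this toy data the deliberately mislabeled example receives the lowest score, which is the kind of example one would send for relabeling. Keeping only examples above a chosen QS threshold corresponds to the "high-QS labels" training setup mentioned in the abstract; the threshold itself would have to be chosen empirically.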
