Paper Title
Is one annotation enough? A data-centric image classification benchmark for noisy and ambiguous label estimation
Paper Authors
Paper Abstract
High-quality data is necessary for modern machine learning. However, acquiring such data is difficult because human annotations are noisy and ambiguous, and aggregating such annotations to determine the label of an image lowers data quality. We propose a data-centric image classification benchmark with ten real-world datasets and multiple annotations per image, allowing researchers to investigate and quantify the impact of such data quality issues. With this benchmark, we study the impact of annotation costs and (semi-)supervised methods on data quality for image classification by applying a novel methodology to a range of different algorithms and diverse datasets. Our benchmark uses a two-phase approach: a data label improvement method in the first phase and a fixed evaluation model in the second phase. This yields a measure of the relation between input labeling effort and the performance of (semi-)supervised algorithms, enabling deeper insight into how labels should be created for effective model training. Across thousands of experiments, we show that one annotation is not enough and that including multiple annotations allows a better approximation of the real underlying class distribution. We find that hard labels cannot capture the ambiguity of the data, which might lead to the common issue of overconfident models. Based on the presented datasets, benchmarked methods, and analysis, we identify multiple future research opportunities directed at improving label noise estimation approaches, data annotation schemes, realistic (semi-)supervised learning, and more reliable image collection.
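The core distinction the abstract draws, between a single majority-vote "hard" label and a "soft" label aggregated from several annotations, can be illustrated with a minimal sketch. This is not the paper's code; the number of classes, the vote counts, and the function names below are all illustrative assumptions.

```python
# Minimal sketch (illustrative only, not the benchmark's implementation):
# with several annotations per image, the empirical vote distribution
# ("soft label") retains the ambiguity that a single majority-vote
# "hard label" discards.
from collections import Counter

import numpy as np

NUM_CLASSES = 3  # hypothetical number of classes


def soft_label(annotations: list[int], num_classes: int = NUM_CLASSES) -> np.ndarray:
    """Empirical class distribution over the annotators' votes."""
    counts = np.bincount(annotations, minlength=num_classes)
    return counts / counts.sum()


def hard_label(annotations: list[int]) -> int:
    """Single majority-vote label; ties resolve to the first-seen class."""
    return Counter(annotations).most_common(1)[0][0]


# An ambiguous image: 5 of 10 annotators vote class 0, 4 vote class 1, 1 votes class 2.
votes = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2]
print(soft_label(votes))  # [0.5 0.4 0.1] -- keeps the ambiguity
print(hard_label(votes))  # 0             -- discards it entirely
```

Training against the soft target rather than the collapsed hard label is one way the ambiguity can be preserved; a model fit to the hard label above would be pushed toward full confidence in class 0 even though nearly half the annotators disagreed.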