Paper Title
End-to-End Learning from Noisy Crowd to Supervised Machine Learning Models
Authors
Abstract
Labeling real-world datasets is time-consuming but indispensable for supervised machine learning models. A common solution is to distribute the labeling task across a large number of non-expert workers via crowd-sourcing. Because crowd workers vary in background and experience, the obtained labels are highly prone to errors and can even be detrimental to the learning models. In this paper, we advocate using hybrid intelligence, i.e., combining deep models and human experts, to design an end-to-end learning framework from noisy crowd-sourced data, especially in an on-line scenario. We first summarize the state-of-the-art solutions that address the challenges of noisy labels from non-expert crowds and learn from multiple annotators. We show how label aggregation can benefit from estimating the annotators' confusion matrices to improve the learning process. Moreover, with the help of an expert labeler as well as classifiers, we cleanse the aggregated labels of highly informative samples to enhance the final classification accuracy. We demonstrate the effectiveness of our strategies on several image datasets, i.e., UCI and CIFAR-10, using SVM and deep neural networks. Our evaluation shows that our on-line label aggregation with confusion matrix estimation reduces the error rate of labels by over 30%. Furthermore, relabeling only 10% of the data with the expert's labels results in over 90% classification accuracy with SVM.
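To make the idea of label aggregation with confusion matrix estimation concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a batch (not on-line) setting and uses a hard-assignment variant of the classic Dawid-Skene scheme: start from a majority vote, estimate one confusion matrix per annotator against the current aggregate, then re-aggregate by likelihood. The function names, the `-1` marker for a missing label, and the `smoothing` parameter are illustrative assumptions.

```python
import numpy as np


def majority_vote(labels, n_classes):
    """Aggregate by plain voting.

    labels: (n_samples, n_annotators) int array; -1 marks a missing label
    (an assumption of this sketch).
    """
    counts = np.zeros((labels.shape[0], n_classes))
    for k in range(n_classes):
        counts[:, k] = (labels == k).sum(axis=1)
    return counts.argmax(axis=1)


def estimate_confusion(labels, truth, n_classes, smoothing=1.0):
    """Estimate conf[a, t, l] ~ P(annotator a reports l | true class is t).

    `truth` is the current aggregate, used as a stand-in for the unknown
    ground truth. `smoothing` adds pseudo-counts so no entry is zero.
    """
    n_annotators = labels.shape[1]
    conf = np.full((n_annotators, n_classes, n_classes), float(smoothing))
    for a in range(n_annotators):
        seen = labels[:, a] >= 0
        np.add.at(conf[a], (truth[seen], labels[seen, a]), 1.0)
    return conf / conf.sum(axis=2, keepdims=True)


def aggregate(labels, n_classes, n_iter=10):
    """Alternate between confusion estimation and likelihood-based voting."""
    truth = majority_vote(labels, n_classes)
    conf = estimate_confusion(labels, truth, n_classes)
    for _ in range(n_iter):
        # Class prior from the current aggregate, with add-one smoothing.
        prior = np.bincount(truth, minlength=n_classes) + 1.0
        log_post = np.tile(np.log(prior / prior.sum()), (labels.shape[0], 1))
        for a in range(labels.shape[1]):
            seen = labels[:, a] >= 0
            # Column l of conf[a] holds P(report l | t) for every class t.
            log_post[seen] += np.log(conf[a][:, labels[seen, a]]).T
        new_truth = log_post.argmax(axis=1)
        if np.array_equal(new_truth, truth):
            break  # aggregate labels stopped changing
        truth = new_truth
        conf = estimate_confusion(labels, truth, n_classes)
    return truth, conf
```

Calling `aggregate(labels, n_classes)` returns the aggregated labels together with the estimated per-annotator confusion matrices. Annotators whose matrices are close to diagonal effectively receive more weight than under plain voting, which is the intuition behind using confusion matrices during aggregation.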