Paper Title
Stop Oversampling for Class Imbalance Learning: A Critical Review
Paper Authors
Paper Abstract
For the last two decades, oversampling has been employed to overcome the challenge of learning from imbalanced datasets, and many approaches to this challenge have been offered in the literature. Oversampling itself, however, is a concern: models trained on fictitious data may fail spectacularly when applied to real-world problems. The fundamental difficulty with oversampling approaches is that, given a real-life population, the synthesized samples may not truly belong to the minority class. As a result, training a classifier on these samples while pretending they represent the minority class may lead to incorrect predictions when the model is used in the real world. In this paper we analyzed a large number of oversampling methods and devised a new oversampling evaluation system based on hiding a number of majority examples and comparing them to the examples generated by the oversampling process. Using this evaluation system, we ranked all of these methods by their incorrectly generated examples. Our experiments using more than 70 oversampling methods and three imbalanced real-world datasets reveal that all oversampling methods studied generate minority samples that most likely belong to the majority class. Given the data and methods in hand, we argue that oversampling in its current forms and methodologies is unreliable for learning from class-imbalanced data and should be avoided in real-world applications.
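The evaluation idea described in the abstract (hide some majority examples, oversample the minority class, then compare the generated samples with the hidden ones) can be illustrated with a small sketch. Everything concrete below is an assumption made for illustration and is not taken from the paper: the synthetic two-dimensional data, the SMOTE-style interpolation in the hypothetical `interpolate_oversample`, and the nearest-neighbour comparison in the hypothetical `fraction_closer_to_hidden_majority`. The paper's actual comparison rule and ranking procedure may differ.

```python
import math
import random


def interpolate_oversample(minority, n_new, rng):
    """Assumed SMOTE-style generator: each synthetic point lies on the
    segment between a random minority point and its nearest minority
    neighbour."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbour = min(others, key=lambda p: math.dist(base, p))
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def fraction_closer_to_hidden_majority(synthetic, minority, hidden_majority):
    """Assumed comparison rule: share of synthetic points whose nearest
    real example is a hidden majority example rather than a real
    minority example."""
    if not synthetic:
        return 0.0
    wrong = 0
    for s in synthetic:
        d_min = min(math.dist(s, p) for p in minority)
        d_maj = min(math.dist(s, p) for p in hidden_majority)
        if d_maj < d_min:
            wrong += 1
    return wrong / len(synthetic)


# Toy overlapping classes (illustrative data, not from the paper).
rng = random.Random(0)
majority = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
minority = [(rng.gauss(1.5, 1.0), rng.gauss(1.5, 1.0)) for _ in range(20)]

# Hide a subset of majority examples; the oversampler never sees them.
rng.shuffle(majority)
hidden_majority, visible_majority = majority[:50], majority[50:]

# Generate enough synthetic minority points to balance the visible classes.
n_new = len(visible_majority) - len(minority)
synthetic = interpolate_oversample(minority, n_new, rng)

score = fraction_closer_to_hidden_majority(synthetic, minority, hidden_majority)
print(f"{score:.2%} of synthetic samples sit closer to a hidden majority example")
```

A higher fraction means the generator places more "minority" samples in regions occupied by real majority examples it was never shown, which is the failure mode the abstract warns about. Under this assumed rule, different oversamplers could be compared by computing the same fraction for each on the same hidden set.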