Handling Imbalanced Data: A Case Study for Binary Class Problems

Paper Title

Handling Imbalanced Data: A Case Study for Binary Class Problems

Author

Danquah, Richmond Addo

Abstract

For several years, imbalanced data has been one of the major issues in solving classification problems. Because most machine learning algorithms assume by default that the data are balanced, they do not take the class distribution of the samples into account. The results tend to be unsatisfactory and skewed towards the majority class. Consequently, using a model built on imbalanced data without handling the imbalance could be misleading in both practice and theory. Most researchers have applied the Synthetic Minority Oversampling Technique (SMOTE) and the Adaptive Synthetic (ADASYN) sampling approach independently in their work and have not explained the algorithms behind these techniques with computed examples. This paper focuses on both synthetic oversampling techniques and manually computes synthetic data points to make the algorithms easier to understand. We analyze the application of these synthetic oversampling techniques to binary classification problems with different imbalance ratios and sample sizes.
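
To make the idea of manually computing a synthetic data point concrete, the following is a minimal sketch of the interpolation step commonly used in SMOTE: a new point is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbors. The function names (smote_point, k_nearest, smote_sketch), the toy coordinates, and the parameter defaults are illustrative assumptions and are not taken from the paper; ADASYN, which additionally weights how many points are generated per sample, is not shown.

```python
import math
import random


def smote_point(x, neighbor, lam):
    """Interpolate between a minority sample and a minority-class neighbor.

    lam is expected to lie in [0, 1]; lam = 0 returns x, lam = 1 returns neighbor.
    """
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


def k_nearest(minority, i, k):
    """Indices of the k minority samples closest (Euclidean) to sample i."""
    others = [j for j in range(len(minority)) if j != i]
    others.sort(key=lambda j: math.dist(minority[i], minority[j]))
    return others[:k]


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points (hypothetical helper, not a library API)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))           # pick a minority sample
        j = rng.choice(k_nearest(minority, i, k))  # pick one of its k neighbors
        synthetic.append(smote_point(minority[i], minority[j], rng.random()))
    return synthetic


# Hand-checkable case with made-up numbers: halfway between (1, 2) and (3, 6).
example = smote_point([1.0, 2.0], [3.0, 6.0], 0.5)  # [2.0, 4.0]

# Toy minority class (made-up values) and five synthetic points drawn from it.
toy_minority = [[1.0, 2.0], [3.0, 6.0], [2.0, 3.0], [4.0, 5.0]]
toy_synthetic = smote_sketch(toy_minority, n_new=5, k=2, seed=0)
```

Because every synthetic point lies between two existing minority samples, it always falls inside the bounding box of the minority class, which is one way to sanity-check a hand computation.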
