Paper title
ConvGeN: Convex space learning improves deep-generative oversampling for tabular imbalanced classification on smaller datasets
Paper authors
Paper abstract
Data is commonly stored in tabular format. Several fields of research are prone to small, imbalanced tabular datasets. Supervised machine learning on such data is often difficult due to class imbalance. Synthetic data generation, i.e., oversampling, is a common remedy used to improve classifier performance. State-of-the-art linear interpolation approaches, such as LoRAS and ProWRAS, can be used to generate synthetic samples from the convex space of the minority class to improve classifier performance in such cases. Deep generative networks are common deep learning approaches for synthetic sample generation and are widely used for synthetic image generation. However, their potential for synthetic tabular data generation in the context of imbalanced classification has not been adequately explored. In this article, we show that existing deep generative models perform poorly compared to linear-interpolation-based approaches for imbalanced classification problems on smaller tabular datasets. To overcome this, we propose ConvGeN, a deep generative model that combines the idea of convex space learning with deep generative models. ConvGeN learns the coefficients for the convex combinations of the minority class samples, such that the synthetic data is distinct enough from the majority class. Our benchmarking experiments demonstrate that ConvGeN improves imbalanced classification on such small datasets compared to existing deep generative models, while being on par with the existing linear interpolation approaches. Moreover, we discuss how our model can be used for synthetic tabular data generation in general, even outside the scope of data imbalance, and thus improves the overall applicability of convex space learning.
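The idea of sampling from the convex space of the minority class can be illustrated with a minimal sketch. It is not the ConvGeN model: ConvGeN learns the convex coefficients with a deep generative network, whereas this sketch simply draws them at random from a Dirichlet distribution. The function name `convex_oversample`, the neighbourhood size `k`, and the use of Euclidean nearest neighbours are assumptions made purely for illustration.

```python
import numpy as np

def convex_oversample(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic points as random convex combinations of
    minority-class neighbourhoods (illustrative only, not ConvGeN)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = minority.shape[0]
    k = min(k, n)  # a neighbourhood cannot exceed the minority class size
    synthetic = np.empty((n_synthetic, minority.shape[1]))
    weights = np.empty((n_synthetic, k))
    for i in range(n_synthetic):
        # Pick a random minority sample and its k nearest minority neighbours
        # (the sample itself is included, at distance zero).
        anchor = minority[rng.integers(n)]
        dists = np.linalg.norm(minority - anchor, axis=1)
        neighbours = minority[np.argsort(dists)[:k]]
        # Random convex coefficients: non-negative and summing to one.
        # ConvGeN would learn these instead of sampling them.
        w = rng.dirichlet(np.ones(k))
        weights[i] = w
        synthetic[i] = w @ neighbours
    return synthetic, weights
```

Because the coefficients are non-negative and sum to one, every synthetic point lies inside the convex hull of its minority neighbourhood. This is the property that convex-space oversamplers rely on.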