Paper Title
Missing Value Estimation using Clustering and Deep Learning within Multiple Imputation Framework
Paper Authors
Paper Abstract
Missing values in tabular data restrict the use and performance of machine learning, so the missing values must be imputed. The most popular imputation algorithm is arguably multiple imputation by chained equations (MICE), which estimates missing values from linear conditioning on observed values. This paper proposes methods to improve both the imputation accuracy of MICE and the classification accuracy of imputed data by replacing MICE's linear conditioning with ensemble learning and deep neural networks (DNN). The imputation accuracy is further improved by characterizing individual samples with cluster labels (CISCL) obtained from the training data. Our extensive analyses involving six tabular data sets, up to 80% missingness, and three missingness types (missing completely at random, missing at random, missing not at random) reveal that ensemble or deep learning within MICE is superior to the baseline MICE (b-MICE), both of which are consistently outperformed by CISCL. Results show that CISCL plus b-MICE outperforms b-MICE for all percentages and types of missingness. Our proposed DNN-based MICE and gradient boosting MICE plus CISCL (GB-MICE-CISCL) outperform seven other baseline imputation algorithms in most experimental cases. Data imputed by the proposed GB-MICE-CISCL yield higher classification accuracy than data imputed by GB-MICE across all missingness percentages. Results also reveal a shortcoming of the MICE framework at high missingness (>50%) and when the missingness type is not random. This paper provides a generalized approach to identifying the best imputation model for a data set with a given missingness percentage and type.
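The two ideas summarized in the abstract, replacing the linear conditional model inside chained-equation imputation with gradient boosting and adding cluster labels as extra predictors, can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation. The function name `impute_with_cluster_labels`, the choice of k-means on a mean-imputed copy to obtain cluster labels, the one-hot encoding of those labels, and all hyperparameters are assumptions made only for illustration. Note also that `IterativeImputer` as used here returns a single imputed data set rather than multiple imputations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer


def impute_with_cluster_labels(X, n_clusters=3, max_iter=5, seed=0):
    """Chained-equation imputation with a gradient-boosting conditional model,
    augmented with one-hot cluster labels (hypothetical helper, illustration only)."""
    X = np.asarray(X, dtype=float)

    # Assumption: k-means needs complete data, so cluster a mean-imputed copy.
    # Every column is assumed to have at least one observed value.
    X_mean = SimpleImputer(strategy="mean").fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X_mean)
    one_hot = np.eye(n_clusters)[labels]

    # The label columns are fully observed, so they are never imputed
    # themselves but serve as extra predictors for every incomplete feature.
    X_aug = np.hstack([X, one_hot])
    imputer = IterativeImputer(
        estimator=GradientBoostingRegressor(n_estimators=50, random_state=seed),
        max_iter=max_iter,
        random_state=seed,
    )
    X_filled = imputer.fit_transform(X_aug)

    # Drop the appended label columns and return only the original features.
    return X_filled[:, : X.shape[1]]


# Small demonstration on synthetic data with about 20% of entries
# removed completely at random.
rng = np.random.default_rng(0)
X_true = rng.normal(size=(60, 4))
X_true[:, 3] = X_true[:, 0] * 2.0 + X_true[:, 1]  # give the imputer some structure
X_missing = X_true.copy()
X_missing[rng.random(X_true.shape) < 0.2] = np.nan

X_imputed = impute_with_cluster_labels(X_missing)
mask = np.isnan(X_missing)
rmse = np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2))
print(f"RMSE on the masked entries: {rmse:.3f}")
```

Passing a different regressor as `estimator` is how the conditional model would be swapped in this sketch. Comparing the error with and without the appended label columns, across missingness rates and mechanisms, mirrors the kind of comparison the abstract describes.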