Scalable mRMR feature selection to handle high dimensional datasets: Vertical partitioning based Iterative MapReduce framework

Paper title

Scalable mRMR feature selection to handle high dimensional datasets: Vertical partitioning based Iterative MapReduce framework

Authors

Vivek, Yelleti; Prasad, P. S. V. S. Sai

Abstract

When building machine learning models, feature selection (FS) is an essential preprocessing step for handling uncertainty and vagueness in the data. Recently, the minimum Redundancy and Maximum Relevance (mRMR) approach has proven effective in obtaining a non-redundant feature subset. Owing to the generation of voluminous datasets, it is essential to design scalable solutions using distributed/parallel paradigms. MapReduce has proven to be one of the best approaches for designing fault-tolerant and scalable solutions. This work analyses the existing MapReduce approaches for mRMR feature selection and identifies their limitations. In the current study, we propose VMR_mRMR, an efficient vertical partitioning-based approach that uses a memorization approach to overcome the limitations of the extant approaches. Experimental analysis shows that VMR_mRMR significantly outperforms the extant approaches and achieves a better computational gain (C.G). In addition, we conduct a comparative analysis with the horizontal partitioning approach HMR_mRMR [1] to assess the strengths and limitations of the proposed approach.
