Paper Title
Determination of class-specific variables in nonparametric multiple-class classification
Paper Authors
Paper Abstract
As technology has advanced, collecting data via automatic collection devices has become popular, so we commonly face data sets with a lengthy list of variables, especially when these data sets are collected without a specific research goal beforehand. It has been pointed out in the literature that the difficulty of high-dimensional classification problems is intrinsically caused by too many noise variables that are useless for reducing classification error; they offer little benefit for decision-making while increasing complexity and confusion in model interpretation. A good variable selection strategy is therefore a must for using such data well, especially when we expect to use the results in subsequent applications/studies, where model-interpretation ability is essential. Thus, conventional classification measures, such as accuracy, sensitivity, and precision, cannot be the only performance criteria. In this paper, we propose a probability-based nonparametric multiple-class classification method and integrate it with the ability to identify high-impact variables for individual classes, so that we have more information about its classification rule and the character of each class. The proposed method achieves prediction power approximately equal to that of the Bayes rule while retaining the ability of "model interpretation." We report the asymptotic properties of the proposed method and use both synthetic and real data sets to illustrate its properties under different classification situations. We also separately discuss variable identification and training sample size determination, and summarize these procedures as algorithms so that users can easily implement them in different computing languages.
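To make the two ingredients described above concrete, the following is a minimal illustrative sketch, not the authors' actual method: it assumes a product-Gaussian kernel density estimate of each class-conditional density plugged into the Bayes rule as the nonparametric classifier, and a permutation-based measure of each variable's impact on a single class's recall as the class-specific variable-identification step. All function names, the bandwidth choice, and the importance measure are assumptions for illustration only.

import numpy as np

def gaussian_kde_logpdf(x_train, x_query, bandwidth):
    # Product-Gaussian kernel density estimate, evaluated in log space.
    d = x_train.shape[1]
    diff = (x_query[:, None, :] - x_train[None, :, :]) / bandwidth  # (q, n, d)
    log_kernel = -0.5 * np.sum(diff ** 2, axis=2) - d * np.log(bandwidth * np.sqrt(2 * np.pi))
    return np.logaddexp.reduce(log_kernel, axis=1) - np.log(x_train.shape[0])

def predict_proba(X_train, y_train, X_query, bandwidth=0.5):
    # Plug-in posterior: class prior times nonparametric class-conditional density,
    # normalized across classes (an empirical approximation of the Bayes rule).
    classes = np.unique(y_train)
    log_post = np.empty((X_query.shape[0], classes.size))
    for j, c in enumerate(classes):
        Xc = X_train[y_train == c]
        log_prior = np.log(Xc.shape[0] / X_train.shape[0])
        log_post[:, j] = log_prior + gaussian_kde_logpdf(Xc, X_query, bandwidth)
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    return classes, np.exp(log_post)

def class_specific_importance(X_train, y_train, X_val, y_val, target_class,
                              bandwidth=0.5, seed=0):
    # Permutation-style impact of each variable on the recall of one class:
    # variables whose permutation degrades that class's recall the most are
    # flagged as high-impact for that class.
    rng = np.random.default_rng(seed)
    classes, proba = predict_proba(X_train, y_train, X_val, bandwidth)
    pred = classes[np.argmax(proba, axis=1)]
    mask = y_val == target_class
    base = np.mean(pred[mask] == target_class)
    impact = np.zeros(X_val.shape[1])
    for v in range(X_val.shape[1]):
        Xp = X_val.copy()
        Xp[:, v] = rng.permutation(Xp[:, v])
        _, proba_p = predict_proba(X_train, y_train, Xp, bandwidth)
        pred_p = classes[np.argmax(proba_p, axis=1)]
        impact[v] = base - np.mean(pred_p[mask] == target_class)
    return impact

Repeating class_specific_importance over every class yields one ranking of variables per class, which is the kind of class-specific output the abstract describes; the paper's own procedure, bandwidth selection, and training-sample-size determination should be taken from the algorithms reported in the paper rather than from this sketch.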