Title

Simplification of Forest Classifiers and Regressors

Authors

Atsuyoshi Nakamura, Kento Sakurada

Abstract

We study the problem of sharing as many branching conditions of a given forest classifier or regressor as possible while preserving its prediction performance. As a constraint to prevent accuracy degradation, we first require that the decision paths of all the given feature vectors must not change. For a branching condition of the form "the value of a certain feature is at most a given threshold," the set of threshold values satisfying this constraint can be represented as an interval. The problem thus reduces to finding, for each set of branching conditions on the same feature, a minimum set of values that intersects all of the constraint-satisfying intervals. We propose an algorithm for the original problem that uses an efficient algorithm for this interval problem as a subroutine. We later relax the constraint to promote further sharing of branching conditions, either by allowing the decision paths of a certain fraction of the given feature vectors to change, or by allowing a certain number of constraint-satisfying intervals to remain non-intersected. We also extend our algorithm to both relaxations. The effectiveness of our method is demonstrated through comprehensive experiments using 21 datasets (13 classification and 8 regression datasets from the UCI Machine Learning Repository) and 4 classifiers/regressors (random forest, extremely randomized trees, AdaBoost, and gradient boosting).
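
To make the reduction concrete: the per-feature subproblem described above is the classic minimum interval stabbing (piercing) problem, which a greedy scan over right endpoints solves exactly. Below is a minimal Python sketch of that greedy step, not the authors' implementation; the function name min_stabbing_thresholds is illustrative, and it assumes each branching condition has already been mapped to its constraint-satisfying interval (lo, hi).

    # Sketch (assumed, not the paper's code) of the interval subproblem:
    # given, for each branching condition on one feature, the interval of
    # threshold values that keeps every decision path unchanged, find a
    # minimum set of shared thresholds that intersects ("stabs") all the
    # intervals. The standard greedy scans intervals by right endpoint and
    # picks that endpoint whenever the current interval is not yet stabbed.

    def min_stabbing_thresholds(intervals):
        """intervals: list of (lo, hi) pairs with lo <= hi.
        Returns a minimum-size list of values such that every interval
        contains at least one of them."""
        thresholds = []
        last = None  # most recently chosen threshold
        for lo, hi in sorted(intervals, key=lambda iv: iv[1]):
            if last is None or lo > last:  # interval not stabbed yet
                last = hi                  # its right endpoint is the greedy pick
                thresholds.append(last)
        return thresholds

    # Example: three conditions on the same feature whose path-preserving
    # ranges overlap; two shared thresholds suffice instead of three.
    print(min_stabbing_thresholds([(0.2, 0.9), (0.5, 1.4), (1.0, 2.0)]))
    # -> [0.9, 2.0]

Choosing the right endpoint is safe because any value stabbing the interval that ends earliest can be shifted to that endpoint without losing any other interval, which is why the greedy output is minimum-size.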
