Paper Title
数据选择:建立小型可解释模型的一般原则
Data Selection: A General Principle for Building Small Interpretable Models
Paper Authors
Abstract
We present convincing empirical evidence for an effective and general strategy for building accurate small models. Such models are attractive for interpretability and also find use in resource-constrained environments. The strategy is to learn the training distribution and sample accordingly from the provided training data. The distribution learning algorithm is not a contribution of this work; our contribution is a rigorous demonstration of the broad utility of this strategy in various practical settings. We apply it to the tasks of (1) building cluster explanation trees, (2) prototype-based classification, and (3) classification using Random Forests, and show that it improves the accuracy of decades-old weak traditional baselines to be competitive with specialized modern techniques. The strategy is also versatile with respect to the notion of model size. In the first two tasks, model size is the number of leaves in the tree and the number of prototypes, respectively. In the final task, involving Random Forests, the strategy is shown to be effective even when model size comprises more than one factor: the number of trees and their maximum depth. Positive results using multiple datasets are presented and shown to be statistically significant.
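The abstract's core strategy can be sketched as a two-step pipeline: learn a sampling distribution over the provided training data, then draw a small training set from it and fit a compact model. The paper does not specify the distribution learner in this abstract, so the sketch below substitutes a simple stand-in (inverse-class-frequency weights); the function names and the nearest-prototype classifier (task 2 above) are illustrative assumptions, not the paper's actual algorithm.

```python
import random
from collections import Counter

def sampling_weights(labels):
    # Stand-in for the learned training distribution (an assumption):
    # examples from rarer classes receive proportionally higher weight.
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

def sample_small_training_set(X, y, k, seed=0):
    # Draw a small training set of size k according to the weights.
    rng = random.Random(seed)
    w = sampling_weights(y)
    idx = rng.choices(range(len(X)), weights=w, k=k)
    return [X[i] for i in idx], [y[i] for i in idx]

def nearest_prototype_predict(prototypes, labels, x):
    # 1-nearest-prototype classification over the sampled points;
    # here, model size is simply the number of prototypes kept.
    dists = [sum((a - b) ** 2 for a, b in zip(p, x)) for p in prototypes]
    return labels[min(range(len(prototypes)), key=dists.__getitem__)]
```

In this framing, "model size" is the sample size `k`; for the Random Forest task it would instead be a pair (number of trees, maximum depth), but the sample-then-fit structure is unchanged.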