Paper Title

Getting Better from Worse: Augmented Bagging and a Cautionary Tale of Variable Importance

Paper Authors

Lucas Mentch, Siyu Zhou

Paper Abstract


As the size, complexity, and availability of data continues to grow, scientists are increasingly relying upon black-box learning algorithms that can often provide accurate predictions with minimal a priori model specifications. Tools like random forests have an established track record of off-the-shelf success and even offer various strategies for analyzing the underlying relationships among variables. Here, motivated by recent insights into random forest behavior, we introduce the simple idea of augmented bagging (AugBagg), a procedure that operates in an identical fashion to classical bagging and random forests, but which operates on a larger, augmented space containing additional randomly generated noise features. Surprisingly, we demonstrate that this simple act of including extra noise variables in the model can lead to dramatic improvements in out-of-sample predictive accuracy, sometimes outperforming even an optimally tuned traditional random forest. As a result, intuitive notions of variable importance based on improved model accuracy may be deeply flawed, as even purely random noise can routinely register as statistically significant. Numerous demonstrations on both real and synthetic data are provided along with a proposed solution.
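The AugBagg procedure described above can be sketched in a few lines: append randomly generated noise columns to the design matrix, then run classical bagging (bootstrap aggregation of unpruned trees, with every feature eligible at each split) on the augmented space. The sketch below is a minimal illustration, not the authors' implementation; it assumes scikit-learn, uses `RandomForestRegressor` with `max_features=None` as a stand-in for classical bagging, and the function names `fit_aug_bagg` and `predict_aug_bagg` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_aug_bagg(X, y, n_noise=5, n_trees=100, seed=0):
    """Fit bagged trees on a feature space augmented with pure-noise columns."""
    rng = np.random.default_rng(seed)
    # Append n_noise columns of standard Gaussian noise to the original features.
    noise = rng.standard_normal((X.shape[0], n_noise))
    X_aug = np.hstack([X, noise])
    # max_features=None makes every feature a split candidate, so this is
    # classical bagging (not a random forest) on the augmented space.
    model = RandomForestRegressor(
        n_estimators=n_trees, max_features=None, bootstrap=True, random_state=seed
    )
    model.fit(X_aug, y)
    return model


def predict_aug_bagg(model, X, n_noise=5, seed=1):
    """Predict on new data; the noise columns carry no signal, so fresh draws suffice."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((X.shape[0], n_noise))
    return model.predict(np.hstack([X, noise]))
```

In this sketch the noise features are drawn independently of the response, so any improvement in out-of-sample accuracy (or any "importance" the trees assign to those columns) reflects the regularization-like effect the abstract describes, not genuine signal.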
