Paper Title
The Strong Screening Rule for SLOPE
Paper Authors
Paper Abstract
Extracting relevant features from data sets where the number of observations ($n$) is much smaller than the number of predictors ($p$) is a major challenge in modern statistics. Sorted L-One Penalized Estimation (SLOPE), a generalization of the lasso, is a promising method within this setting. Current numerical procedures for SLOPE, however, lack the efficiency that the corresponding tools for the lasso enjoy, particularly when estimating a complete regularization path. A key component in the efficiency of the lasso is the use of predictor screening rules: rules that allow predictors to be discarded before estimating the model. This is the first paper to establish such a rule for SLOPE. We develop a screening rule for SLOPE by examining its subdifferential and show that this rule is a generalization of the strong rule for the lasso. Our rule is heuristic, which means that it may discard predictors erroneously. We present conditions under which this may happen and show that such situations are rare and easily safeguarded against by a simple check of the optimality conditions. Our numerical experiments show that the rule performs well in practice, leading to improvements by orders of magnitude for data in the $p \gg n$ domain, while incurring no additional computational overhead when $n \gg p$. We also examine the effect of correlation structures in the design matrix on the rule and discuss algorithmic strategies for employing the rule. Finally, we provide an efficient implementation of the rule in our R package SLOPE.
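To make the two objects named in the abstract concrete, the following is a minimal Python sketch of (a) the sorted L1 penalty that defines SLOPE and (b) the sequential strong rule for the lasso, which the abstract says the SLOPE rule generalizes. It is not the paper's SLOPE screening rule and not the API of the R package SLOPE. The function names, argument names, and the scaling convention (a lasso objective of the form $\frac{1}{2}\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1$, so that the gradient is not divided by $n$) are assumptions made for illustration.

```python
from typing import List, Sequence


def sorted_l1_norm(beta: Sequence[float], lam: Sequence[float]) -> float:
    """Sorted L1 penalty: sum_i lam[i] * |beta|_(i), where
    |beta|_(1) >= ... >= |beta|_(p) are the magnitudes in decreasing order.

    Hypothetical helper for illustration; lam must be non-negative and
    non-increasing, as in the usual definition of SLOPE.
    """
    if len(beta) != len(lam):
        raise ValueError("beta and lam must have the same length")
    if any(a < b for a, b in zip(lam, lam[1:])) or any(l < 0 for l in lam):
        raise ValueError("lam must be non-negative and non-increasing")
    magnitudes = sorted((abs(b) for b in beta), reverse=True)
    return sum(l * m for l, m in zip(lam, magnitudes))


def lasso_strong_rule(
    grad_prev: Sequence[float], lam_prev: float, lam_next: float
) -> List[int]:
    """Sequential strong rule for the lasso (the special case, not SLOPE).

    grad_prev[j] is assumed to be x_j^T (y - X beta_hat(lam_prev)), the
    correlation of predictor j with the residual at the previous path point.
    Predictor j is kept if |grad_prev[j]| >= 2 * lam_next - lam_prev and
    discarded otherwise. Returns the indices of the kept predictors.
    """
    threshold = 2.0 * lam_next - lam_prev
    return [j for j, g in enumerate(grad_prev) if abs(g) >= threshold]


if __name__ == "__main__":
    # With a constant lam sequence the sorted L1 penalty is lam * ||beta||_1.
    print(sorted_l1_norm([0.5, -2.0, 1.0], [1.0, 1.0, 1.0]))
    # With a decreasing sequence, larger coefficients are penalized more.
    print(sorted_l1_norm([0.5, -2.0, 1.0], [3.0, 2.0, 1.0]))
    # Screening step from lam_prev = 1.0 to lam_next = 0.8 (threshold 0.6).
    print(lasso_strong_rule([0.9, -0.2, 0.55], lam_prev=1.0, lam_next=0.8))
```

Because a strong rule is heuristic, a solver that uses it would refit on the kept predictors and then verify the optimality conditions for the discarded ones, adding back any that violate them; this is the safeguard the abstract refers to.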