Paper title
Robust subset selection
Paper authors
Paper abstract
The best subset selection (or "best subsets") estimator is a classic tool for sparse regression, and developments in mathematical optimization over the past decade have made it more computationally tractable than ever. Notwithstanding its desirable statistical properties, the best subsets estimator is susceptible to outliers and can break down in the presence of a single contaminated data point. To address this issue, a robust adaption of best subsets is proposed that is highly resistant to contamination in both the response and the predictors. The adapted estimator generalizes the notion of subset selection to both predictors and observations, thereby achieving robustness in addition to sparsity. This procedure, referred to as "robust subset selection" (or "robust subsets"), is defined by a combinatorial optimization problem for which modern discrete optimization methods are applied. The robustness of the estimator in terms of the finite-sample breakdown point of its objective value is formally established. In support of this result, experiments on synthetic and real data are reported that demonstrate the superiority of robust subsets over best subsets in the presence of contamination. Importantly, robust subsets fares competitively across several metrics compared with popular robust adaptions of continuous shrinkage estimators.
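To make the idea of selecting subsets of both predictors and observations concrete, the following is a minimal sketch that solves a tiny instance by exhaustive search. It is not the discrete optimization approach the abstract refers to: the function name `robust_subsets_bruteforce`, the parameters `k` (number of predictors kept) and `h` (number of observations kept), the squared-error objective, and the toy data are all assumptions made for illustration, and the enumeration is only feasible for very small problems.

```python
from itertools import combinations

import numpy as np


def robust_subsets_bruteforce(X, y, k, h):
    """Hypothetical helper: try every choice of k predictors and h
    observations and keep the pair with the smallest sum of squared
    residuals. Searching exactly k and exactly h is enough here, because
    adding a predictor or dropping an observation never increases the loss.
    """
    n, p = X.shape
    best_loss, best_beta, best_cols, best_rows = np.inf, None, None, None
    for cols in combinations(range(p), k):
        for rows in combinations(range(n), h):
            X_sub = X[np.ix_(rows, cols)]
            y_sub = y[list(rows)]
            # Ordinary least squares on the selected rows and columns.
            coef, *_ = np.linalg.lstsq(X_sub, y_sub, rcond=None)
            loss = float(np.sum((y_sub - X_sub @ coef) ** 2))
            if loss < best_loss:
                beta = np.zeros(p)
                beta[list(cols)] = coef
                best_loss, best_beta = loss, beta
                best_cols, best_rows = cols, rows
    return best_loss, best_beta, best_cols, best_rows


# Toy data (assumed): y depends only on the first predictor, slope 2.
X = np.array([
    [1.0,  1.0, 0.0],
    [2.0, -1.0, 1.0],
    [3.0,  1.0, 0.0],
    [4.0, -1.0, 2.0],
    [5.0,  1.0, 0.0],
    [6.0, -1.0, 1.0],
    [7.0,  1.0, 0.0],
    [8.0, -1.0, 2.0],
])
y = 2.0 * X[:, 0]
y[0] = 100.0  # a single contaminated response

# h = n keeps every observation, i.e. plain best subsets with k = 1.
loss_bs, beta_bs, cols_bs, rows_bs = robust_subsets_bruteforce(X, y, k=1, h=8)

# h = n - 1 lets the search discard one observation.
loss_rs, beta_rs, cols_rs, rows_rs = robust_subsets_bruteforce(X, y, k=1, h=7)

print("best subsets coefficient:  ", beta_bs)
print("robust subsets coefficient:", beta_rs)
print("observations kept:         ", rows_rs)
```

In this toy setting, the fit that keeps all observations is pulled away from the true slope by the single contaminated response, while the fit allowed to drop one observation excludes that point and recovers the slope. Real problems need a far more scalable solver than this enumeration.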