Paper Title
Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition
Paper Authors
Paper Abstract
Classical learning theory suggests that the optimal generalization performance of a machine learning model should occur at an intermediate model complexity, with simpler models exhibiting high bias and more complex models exhibiting high variance of the predictive function. However, such a simple trade-off does not adequately describe deep learning models that simultaneously attain low bias and variance in the heavily overparameterized regime. A primary obstacle in explaining this behavior is that deep learning algorithms typically involve multiple sources of randomness whose individual contributions are not visible in the total variance. To enable fine-grained analysis, we describe an interpretable, symmetric decomposition of the variance into terms associated with the randomness from sampling, initialization, and the labels. Moreover, we compute the high-dimensional asymptotic behavior of this decomposition for random feature kernel regression, and analyze the strikingly rich phenomenology that arises. We find that the bias decreases monotonically with the network width, but the variance terms exhibit non-monotonic behavior and can diverge at the interpolation boundary, even in the absence of label noise. The divergence is caused by the \emph{interaction} between sampling and initialization and can therefore be eliminated by marginalizing over samples (i.e. bagging) \emph{or} over the initial parameters (i.e. ensemble learning).
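The decomposition described in the abstract can be estimated empirically by Monte Carlo: draw independent training sets (sampling randomness) and independent random feature matrices (initialization randomness), and apply the law of total variance to the resulting predictions. The sketch below is an illustrative toy, not the paper's setup: it uses a small random-feature ridge regression with noiseless labels, and all dimensions, the ReLU feature map, and the ridge value are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_feature_predict(X_train, y_train, X_test, W, ridge=1e-6):
    """Ridge regression on random ReLU features defined by the matrix W."""
    F_train = np.maximum(X_train @ W, 0.0)   # (n, width) feature matrix
    F_test = np.maximum(X_test @ W, 0.0)
    A = F_train.T @ F_train + ridge * np.eye(W.shape[1])
    coef = np.linalg.solve(A, F_train.T @ y_train)
    return F_test @ coef

d, n, width = 5, 40, 30                      # toy sizes (assumed, not from the paper)
n_datasets, n_inits = 8, 8
X_test = rng.standard_normal((1, d))         # single test point
true_w = rng.standard_normal(d)

# Predictions over independent draws of the training data (sampling)
# and of the random features (initialization). Labels are noiseless.
preds = np.zeros((n_datasets, n_inits))
for i in range(n_datasets):
    X = rng.standard_normal((n, d))
    y = X @ true_w
    for j in range(n_inits):
        W = rng.standard_normal((d, width)) / np.sqrt(d)
        preds[i, j] = random_feature_predict(X, y, X_test, W)[0]

total_var = preds.var()
# Variance surviving after ensembling over initializations (average over j):
var_after_ensembling = preds.mean(axis=1).var()
# Variance surviving after bagging over datasets (average over i):
var_after_bagging = preds.mean(axis=0).var()
```

Comparing `var_after_ensembling` or `var_after_bagging` with `total_var` shows how marginalizing over either source of randomness suppresses the variance contributed by the sampling-initialization interaction, mirroring the abstract's claim that bagging *or* ensembling eliminates the divergent term.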