Paper Title
Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate
Paper Authors
Paper Abstract
Understanding the algorithmic bias of \emph{stochastic gradient descent} (SGD) is one of the key challenges in modern machine learning and deep learning theory. Most of the existing works, however, focus on the \emph{very small or even infinitesimal} learning rate regime and fail to cover practical scenarios where the learning rate is \emph{moderate and annealing}. In this paper, we make an initial attempt to characterize the particular regularization effect of SGD in the moderate learning rate regime by studying its behavior in optimizing an overparameterized linear regression problem. In this setting, SGD and GD are known to converge to the unique minimum-norm solution; however, with a moderate and annealing learning rate, we show that they exhibit different \emph{directional biases}: SGD converges along the large-eigenvalue directions of the data matrix, while GD follows the small-eigenvalue directions. Furthermore, we show that this directional bias does matter when early stopping is adopted: the SGD output is nearly optimal, whereas the GD output is suboptimal. Finally, our theory explains several folk heuristics used in practice for SGD hyperparameter tuning, such as (1) linearly scaling the initial learning rate with the batch size; and (2) overrunning SGD with a high learning rate even when the loss stops decreasing.
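To make the claimed directional bias concrete, below is a minimal numerical sketch in Python/NumPy, not the paper's construction or experiments: it runs full-batch GD and single-sample SGD from zero initialization on a synthetic overparameterized linear regression problem with a moderate-then-annealed step size, and reports what fraction of each early-stopped iterate's residual (its distance to the minimum-norm interpolator) lies in the top-eigenvalue subspace of the data matrix. The problem sizes, spectrum, step-size choice, schedule, and stopping point are illustrative assumptions; whether the SGD/GD separation shows up depends on being in the moderate-learning-rate regime the paper analyzes.

import numpy as np

rng = np.random.default_rng(0)
n, d, k_top = 20, 100, 5                     # n samples, d > n features, top-k subspace

# Synthetic data: population covariance U diag(spectrum) U^T with a few large
# eigenvalues, and noiseless labels from a planted parameter (so interpolation holds).
U = np.linalg.qr(rng.standard_normal((d, d)))[0]
spectrum = np.concatenate([np.full(k_top, 10.0), np.full(d - k_top, 0.1)])
X = rng.standard_normal((n, d)) @ (U * np.sqrt(spectrum)) @ U.T
y = X @ rng.standard_normal(d)

# Minimum-norm interpolator: from zero initialization, both GD and SGD stay in
# the row span of X and approach this point.
w_mn = np.linalg.pinv(X) @ y

# Eigen-directions of the empirical covariance X^T X / n (ascending order).
eigvals, eigvecs = np.linalg.eigh(X.T @ X / n)
top = eigvecs[:, -k_top:]                    # large-eigenvalue directions

def top_fraction(w):
    # Fraction of the residual w - w_mn lying in the top-eigenvalue subspace.
    r = w - w_mn
    return np.linalg.norm(top.T @ r) ** 2 / max(np.linalg.norm(r) ** 2, 1e-30)

# A step size that keeps both methods stable: lr0 * ||x_i||^2 <= 1 for every
# sample, which also implies lr0 * lambda_max(X^T X / n) <= 1.
lr0 = 1.0 / np.max(np.sum(X ** 2, axis=1))

def run(stochastic, steps=2000):
    w = np.zeros(d)
    for t in range(steps):
        lr = lr0 if t < steps // 2 else lr0 / 10           # annealed schedule
        if stochastic:
            i = rng.integers(n)                            # single-sample SGD
            grad = (X[i] @ w - y[i]) * X[i]
        else:
            grad = X.T @ (X @ w - y) / n                   # full-batch GD
        w -= lr * grad
    return w                                               # early-stopped iterate

w_sgd, w_gd = run(stochastic=True), run(stochastic=False)
print(f"residual fraction in top subspace  SGD: {top_fraction(w_sgd):.3f}")
print(f"residual fraction in top subspace  GD : {top_fraction(w_gd):.3f}")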