Paper Title

Understanding Edge-of-Stability Training Dynamics with a Minimalist Example

Paper Authors

Xingyu Zhu, Zixuan Wang, Xiang Wang, Mo Zhou, Rong Ge

Paper Abstract

Recently, researchers observed that gradient descent for deep neural networks operates in an ``edge-of-stability'' (EoS) regime: the sharpness (the maximum eigenvalue of the Hessian) is often larger than the stability threshold $2/\eta$ (where $\eta$ is the step size). Despite this, the loss oscillates and converges in the long run, and the sharpness at the end is just slightly below $2/\eta$. While many other well-understood nonconvex objectives, such as matrix factorization or two-layer networks, can also converge despite large sharpness, there is often a larger gap between the sharpness of the endpoint and $2/\eta$. In this paper, we study the EoS phenomenon by constructing a simple function that exhibits the same behavior. We give a rigorous analysis of its training dynamics in a large local region and explain why the final converging point has sharpness close to $2/\eta$. Globally, we observe that the training dynamics of our example exhibit an interesting bifurcating behavior, which has also been observed in the training of neural nets.
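
To make the described behavior concrete, below is a minimal sketch (not the construction analyzed in the paper) that runs gradient descent on the toy nonconvex objective $f(a,b) = \frac{1}{2}(ab-1)^2$ and tracks the sharpness (largest Hessian eigenvalue) against the stability threshold $2/\eta$. The objective, step size, and initialization are illustrative assumptions chosen so that the sharpness starts above $2/\eta$ and the loss oscillates while it evolves.

import numpy as np

# Illustrative sketch only -- not the paper's minimalist example. Gradient
# descent on f(a, b) = 0.5 * (a*b - 1)^2 with a step size eta chosen so that
# the initial sharpness exceeds the stability threshold 2/eta.

def loss(theta):
    a, b = theta
    return 0.5 * (a * b - 1.0) ** 2

def grad(theta):
    a, b = theta
    r = a * b - 1.0
    return np.array([r * b, r * a])

def sharpness(theta):
    a, b = theta
    # Hessian of f: [[b^2, 2ab - 1], [2ab - 1, a^2]]
    H = np.array([[b * b, 2 * a * b - 1.0],
                  [2 * a * b - 1.0, a * a]])
    return np.linalg.eigvalsh(H).max()

eta = 0.2                       # step size; stability threshold 2/eta = 10
theta = np.array([3.3, 0.31])   # unbalanced start: sharpness ~ a^2 > 2/eta

for t in range(120):
    if t % 10 == 0:
        print(f"t={t:3d}  loss={loss(theta):.2e}  "
              f"sharpness={sharpness(theta):.3f}  (2/eta = {2.0 / eta:.1f})")
    theta = theta - eta * grad(theta)
print("final sharpness:", sharpness(theta))

The printout makes it easy to watch the loss oscillate while the sharpness drifts toward the $2/\eta$ threshold with this particular choice; the exact final gap depends on the step size and initialization, and the paper's analysis concerns its own minimalist example rather than this toy.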
