Paper Title

Understanding Edge-of-Stability Training Dynamics with a Minimalist Example

Paper Authors

Xingyu Zhu, Zixuan Wang, Xiang Wang, Mo Zhou, Rong Ge

Paper Abstract

Recently, researchers observed that gradient descent for deep neural networks operates in an ``edge-of-stability'' (EoS) regime: the sharpness (the maximum eigenvalue of the Hessian) is often larger than the stability threshold $2/\eta$ (where $\eta$ is the step size). Despite this, the loss oscillates and converges in the long run, and the sharpness at the end is just slightly below $2/\eta$. While many other well-understood nonconvex objectives, such as matrix factorization or two-layer networks, can also converge despite large sharpness, there is often a larger gap between the sharpness of the endpoint and $2/\eta$. In this paper, we study the EoS phenomenon by constructing a simple function that exhibits the same behavior. We give a rigorous analysis of its training dynamics in a large local region and explain why the final converging point has sharpness close to $2/\eta$. Globally, we observe that the training dynamics of our example exhibit an interesting bifurcating behavior, which has also been observed in the training of neural nets.
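
To make the described behavior concrete, below is a minimal sketch (not the construction analyzed in the paper) that runs gradient descent on the toy nonconvex objective $f(a,b) = \frac{1}{2}(ab-1)^2$ and tracks the sharpness (largest Hessian eigenvalue) against the stability threshold $2/\eta$. The objective, step size, and initialization are illustrative assumptions chosen so that the sharpness starts above $2/\eta$ and the loss oscillates while it evolves.

import numpy as np

# Illustrative sketch only -- not the paper's minimalist example. Gradient
# descent on f(a, b) = 0.5 * (a*b - 1)^2 with a step size eta chosen so that
# the initial sharpness exceeds the stability threshold 2/eta.

def loss(theta):
    a, b = theta
    return 0.5 * (a * b - 1.0) ** 2

def grad(theta):
    a, b = theta
    r = a * b - 1.0
    return np.array([r * b, r * a])

def sharpness(theta):
    a, b = theta
    # Hessian of f: [[b^2, 2ab - 1], [2ab - 1, a^2]]
    H = np.array([[b * b, 2 * a * b - 1.0],
                  [2 * a * b - 1.0, a * a]])
    return np.linalg.eigvalsh(H).max()

eta = 0.2                       # step size; stability threshold 2/eta = 10
theta = np.array([3.3, 0.31])   # unbalanced start: sharpness ~ a^2 > 2/eta

for t in range(120):
    if t % 10 == 0:
        print(f"t={t:3d}  loss={loss(theta):.2e}  "
              f"sharpness={sharpness(theta):.3f}  (2/eta = {2.0 / eta:.1f})")
    theta = theta - eta * grad(theta)
print("final sharpness:", sharpness(theta))

The printout makes it easy to watch the loss oscillate while the sharpness drifts toward the $2/\eta$ threshold with this particular choice; the exact final gap depends on the step size and initialization, and the paper's analysis concerns its own minimalist example rather than this toy.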
