Paper Title
Beyond the Edge of Stability via Two-step Gradient Updates
Paper Authors
Paper Abstract
Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a `bona-fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called ``Edge of Stability'' (EoS), where the step-size crosses the admissibility threshold inversely proportional to the aforementioned Lipschitz constant. Perhaps surprisingly, GD has been empirically observed to still converge regardless of local instability and oscillatory behaviour. The incipient theoretical analysis of this phenomenon has mainly focused on the overparametrised regime, where the effect of choosing a large learning rate may be associated with a `Sharpness-Minimisation' implicit regularisation within the manifold of minimisers, under appropriate asymptotic limits. In contrast, in this work we directly examine the conditions for such unstable convergence, focusing on simple, yet representative, learning problems, via an analysis of two-step gradient updates. Specifically, we characterise a local condition involving third-order derivatives that guarantees existence of, and convergence to, fixed points of the two-step updates, and leverage this property in a teacher-student setting, under population loss. Finally, starting from Matrix Factorization, we provide observations of period-2 orbits of GD in high-dimensional settings together with intuition about their dynamics, along with an exploration of more general settings.
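The oscillatory-yet-stable behaviour described in the abstract can be illustrated with a hypothetical one-dimensional toy problem (not necessarily the paper's exact construction). Take the scalar loss f(x) = (x² − 1)²/4, whose minima at x = ±1 have sharpness f''(±1) = 2, so the classical stability threshold is η < 2/2 = 1. A period-2 orbit of the one-step GD map g(x) = x − η f'(x) is a fixed point of the two-step map g∘g, and it is locally stable when |g'(x₁)·g'(x₂)| < 1 over the two points of the cycle:

```python
# Hypothetical toy illustration: GD on f(x) = (x^2 - 1)^2 / 4.
# Minima at x = +-1 have sharpness f''(+-1) = 2, so the classical
# stability condition is eta < 2/2 = 1.  We deliberately step beyond it.
def grad(x):
    return x * (x * x - 1.0)  # f'(x) = x^3 - x

eta = 1.2          # step size beyond the edge of stability (> 1)
x = 1.2            # initialisation near the minimum x = 1
traj = [x]
for _ in range(500):
    x = x - eta * grad(x)
    traj.append(x)

# The iterates never settle at x = 1; instead they converge to a stable
# period-2 orbit: x_{t+2} is (numerically) equal to x_t, while x_{t+1}
# stays far from x_t.
print(abs(traj[-1] - traj[-3]) < 1e-6, abs(traj[-1] - traj[-2]) > 0.3)
```

With η = 1.2 the fixed point x = 1 is unstable for the one-step map (|g'(1)| = |1 − 2η| > 1), yet the two-step derivative product along the resulting cycle has magnitude below one, which is the kind of local condition on higher-order structure that the analysis of two-step updates makes precise.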