Paper Title
Optimal learning rate schedules in high-dimensional non-convex optimization problems
Paper Authors
Paper Abstract
Learning rate schedules are ubiquitously used to speed up and improve optimization. Many different policies have been introduced on an empirical basis, and theoretical analyses have been developed for convex settings. However, in many realistic problems the loss landscape is high-dimensional and non-convex -- a case for which results are scarce. In this paper we present a first analytical study of the role of learning rate scheduling in this setting, focusing on Langevin optimization with a learning rate decaying as $\eta(t)=t^{-\beta}$. We begin by considering models where the loss is a Gaussian random function on the $N$-dimensional sphere ($N\rightarrow \infty$), featuring an extensive number of critical points. We find that, to speed up optimization without getting stuck at saddles, one must choose a decay rate $\beta<1$, contrary to convex setups where $\beta=1$ is generally optimal. We then add to the problem a signal to be recovered. In this setting, the dynamics decompose into two phases: an \emph{exploration} phase, where the dynamics navigate through the rough parts of the landscape, followed by a \emph{convergence} phase, where the signal is detected and the dynamics enter a convex basin. In this case, it is optimal to keep a large learning rate during the exploration phase in order to escape the non-convex region as quickly as possible, and then use the convex criterion $\beta=1$ to converge rapidly to the solution. Finally, we demonstrate that our conclusions hold in a common regression task involving neural networks.
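To make the setting concrete, here is a minimal sketch (not taken from the paper) of Langevin optimization on the sphere with a decaying learning rate $\eta(t)=\eta_0\, t^{-\beta}$. The Gaussian random loss is illustrated with a spherical 3-spin Hamiltonian; the model choice, normalizations, and parameter values ($N$, $\eta_0$, the temperature) are assumptions made for illustration only.

```python
import numpy as np

# Minimal sketch: Langevin optimization on the sphere |x|^2 = N with a
# polynomially decaying learning rate eta(t) = eta0 * t^(-beta).
# The loss is a spherical 3-spin energy H(x) = sum_{ijk} J_{ijk} x_i x_j x_k,
# used here as an illustrative Gaussian random function on the N-sphere
# (normalization and parameters are arbitrary choices for this demo).

rng = np.random.default_rng(0)
N = 50                                   # dimension (the paper studies N -> infinity)
J = rng.normal(size=(N, N, N)) / N       # random Gaussian couplings

def loss(x):
    # 3-spin energy, to be minimized
    return np.einsum('ijk,i,j,k->', J, x, x, x)

def grad(x):
    # gradient of the 3-spin energy: sum over the three tensor slots
    return (np.einsum('ijk,j,k->i', J, x, x)
            + np.einsum('ijk,i,k->j', J, x, x)
            + np.einsum('ijk,i,j->k', J, x, x))

def project(x):
    # keep the dynamics on the sphere of radius sqrt(N)
    return x * np.sqrt(N) / np.linalg.norm(x)

def langevin(beta, eta0=0.1, temperature=0.05, steps=5000):
    x = project(rng.normal(size=N))
    for t in range(1, steps + 1):
        eta = eta0 * t ** (-beta)        # schedule eta(t) = eta0 * t^(-beta)
        noise = rng.normal(size=N)
        x = x - eta * grad(x) + np.sqrt(2 * eta * temperature) * noise
        x = project(x)
    return loss(x) / N                   # final energy density

# Compare the convex-style decay (beta = 1) with a slower decay (beta < 1)
for beta in (1.0, 0.5):
    print(f"beta = {beta}: final energy density {langevin(beta):.3f}")
```

The two runs at the end contrast the convex criterion $\beta=1$ with a slower decay $\beta=0.5$, mirroring the comparison of decay rates studied in the paper.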