Paper title
Effects of momentum scaling for SGD
Paper authors
Paper abstract
This paper studies the properties of stochastic gradient methods with preconditioning. We focus on momentum-updated preconditioners with momentum coefficient $β$. Seeking to explain the practical efficiency of scaled methods, we provide a convergence analysis in a norm associated with the preconditioner, and demonstrate that scaling allows one to remove the gradient Lipschitz constant from the convergence rates. Along the way, we emphasize the important role of $β$, which various authors arbitrarily fix to a constant of the form $0.99\ldots9$. Finally, we propose explicit constructive formulas for adaptive $β$ and step-size values.
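For concreteness, the kind of update the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's actual method: the abstract does not fix the preconditioner's form, so an RMSProp-style diagonal preconditioner with momentum coefficient $β$ is assumed here, and the quadratic test objective, step size, and noise model are hypothetical.

```python
import numpy as np

# Sketch of SGD with a momentum-updated diagonal preconditioner.
# Assumptions (not from the paper): RMSProp-style diagonal scaling,
# an ill-conditioned quadratic objective, and additive Gaussian
# gradient noise, all chosen purely for illustration.

rng = np.random.default_rng(0)
A = np.diag([1.0, 100.0])        # quadratic objective f(x) = 0.5 * x^T A x
x = np.array([1.0, 1.0])         # iterate
v = np.zeros_like(x)             # preconditioner statistics
beta, alpha, eps = 0.99, 1e-2, 1e-8

for _ in range(2000):
    grad = A @ x + 0.01 * rng.standard_normal(2)   # stochastic gradient oracle
    v = beta * v + (1.0 - beta) * grad**2          # momentum update of the preconditioner
    x = x - alpha * grad / (np.sqrt(v) + eps)      # scaled (preconditioned) step

print(x)  # the iterate approaches the minimizer at the origin
```

In this sketch, $β$ is the constant the abstract warns about: hard-coding it to $0.99\ldots9$ fixes the preconditioner's effective averaging window regardless of the problem, whereas the paper proposes choosing $β$ and the step size adaptively.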