Paper title
Effects of momentum scaling for SGD
Paper authors
Paper abstract
This paper studies the properties of stochastic gradient methods with preconditioning. We focus on momentum-updated preconditioners with momentum coefficient $β$. Seeking to explain the practical efficiency of scaled methods, we provide a convergence analysis in a norm associated with the preconditioner, and demonstrate that scaling allows one to remove the gradient Lipschitz constant from the convergence rates. Along the way, we emphasize the important role of $β$, which various authors arbitrarily fix to a constant of the form $0.99\ldots9$. Finally, we propose explicit constructive formulas for adaptive $β$ and step-size values.
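For concreteness, the kind of update the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's actual method: the abstract does not fix the preconditioner's form, so an RMSProp-style diagonal preconditioner with momentum coefficient $β$ is assumed here, and the quadratic test objective, step size, and noise model are hypothetical.

```python
import numpy as np

# Sketch of SGD with a momentum-updated diagonal preconditioner.
# Assumptions (not from the paper): RMSProp-style diagonal scaling,
# an ill-conditioned quadratic objective, and additive Gaussian
# gradient noise, all chosen purely for illustration.

rng = np.random.default_rng(0)
A = np.diag([1.0, 100.0])        # quadratic objective f(x) = 0.5 * x^T A x
x = np.array([1.0, 1.0])         # iterate
v = np.zeros_like(x)             # preconditioner statistics
beta, alpha, eps = 0.99, 1e-2, 1e-8

for _ in range(2000):
    grad = A @ x + 0.01 * rng.standard_normal(2)   # stochastic gradient oracle
    v = beta * v + (1.0 - beta) * grad**2          # momentum update of the preconditioner
    x = x - alpha * grad / (np.sqrt(v) + eps)      # scaled (preconditioned) step

print(x)  # the iterate approaches the minimizer at the origin
```

In this sketch, $β$ is the constant the abstract warns about: hard-coding it to $0.99\ldots9$ fixes the preconditioner's effective averaging window regardless of the problem, whereas the paper proposes choosing $β$ and the step size adaptively.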