Paper Title


The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance

Authors

Matthew Faw, Isidoros Tziotis, Constantine Caramanis, Aryan Mokhtari, Sanjay Shakkottai, Rachel Ward

Abstract

We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non-adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order-optimal convergence rate of $\mathcal{O}\left(\frac{\mathrm{poly}\log(T)}{\sqrt{T}}\right)$ after $T$ iterations under the same assumptions as optimally-tuned non-adaptive SGD (unbounded gradient norms and affine noise variance scaling), and crucially, without needing any tuning parameters. We thus establish that adaptive gradient methods exhibit order-optimal convergence in much broader regimes than previously understood.
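
To make the self-tuning step-size rule concrete, below is a minimal NumPy sketch of the AdaGrad-Norm update as it is commonly stated (a single scalar accumulator of squared stochastic gradient norms). The quadratic test objective, the affine-variance-style noise model, and the hyperparameters `eta` and `b0` are illustrative assumptions for this sketch and are not taken from the paper.

```python
import numpy as np

def adagrad_norm(grad_fn, x0, eta=1.0, b0=1e-2, T=1000, seed=0):
    """AdaGrad-Norm: SGD with a single scalar adaptive step size.

    The accumulator b_t^2 = b_0^2 + sum_{s<=t} ||g_s||^2 grows with the
    observed stochastic gradient norms, so the step size eta / b_t shrinks
    automatically, without a hand-tuned decay schedule.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    b_sq = b0 ** 2
    for _ in range(T):
        g = grad_fn(x, rng)            # stochastic gradient at the current iterate
        b_sq += np.dot(g, g)           # accumulate the squared gradient norm
        x -= eta / np.sqrt(b_sq) * g   # scalar adaptive step size
    return x

# Illustrative problem (not from the paper): a noisy quadratic in R^10 whose
# gradient noise standard deviation scales with the gradient norm, loosely
# mimicking an affine-variance noise model.
def noisy_quadratic_grad(x, rng):
    g = x                              # true gradient of f(x) = 0.5 * ||x||^2
    noise = rng.normal(size=x.shape) * (0.1 + 0.5 * np.linalg.norm(g))
    return g + noise

x_final = adagrad_norm(noisy_quadratic_grad, x0=np.ones(10), T=5000)
print("final iterate norm:", np.linalg.norm(x_final))
```

Note that the step size depends on the current gradient through `b_sq`, so it is not conditionally independent of the stochastic gradient; this is the coupling that the paper's analysis handles without assuming it away.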
