Paper Title

On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems

Paper Authors

Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, Volkan Cevher

Paper Abstract

This paper analyzes the trajectories of stochastic gradient descent (SGD) to help understand the algorithm's convergence properties in non-convex problems. We first show that the sequence of iterates generated by SGD remains bounded and converges with probability $1$ under a very broad range of step-size schedules. Subsequently, going beyond existing positive probability guarantees, we show that SGD avoids strict saddle points/manifolds with probability $1$ for the entire spectrum of step-size policies considered. Finally, we prove that the algorithm's rate of convergence to Hurwicz minimizers is $\mathcal{O}(1/n^{p})$ if the method is employed with a $\Theta(1/n^{p})$ step-size schedule. This provides an important guideline for tuning the algorithm's step-size as it suggests that a cool-down phase with a vanishing step-size could lead to faster convergence; we demonstrate this heuristic using ResNet architectures on CIFAR.
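To make the step-size schedules discussed in the abstract concrete, below is a minimal NumPy sketch of SGD run with a $\Theta(1/n^{p})$ schedule on a toy non-convex objective. The test function, noise model, and the values of `gamma` and `p` are illustrative assumptions, not taken from the paper; the sketch only shows the form of the schedule, not the paper's experimental setup.

```python
import numpy as np

def grad_oracle(x, rng, noise_std=0.1):
    # Noisy gradient of a simple non-convex test function f(x) = sum(x_i^4 - x_i^2).
    # Both the function and the Gaussian noise model are illustrative assumptions.
    return 4 * x**3 - 2 * x + noise_std * rng.standard_normal(x.shape)

def sgd_polynomial_stepsize(x0, n_iters=10_000, gamma=0.1, p=0.75, seed=0):
    # SGD with a Theta(1/n^p) step-size schedule: gamma_n = gamma / n^p.
    # The abstract suggests an O(1/n^p) convergence rate to Hurwicz minimizers
    # under such a schedule; gamma and p here are arbitrary example values.
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for n in range(1, n_iters + 1):
        step = gamma / n**p          # vanishing ("cool-down") step-size
        x = x - step * grad_oracle(x, rng)
    return x

if __name__ == "__main__":
    x_final = sgd_polynomial_stepsize(x0=np.array([1.5, -2.0, 0.3]))
    print("final iterate:", x_final)
```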
