Paper Title
STL-SGD: Speeding Up Local SGD with Stagewise Communication Period
Paper Authors
Paper Abstract
Distributed parallel stochastic gradient descent algorithms are workhorses for large-scale machine learning tasks. Among them, local stochastic gradient descent (Local SGD) has attracted significant attention due to its low communication complexity. Previous studies prove that the communication complexity of Local SGD with a fixed or an adaptive communication period is on the order of $O(N^{\frac{3}{2}} T^{\frac{1}{2}})$ and $O(N^{\frac{3}{4}} T^{\frac{3}{4}})$ when the data distributions on clients are identical (IID) or otherwise (Non-IID), where $N$ is the number of clients and $T$ is the number of iterations. In this paper, to accelerate convergence by reducing the communication complexity, we propose \textit{ST}agewise \textit{L}ocal \textit{SGD} (STL-SGD), which gradually increases the communication period as the learning rate decreases. We prove that STL-SGD keeps the same convergence rate and linear speedup as mini-batch SGD. In addition, as a benefit of the increasing communication period, when the objective is strongly convex or satisfies the Polyak-Łojasiewicz condition, the communication complexity of STL-SGD is $O(N \log{T})$ and $O(N^{\frac{1}{2}} T^{\frac{1}{2}})$ for the IID case and the Non-IID case, respectively, achieving significant improvements over Local SGD. Experiments on both convex and non-convex problems demonstrate the superior performance of STL-SGD.
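To make the stagewise idea concrete, below is a minimal single-machine Python sketch of a Local SGD loop with a stagewise communication period. The abstract does not specify the exact schedules, so the geometric choices here (halving the learning rate and doubling the communication period at each stage), along with the names `stl_sgd` and `grad_fn` and the toy quadratic objective, are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def stl_sgd(grad_fn, x0, num_clients, num_stages,
            steps_per_stage=100, period0=1, lr0=0.1, seed=0):
    """Minimal simulation of Local SGD with a stagewise communication period.

    Stage s runs local SGD on every client with learning rate lr0 / 2**s
    and communication period period0 * 2**s; client models are averaged
    at the end of each period. The geometric schedules are illustrative
    assumptions, not the paper's stated choices.
    """
    rng = np.random.default_rng(seed)
    # Every client starts from the same global model.
    models = [x0.copy() for _ in range(num_clients)]
    for s in range(num_stages):
        lr = lr0 / 2 ** s          # decreasing learning rate
        period = period0 * 2 ** s  # increasing communication period
        for t in range(steps_per_stage):
            for i in range(num_clients):
                # Stochastic gradient step on client i's local data.
                models[i] -= lr * grad_fn(models[i], i, rng)
            if (t + 1) % period == 0:
                # Communication round: average all client models.
                avg = np.mean(models, axis=0)
                models = [avg.copy() for _ in range(num_clients)]
    return np.mean(models, axis=0)

# Toy usage: each client minimizes a noisy quadratic around a
# client-specific optimum (a stand-in for Non-IID data).
if __name__ == "__main__":
    optima = np.linspace(-1.0, 1.0, 4)

    def grad_fn(x, i, rng):
        return (x - optima[i]) + 0.1 * rng.standard_normal(x.shape)

    x = stl_sgd(grad_fn, x0=np.array([5.0]), num_clients=4, num_stages=5)
    print("final model:", x)  # should approach the mean of the optima
```

In a real distributed deployment the averaging step would be an all-reduce across workers; lengthening the communication period directly reduces how often that synchronization occurs, which is the source of the communication savings the abstract claims.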