Paper Title
Demystifying Why Local Aggregation Helps: Convergence Analysis of Hierarchical SGD
Paper Authors
Paper Abstract
Hierarchical SGD (H-SGD) has emerged as a new distributed SGD algorithm for multi-level communication networks. In H-SGD, before each global aggregation, workers send their updated local models to local servers for aggregation. Despite recent research efforts, the effect of local aggregation on global convergence still lacks theoretical understanding. In this work, we first introduce a new notion of "upward" and "downward" divergences. We then use it to conduct a novel analysis that yields a worst-case convergence upper bound for two-level H-SGD with non-IID data, non-convex objective functions, and stochastic gradients. By extending this result to the case with random grouping, we observe that this convergence upper bound of H-SGD lies between the upper bounds of two single-level local SGD settings, whose numbers of local iterations equal the local and global update periods of H-SGD, respectively. We refer to this as the "sandwich behavior". Furthermore, we extend our analytical approach based on "upward" and "downward" divergences to study the convergence of the general case of H-SGD with more than two levels, where the "sandwich behavior" still holds. Our theoretical results provide key insights into why local aggregation can be beneficial in improving the convergence of H-SGD.
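To make the update pattern described in the abstract concrete, the following is a minimal sketch of two-level H-SGD; it is not from the paper. The names (`h_sgd`, `stochastic_grad`) and the toy scalar quadratic objectives (standing in for the non-convex, non-IID setting) are illustrative assumptions: workers run local SGD steps, local servers average their group's models every `local_period` iterations, and a global aggregation averages all models every `global_period` iterations.

```python
import random

def stochastic_grad(w, data_mean, noise_std=0.1):
    """Noisy gradient of the toy local loss (w - data_mean)^2 / 2."""
    return (w - data_mean) + random.gauss(0.0, noise_std)

def h_sgd(groups, data_means, local_period, global_period, total_iters, lr=0.1):
    """Simulate two-level H-SGD on scalar quadratic objectives.

    groups: partition of worker indices, one list per local server.
    data_means: per-worker optimum, modeling non-IID local data.
    local_period / global_period: iterations between local / global
    aggregations (global_period is assumed a multiple of local_period).
    """
    num_workers = sum(len(g) for g in groups)
    w = [0.0] * num_workers  # all workers start from the same model
    for t in range(1, total_iters + 1):
        # Each worker takes one local SGD step on its own data.
        for i in range(num_workers):
            w[i] -= lr * stochastic_grad(w[i], data_means[i])
        if t % global_period == 0:
            # Global aggregation: average all workers' models.
            avg = sum(w) / num_workers
            w = [avg] * num_workers
        elif t % local_period == 0:
            # Local aggregation: each local server averages its group.
            for g in groups:
                g_avg = sum(w[i] for i in g) / len(g)
                for i in g:
                    w[i] = g_avg
    return w
```

For example, with two groups of two workers and `global_period = 4`, after an iteration count that is a multiple of the global period all workers hold the same model, which approaches the global optimum (the mean of `data_means`); the intermediate local averaging is exactly the step whose benefit the paper's "sandwich" bounds quantify.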