Paper Title
Heavy-Tail Phenomenon in Decentralized SGD
Paper Authors
Paper Abstract
Recent theoretical studies have shown that heavy tails can emerge in stochastic optimization due to 'multiplicative noise', even under surprisingly simple settings, such as linear regression with Gaussian data. While these studies have uncovered several interesting phenomena, they consider conventional stochastic optimization problems, which exclude decentralized settings that naturally arise in modern machine learning applications. In this paper, we study the emergence of heavy tails in decentralized stochastic gradient descent (DE-SGD), and investigate the effect of decentralization on the tail behavior. We first show that, when the loss function at each computational node is twice continuously differentiable and strongly convex outside a compact region, the law of the DE-SGD iterates converges to a distribution with polynomially decaying (heavy) tails. To obtain more explicit control over the tail exponent, we then consider the case where the loss at each node is quadratic, and show that the tail index can be estimated as a function of the step-size, the batch-size, and the topological properties of the network of computational nodes. We then provide theoretical and empirical results showing that DE-SGD has heavier tails than centralized SGD. We also compare DE-SGD to disconnected SGD, in which the data are distributed across nodes that do not communicate. Our theory uncovers an interesting interplay between the tails and the network structure: we identify two regimes of parameters (step-size and network size) in which DE-SGD has lighter or heavier tails than disconnected SGD, depending on the regime. Finally, to support our theoretical results, we provide numerical experiments on both synthetic data and neural networks.
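To make the setting concrete, below is a minimal NumPy sketch of DE-SGD on quadratic node losses (linear regression with Gaussian data) together with a Hill-type estimate of the tail index of the stationary iterates. This is not the authors' code: the ring topology, mixing matrix, step-size, batch-size, burn-in length, and the use of the Hill estimator are illustrative assumptions made for this example.

```python
# Illustrative sketch (not the authors' implementation) of DE-SGD on quadratic
# losses, plus a Hill-type estimate of the tail index of the iterate norms.
# Ring topology, step-size, batch-size, and burn-in are assumed values.
import numpy as np

rng = np.random.default_rng(0)

n_nodes, dim = 8, 5      # number of computational nodes, parameter dimension
eta, batch = 0.1, 1      # step-size and batch-size (assumed)
n_iters = 20_000

# Doubly stochastic mixing matrix W for a ring network (lazy Metropolis weights).
W = 0.5 * np.eye(n_nodes)
for i in range(n_nodes):
    W[i, (i - 1) % n_nodes] += 0.25
    W[i, (i + 1) % n_nodes] += 0.25

x = np.zeros((n_nodes, dim))   # one local parameter vector per node
norms = []                     # norms of node 0's iterates, used for the tail estimate

for t in range(n_iters):
    grads = np.empty_like(x)
    for i in range(n_nodes):
        # Fresh Gaussian mini-batch; local loss is 0.5 * ||A x - b||^2 / batch.
        A = rng.standard_normal((batch, dim))
        b = A @ np.ones(dim) + rng.standard_normal(batch)
        grads[i] = A.T @ (A @ x[i] - b) / batch
    # DE-SGD step: average with neighbors via W, then take a local gradient step.
    x = W @ x - eta * grads
    if t >= n_iters // 2:      # keep only post-burn-in iterates
        norms.append(np.linalg.norm(x[0]))

# Hill estimator of the tail index from the k largest iterate norms.
s = np.sort(np.asarray(norms))
k = 500
tail_index = 1.0 / np.mean(np.log(s[-k:] / s[-k - 1]))
print(f"estimated tail index (Hill, k={k}): {tail_index:.2f}")
```

Consistent with the comparisons described in the abstract, one would expect the estimated tail index to respond to the step-size, batch-size, and topology; for instance, replacing W with the identity matrix gives a disconnected-SGD baseline, while replacing it with the uniform averaging matrix (all entries 1/n_nodes) mimics a centralized scheme.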