Paper Title
Differentially Quantized Gradient Methods
Paper Authors
Paper Abstract
Consider the following distributed optimization scenario. A worker has access to training data that it uses to compute the gradients, while a server decides when to stop the iterative computation based on its target accuracy or delay constraints. The server receives all of its information about the problem instance from the worker via a rate-limited noiseless communication channel. We introduce a principle we call Differential Quantization (DQ), which prescribes compensating past quantization errors so as to direct the descent trajectory of a quantized algorithm towards that of its unquantized counterpart. Assuming that the objective function is smooth and strongly convex, we prove that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$, where $\sigma_{\mathrm{GD}}$ is the contraction factor of unquantized gradient descent (GD), $\rho_n \geq 1$ is the covering efficiency of the quantizer, and $R$ is the bitrate per problem dimension $n$. Thus at any $R \geq \log_2(\rho_n / \sigma_{\mathrm{GD}})$ bits, the contraction factor of DQ-GD is the same as that of unquantized GD, i.e., there is no loss due to quantization. We show that no algorithm within a certain class can attain a contraction factor better than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$. Since quantizers exist with $\rho_n \to 1$ as $n \to \infty$ (Rogers, 1963), this means that DQ-GD is asymptotically optimal. The principle of differential quantization continues to apply to gradient methods with momentum, such as Nesterov's accelerated gradient descent and Polyak's heavy ball method. For these algorithms as well, if the rate is above a certain threshold, the differentially quantized algorithm attains the same contraction factor as its unquantized counterpart, with no loss due to quantization. Experimental results on least-squares problems validate our theoretical analysis.
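To make the DQ principle concrete, here is a minimal Python sketch of a differentially quantized gradient descent loop on a toy least-squares problem. The `uniform_quantize` helper, the clipping radius, the step size, and the exact form of the error-compensation step are illustrative assumptions for this sketch, not the covering-efficient construction analyzed in the paper; the point is only to show the worker compensating its past quantization residual so that the server's quantized trajectory tracks unquantized GD.

```python
import numpy as np

def uniform_quantize(v, rate_bits, radius):
    """Per-coordinate uniform quantizer on [-radius, radius] with 2**rate_bits
    levels (a simple stand-in for a covering-efficient quantizer)."""
    levels = 2 ** rate_bits
    step = 2.0 * radius / levels
    clipped = np.clip(v, -radius, radius - 1e-9)
    idx = np.floor((clipped + radius) / step)   # cell index per coordinate
    return (idx + 0.5) * step - radius          # cell center

def dq_gd(grad, x0, eta, rate_bits, radius, num_iters):
    """Sketch of a differentially quantized gradient descent loop.

    The worker adds the accumulated quantization residual to the fresh
    gradient before quantizing, steering the server's quantized descent
    trajectory towards the unquantized GD trajectory."""
    x = x0.copy()                 # server iterate, mirrored at the worker
    e = np.zeros_like(x0)         # accumulated quantization residual
    for _ in range(num_iters):
        g = grad(x)                                  # worker: exact gradient
        u = g + e                                    # DQ: compensate past error
        q = uniform_quantize(u, rate_bits, radius)   # R bits per dimension sent
        e = u - q                                    # residual carried forward
        x = x - eta * q                              # server: quantized descent step
    return x

# Toy least-squares example: f(x) = 0.5 * ||A x - b||^2 (hypothetical data)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
grad = lambda x: A.T @ (A @ x - b)
x_hat = dq_gd(grad, np.zeros(5), eta=1.0 / np.linalg.norm(A, 2) ** 2,
              rate_bits=4, radius=10.0, num_iters=200)
```

In this sketch, setting `e = np.zeros_like(x0)` permanently (i.e., skipping the compensation step) recovers naively quantized GD, whose error floor is governed by the quantizer resolution rather than by the GD contraction factor.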