Paper Title
Lossy Gradient Compression: How Much Accuracy Can One Bit Buy?
Paper Authors
Abstract
In federated learning (FL), a global model is trained at a Parameter Server (PS) by aggregating model updates obtained from multiple remote learners. Generally, the communication between the remote users and the PS is rate-limited, while the transmission from the PS to the remote users is unconstrained. The FL setting thus gives rise to a distributed learning scenario in which the updates from the remote learners have to be compressed so as to meet communication rate constraints on the uplink transmission toward the PS. For this problem, one wishes to compress the model updates so as to minimize the loss in accuracy resulting from the compression error. In this paper, we take a rate-distortion approach to the compressor design problem for the distributed training of deep neural networks (DNNs). In particular, we define a measure of compression performance under communication-rate constraints -- the \emph{per-bit accuracy} -- which captures the ultimate improvement in accuracy that one bit of communication brings to the centralized model. In order to maximize the per-bit accuracy, we consider modeling the DNN gradient updates at the remote learners as following a generalized normal distribution. Under this assumption on the DNN gradient distribution, we propose a class of distortion measures to aid the design of quantizers for the compression of the model updates. We argue that this family of distortion measures, which we refer to as the "$M$-magnitude weighted $L_2$" norm, captures the practitioner's intuition in the choice of a gradient compressor. Numerical simulations are provided to validate the proposed approach on the CIFAR-10 dataset.
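The two modeling ingredients in the abstract -- generalized-normal gradients and a magnitude-weighted squared-error distortion -- can be sketched numerically. The snippet below is only an illustration of the idea, not the paper's implementation: the exact functional form of the "$M$-magnitude weighted $L_2$" distortion and the one-bit quantizer used here are assumptions for the sake of the sketch.

```python
import numpy as np
from scipy.stats import gennorm

# Model DNN gradient entries as generalized-normal samples.
# beta = 2 is Gaussian, beta = 1 is Laplacian; smaller beta gives
# the heavier tails often observed for DNN gradients.
rng = np.random.default_rng(0)
g = gennorm.rvs(beta=1.0, scale=0.01, size=10_000, random_state=rng)

def m_weighted_l2(g, g_hat, M):
    """Hypothetical 'M-magnitude weighted L2' distortion: squared
    error weighted by |g|^M. M = 0 recovers plain MSE; larger M
    penalizes errors on large-magnitude entries more heavily.
    (A sketch; the paper's exact definition may differ.)"""
    return float(np.mean(np.abs(g) ** M * (g - g_hat) ** 2))

# A simple one-bit "sign times mean magnitude" quantizer as a
# baseline uplink compressor.
g_hat = np.sign(g) * np.mean(np.abs(g))

print(m_weighted_l2(g, g_hat, M=0.0))  # plain MSE
print(m_weighted_l2(g, g_hat, M=1.0))  # magnitude-weighted distortion
```

Under such a distortion, a quantizer would be tuned to spend its single bit where large-magnitude gradient entries dominate, rather than minimizing average error uniformly.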