Paper Title

Accumulated Decoupled Learning: Mitigating Gradient Staleness in Inter-Layer Model Parallelization

Paper Authors

Huiping Zhuang, Zhiping Lin, Kar-Ann Toh

Paper Abstract

Decoupled learning is a branch of model parallelism that parallelizes the training of a network by splitting it depth-wise into multiple modules. Techniques from decoupled learning usually suffer from a stale gradient effect because of their asynchronous implementation, causing performance degradation. In this paper, we propose accumulated decoupled learning (ADL), which incorporates the gradient accumulation technique to mitigate the stale gradient effect. We give both theoretical and empirical evidence of how gradient staleness can be reduced. We prove that the proposed method converges to critical points, i.e., the gradients converge to 0, despite its asynchronous nature. Empirical validation is provided by training deep convolutional neural networks to perform classification tasks on the CIFAR-10 and ImageNet datasets. ADL is shown to outperform several state-of-the-art methods on the classification tasks, and it is the fastest among the compared methods.
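The core mechanism described in the abstract is gradient accumulation applied inside each depth-wise module: a module accumulates gradients over several micro-batches before taking an optimizer step, so its weights change less frequently and the gradients received asynchronously from other modules are less stale. Below is a minimal sketch of that accumulation step for a single module, assuming PyTorch; the module definition, accum_steps, and the random stand-in data are illustrative only, and the asynchronous inter-module pipeline of ADL is not shown. This is not the authors' implementation.

# Gradient accumulation for one module of a depth-wise split network (sketch).
# In ADL, each module would run a loop like this asynchronously on delayed
# activations/gradients; updating only every `accum_steps` micro-batches is
# what reduces the effective gradient staleness.
import torch
import torch.nn as nn

module = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))  # one illustrative module
optimizer = torch.optim.SGD(module.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
accum_steps = 4  # hypothetical accumulation window

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(16, 32)           # stand-in for (possibly stale) incoming activations
    y = torch.randint(0, 10, (16,))   # stand-in for labels / training signal
    loss = criterion(module(x), y) / accum_steps  # average the loss over the window
    loss.backward()                   # gradients accumulate in each parameter's .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # one weight update per accumulated window
        optimizer.zero_grad()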
