Paper Title

Parallel Training of GRU Networks with a Multi-Grid Solver for Long Sequences

Paper Authors

Gordon Euhyun Moon, Eric C. Cyr

Paper Abstract

Parallelizing Gated Recurrent Unit (GRU) networks is a challenging task, as the training procedure of a GRU is inherently sequential. Prior efforts to parallelize GRU have largely focused on conventional parallelization strategies such as data-parallel and model-parallel training algorithms. However, when the given sequences are very long, existing approaches are still inevitably limited in training time. In this paper, we present a novel parallel training scheme (called parallel-in-time) for GRU based on a multigrid reduction in time (MGRIT) solver. MGRIT partitions a sequence into multiple shorter sub-sequences and trains the sub-sequences on different processors in parallel. The key to achieving speedup is a hierarchical correction of the hidden state to accelerate end-to-end communication in both the forward and backward propagation phases of gradient descent. Experimental results on the HMDB51 dataset, where each video is an image sequence, demonstrate that the new parallel training scheme achieves up to 6.5$\times$ speedup over a serial approach. As the efficiency of our new parallelization strategy is tied to the sequence length, our parallel GRU algorithm achieves significant performance improvements as the sequence length increases.
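To make the parallel-in-time idea concrete, below is a minimal single-process sketch of the two-level special case of MGRIT (essentially a parareal iteration) applied to a GRU forward pass. This is not the authors' implementation: the function names, the choice of coarse propagator (skipping inputs), and all parameters are illustrative assumptions, and the actual paper applies MGRIT with hierarchical corrections to both forward and backward propagation.

```python
import torch

def fine_propagate(cell, h0, xs):
    """Expensive fine propagator F: run the GRU cell over every time step."""
    h = h0
    for x in xs:
        h = cell(x, h)
    return h

def coarse_propagate(cell, h0, xs, stride=4):
    """Cheap coarse propagator G, here simply skipping inputs.
    This particular choice of G is an illustrative assumption."""
    h = h0
    for x in xs[::stride]:
        h = cell(x, h)
    return h

def parareal_gru_forward(cell, inputs, n_chunks=8, iters=3):
    """Hidden state at each chunk boundary via a parareal-style iteration.

    inputs: (seq_len, batch, input_size); cell: torch.nn.GRUCell.
    In the parallel scheme the fine sweeps run concurrently on different
    processors; here they are written sequentially for clarity.
    """
    chunks = torch.chunk(inputs, n_chunks, dim=0)
    n = len(chunks)
    batch = inputs.shape[1]
    h = [torch.zeros(batch, cell.hidden_size) for _ in range(n + 1)]
    # Serial coarse sweep to seed the chunk-boundary hidden states.
    for i in range(n):
        h[i + 1] = coarse_propagate(cell, h[i], chunks[i])
    for _ in range(iters):
        # Fine sweeps over each chunk are independent -> parallelizable.
        f = [fine_propagate(cell, h[i], chunks[i]) for i in range(n)]
        g_old = [coarse_propagate(cell, h[i], chunks[i]) for i in range(n)]
        # Serial coarse correction: carries information end-to-end in one
        # cheap pass (the "hierarchical correction" the abstract mentions).
        for i in range(n):
            g_new = coarse_propagate(cell, h[i], chunks[i])
            h[i + 1] = g_new + f[i] - g_old[i]
    return h[1:]

# Example usage (hypothetical sizes): a 512-step sequence split into 8 chunks.
cell = torch.nn.GRUCell(input_size=64, hidden_size=128)
x = torch.randn(512, 32, 64)  # (seq_len, batch, input_size)
boundary_states = parareal_gru_forward(cell, x)
```

The speedup argument mirrors the abstract: the expensive fine sweeps over sub-sequences are mutually independent and can run on different processors, while the cheap serial coarse sweep propagates hidden-state information across the whole sequence, so the iteration converges in far fewer passes than one step per sweep.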
