Paper Title
Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction
Paper Authors
Paper Abstract
Training deep learning models can be computationally expensive. Prior works have shown that increasing the batch size can potentially lead to better overall throughput. However, the batch size is frequently limited by the accelerator memory capacity due to the activations/feature maps stored for the training backward pass, as larger batch sizes require larger feature maps to be stored. Transformer-based models, which have recently seen a surge in popularity due to their good performance and applicability to a variety of tasks, face the same problem. To remedy this issue, we propose Tempo, a new approach to efficiently use accelerator (e.g., GPU) memory resources for training Transformer-based models. Our approach provides drop-in replacements for the GELU, LayerNorm, and Attention layers, reducing the memory usage and ultimately leading to more efficient training. We implement Tempo and evaluate the throughput, memory usage, and accuracy/loss on the BERT Large pre-training task. We demonstrate that Tempo enables up to 2x higher batch sizes and 16% higher training throughput over the state-of-the-art baseline. We also evaluate Tempo on GPT2 and RoBERTa models, showing 19% and 26% speedups over the baseline, respectively.
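To make the idea of a memory-reducing drop-in layer replacement concrete, the sketch below shows one generic way such a replacement could be structured in PyTorch: the large GELU intermediate of a Transformer feed-forward sub-layer is not stored for the backward pass but recomputed on demand via torch.utils.checkpoint. This is an illustrative assumption, not Tempo's actual implementation; the module name FeedForwardCheckpointed and the d_model/d_ff dimensions are hypothetical.

```python
# Minimal sketch (assumption, not the paper's method): shrink activation
# memory for a feed-forward sub-layer by recomputing it in the backward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class FeedForwardCheckpointed(nn.Module):
    def __init__(self, d_model: int = 1024, d_ff: int = 4096):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(d_ff, d_model)

    def _block(self, x: torch.Tensor) -> torch.Tensor:
        # The d_ff-sized GELU intermediate produced here dominates the
        # activation memory of this sub-layer.
        return self.fc2(self.act(self.fc1(x)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Do not save the block's intermediates; recompute them during the
        # backward pass, trading extra FLOPs for a smaller memory footprint.
        return checkpoint(self._block, x, use_reentrant=False)


if __name__ == "__main__":
    layer = FeedForwardCheckpointed()
    x = torch.randn(8, 128, 1024, requires_grad=True)  # (batch, seq, d_model)
    layer(x).sum().backward()  # backward triggers recomputation of the block
```

Because the module keeps the same input/output interface as a standard feed-forward block, it can be swapped into an existing model without changing the surrounding training code, which is the sense in which such replacements are "drop-in"; the freed activation memory is what allows larger batch sizes and, in turn, higher throughput.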