Paper Title

Data Movement Is All You Need: A Case Study on Optimizing Transformers

Authors

Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler

Abstract

Transformers are one of the most important machine learning workloads today. Training one is a very compute-intensive task, often taking days or weeks, and significant attention has been given to optimizing transformers. Despite this, existing implementations do not efficiently utilize GPUs. We find that data movement is the key bottleneck when training. Due to Amdahl's Law and massive improvements in compute performance, training has now become memory-bound. Further, existing frameworks use suboptimal data layouts. Using these insights, we present a recipe for globally optimizing data movement in transformers. We reduce data movement by up to 22.91% and overall achieve a 1.30x performance improvement over state-of-the-art frameworks when training a BERT encoder layer and 1.19x for the entire BERT. Our approach is applicable more broadly to optimizing deep neural networks, and offers insight into how to tackle emerging performance bottlenecks.
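The "memory-bound" claim in the abstract can be made concrete with a roofline-style back-of-the-envelope estimate. The sketch below is not from the paper: it assumes BERT-large-like shapes (batch 8, sequence 512, hidden 1024) and illustrative V100-class peak numbers, and simply compares the arithmetic intensity of an element-wise residual add with that of a projection matmul.

```python
# Roofline-style sketch (illustrative assumptions, not measurements from the paper):
# element-wise operators in an encoder layer fall far below the GPU's machine
# balance and are therefore memory-bound, while large matmuls sit above it.

batch, seq, hidden = 8, 512, 1024   # assumed BERT-large-like shapes
dtype_bytes = 2                     # half precision

# Assumed V100-class peak figures.
peak_flops = 125e12                 # tensor-core FLOP/s
peak_bw = 900e9                     # HBM bytes/s
machine_balance = peak_flops / peak_bw  # FLOPs per byte needed to stay compute-bound

def intensity(flops, bytes_moved):
    """Arithmetic intensity in FLOPs per byte of data moved."""
    return flops / bytes_moved

# Element-wise residual add: 1 FLOP per element; reads two tensors, writes one.
n = batch * seq * hidden
add_ai = intensity(n, 3 * n * dtype_bytes)

# Projection matmul: (batch*seq, hidden) x (hidden, hidden).
m, k, n_out = batch * seq, hidden, hidden
mm_flops = 2 * m * k * n_out
mm_bytes = (m * k + k * n_out + m * n_out) * dtype_bytes
mm_ai = intensity(mm_flops, mm_bytes)

print(f"machine balance  ~ {machine_balance:.0f} FLOP/byte")
print(f"residual add     ~ {add_ai:.2f} FLOP/byte  -> memory-bound")
print(f"projection matmul~ {mm_ai:.0f} FLOP/byte   -> compute-bound")
```

With roughly 0.17 FLOP per byte against a machine balance above 100, element-wise and normalization operators cannot hide their memory traffic, which is why the paper targets data movement and data layout rather than raw FLOPs.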
