GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Paper Title

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Authors

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

Abstract

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

Paper Title