Paper Title
FoundationLayerNorm: Scaling BERT and GPT to 1,000 Layers
Paper Authors
Paper Abstract
Mainstream BERT/GPT models contain only 10 to 20 layers, and there is little literature discussing the training of deep BERT/GPT. This paper proposes a simple yet effective method to stabilize BERT and GPT training. We successfully scale BERT and GPT up to 1,000 layers, which is an order of magnitude deeper than previous BERT and GPT models. The proposed method, FoundationLayerNormalization, enables efficient training of deep neural networks and is validated at the 1,000-layer scale.
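The abstract does not spell out the FoundationLayerNorm formulation itself. As a rough illustration of how a depth-dependent layer-norm scheme can stabilize very deep Transformers, the PyTorch sketch below uses a DeepNorm-style up-weighted residual inside a post-LN block; the constant `alpha` and its exponent are assumptions for illustration, not the paper's actual method.

```python
import torch
import torch.nn as nn


class ScaledResidualLayerNorm(nn.Module):
    """Post-LN residual block with a depth-dependent residual scaling.

    Illustrative only: the scaling rule below follows the DeepNorm family
    of stabilizers for very deep Transformers, not necessarily the exact
    FoundationLayerNorm formulation described in the paper.
    """

    def __init__(self, d_model: int, num_layers: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Hypothetical depth-dependent constant: grows slowly with the total
        # number of layers so each sub-layer's update stays bounded when
        # hundreds of blocks are stacked.
        self.alpha = (2.0 * num_layers) ** 0.25

    def forward(self, x: torch.Tensor, sublayer: nn.Module) -> torch.Tensor:
        # y = LayerNorm(alpha * x + sublayer(x))
        return self.norm(self.alpha * x + sublayer(x))


if __name__ == "__main__":
    # Toy usage: wrap a feed-forward sub-layer in a 1,000-layer configuration.
    d_model, num_layers = 64, 1000
    block = ScaledResidualLayerNorm(d_model, num_layers)
    ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                        nn.Linear(4 * d_model, d_model))
    x = torch.randn(2, 16, d_model)
    print(block(x, ffn).shape)  # torch.Size([2, 16, 64])
```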