Paper Title
FoundationLayerNorm: Scaling BERT and GPT to 1,000 Layers
Paper Authors
Paper Abstract
Mainstream BERT/GPT models contain only 10 to 20 layers, and there is little literature discussing the training of deep BERT/GPT. This paper proposes a simple yet effective method to stabilize BERT and GPT training. We successfully scale BERT and GPT up to 1,000 layers, which is an order of magnitude deeper than previous BERT and GPT models. The proposed method, FoundationLayerNormalization, enables efficient training of deep neural networks and is validated at the 1,000-layer scale.
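The abstract does not spell out the FoundationLayerNorm formulation itself. As a rough illustration of how a depth-dependent layer-norm scheme can stabilize very deep Transformers, the PyTorch sketch below uses a DeepNorm-style up-weighted residual inside a post-LN block; the constant `alpha` and its exponent are assumptions for illustration, not the paper's actual method.

```python
import torch
import torch.nn as nn


class ScaledResidualLayerNorm(nn.Module):
    """Post-LN residual block with a depth-dependent residual scaling.

    Illustrative only: the scaling rule below follows the DeepNorm family
    of stabilizers for very deep Transformers, not necessarily the exact
    FoundationLayerNorm formulation described in the paper.
    """

    def __init__(self, d_model: int, num_layers: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Hypothetical depth-dependent constant: grows slowly with the total
        # number of layers so each sub-layer's update stays bounded when
        # hundreds of blocks are stacked.
        self.alpha = (2.0 * num_layers) ** 0.25

    def forward(self, x: torch.Tensor, sublayer: nn.Module) -> torch.Tensor:
        # y = LayerNorm(alpha * x + sublayer(x))
        return self.norm(self.alpha * x + sublayer(x))


if __name__ == "__main__":
    # Toy usage: wrap a feed-forward sub-layer in a 1,000-layer configuration.
    d_model, num_layers = 64, 1000
    block = ScaledResidualLayerNorm(d_model, num_layers)
    ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                        nn.Linear(4 * d_model, d_model))
    x = torch.randn(2, 16, d_model)
    print(block(x, ffn).shape)  # torch.Size([2, 16, 64])
```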