Paper Title
Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers
Paper Authors
Paper Abstract
Large-scale transformer models have become the de-facto architectures for various machine learning applications, e.g., CV and NLP. However, these large models also introduce prohibitive training costs. To mitigate this issue, we propose a novel random and layerwise token dropping method (random-LTD), which skips the computation of a subset of the input tokens at all middle layers. In particular, random-LTD achieves considerable speedups with accuracy comparable to the standard training baseline. Compared to other token dropping methods, random-LTD does not require (1) any importance-score-based metrics, (2) any special token treatment (e.g., [CLS]), or (3) full-sequence-length training for any layers other than the first and the last. In addition, a new LayerToken learning rate schedule is proposed for pretraining problems, which resolves the heavy tuning requirement of our proposed training mechanism. Finally, we demonstrate that random-LTD can be applied to broader applications, including GPT and BERT pretraining as well as ViT and GPT finetuning tasks. Our results show that random-LTD can save about 33.3% of the theoretical compute cost and 25.6% of the wall-clock training time while achieving zero-shot evaluation results on GPT-3 1.3B similar to the baseline.
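To make the described mechanism concrete, below is a minimal sketch (not the authors' released implementation) of how random layerwise token dropping could be wired into a transformer forward pass. It assumes PyTorch; the helper name `random_ltd_forward` and the arguments `layers`, `hidden`, and `keep_len` are illustrative, and each layer is assumed to map a `(batch, seq, hidden)` tensor to a tensor of the same shape.

```python
# A minimal sketch of random layerwise token dropping, assuming PyTorch.
# The first and last layers see the full sequence; every middle layer only
# computes on a random subset of `keep_len` token positions, and the dropped
# tokens bypass that layer unchanged.
import torch
import torch.nn as nn


def random_ltd_forward(layers: nn.ModuleList,
                       hidden: torch.Tensor,
                       keep_len: int) -> torch.Tensor:
    num_layers = len(layers)
    for i, layer in enumerate(layers):
        if i == 0 or i == num_layers - 1:
            # Full-sequence computation for the first and last layers.
            hidden = layer(hidden)
            continue

        _, seq_len, _ = hidden.shape
        # Randomly choose which token positions this layer will compute on.
        kept_idx = torch.randperm(seq_len, device=hidden.device)[:keep_len]
        kept = hidden[:, kept_idx, :]      # (batch, keep_len, hidden)
        kept = layer(kept)                 # layer runs on the shorter sequence

        # Write the processed tokens back; dropped tokens skip this layer.
        hidden = hidden.index_copy(1, kept_idx, kept)
    return hidden
```

A usage example would pass any stack of transformer blocks as `layers`; a smaller `keep_len` trades more compute savings in the middle layers against a larger deviation from full-sequence training.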