Paper Title
Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers
Paper Authors
Paper Abstract
Large-scale transformer models have become the de-facto architectures for various machine learning applications, e.g., CV and NLP. However, these large models also introduce prohibitive training costs. To mitigate this issue, we propose a novel random and layerwise token dropping method (random-LTD), which skips the computation of a subset of the input tokens at all middle layers. In particular, random-LTD achieves considerable speedups with accuracy comparable to the standard training baseline. Compared to other token dropping methods, random-LTD does not require (1) any importance-score-based metrics, (2) any special token treatment (e.g., [CLS]), or (3) full-sequence-length training for any layers other than the first and the last. In addition, a new LayerToken learning rate schedule is proposed for pretraining problems, which resolves the heavy tuning requirement of our proposed training mechanism. Finally, we demonstrate that random-LTD can be applied to broader applications, including GPT and BERT pretraining as well as ViT and GPT finetuning tasks. Our results show that random-LTD can save about 33.3% of the theoretical compute cost and 25.6% of the wall-clock training time while achieving zero-shot evaluation results on GPT-3 1.3B similar to the baseline.
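To make the described mechanism concrete, below is a minimal sketch (not the authors' released implementation) of how random layerwise token dropping could be wired into a transformer forward pass. It assumes PyTorch; the helper name `random_ltd_forward` and the arguments `layers`, `hidden`, and `keep_len` are illustrative, and each layer is assumed to map a `(batch, seq, hidden)` tensor to a tensor of the same shape.

```python
# A minimal sketch of random layerwise token dropping, assuming PyTorch.
# The first and last layers see the full sequence; every middle layer only
# computes on a random subset of `keep_len` token positions, and the dropped
# tokens bypass that layer unchanged.
import torch
import torch.nn as nn


def random_ltd_forward(layers: nn.ModuleList,
                       hidden: torch.Tensor,
                       keep_len: int) -> torch.Tensor:
    num_layers = len(layers)
    for i, layer in enumerate(layers):
        if i == 0 or i == num_layers - 1:
            # Full-sequence computation for the first and last layers.
            hidden = layer(hidden)
            continue

        _, seq_len, _ = hidden.shape
        # Randomly choose which token positions this layer will compute on.
        kept_idx = torch.randperm(seq_len, device=hidden.device)[:keep_len]
        kept = hidden[:, kept_idx, :]      # (batch, keep_len, hidden)
        kept = layer(kept)                 # layer runs on the shorter sequence

        # Write the processed tokens back; dropped tokens skip this layer.
        hidden = hidden.index_copy(1, kept_idx, kept)
    return hidden
```

A usage example would pass any stack of transformer blocks as `layers`; a smaller `keep_len` trades more compute savings in the middle layers against a larger deviation from full-sequence training.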