Paper Title

Training Trajectories of Language Models Across Scales

Paper Authors

Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, Ves Stoyanov

Abstract

Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022) -- from 125M to 175B parameters -- on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model sizes, a similar subset of training tokens see the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of the model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
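The analysis centers on per-token loss and sequence perplexity measured across model checkpoints. Below is a minimal sketch, not the authors' code, of how such quantities can be computed for an OPT model with Hugging Face transformers and PyTorch. The intermediate training checkpoints studied in the paper are not part of the public OPT release, so the final facebook/opt-125m weights and the example sentence here are illustrative stand-ins.

```python
# Minimal sketch: per-token loss and perplexity for an OPT checkpoint.
# Assumptions: the public facebook/opt-125m release is used in place of the
# intermediate checkpoints analyzed in the paper; the input text is arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # smallest model in the 125M-175B range
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Scaling up language models has led to unprecedented performance gains."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Shift so that the logits at position t predict the token at position t+1.
shift_logits = logits[:, :-1, :]
shift_labels = inputs["input_ids"][:, 1:]

# Per-token negative log-likelihood (cross-entropy without reduction).
loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
per_token_loss = loss_fn(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
)

# Sequence-level perplexity is the exponential of the mean per-token loss.
perplexity = per_token_loss.mean().exp()
print("per-token loss:", [round(x, 3) for x in per_token_loss.tolist()])
print(f"perplexity: {perplexity.item():.2f}")
```

Repeating this measurement over a series of checkpoints (and over larger OPT variants) is what lets the paper plot how individual tokens' losses and overall perplexity evolve with training compute and model size.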
