Paper Title

Shortformer: Better Language Modeling using Shorter Inputs

Authors

Ofir Press, Noah A. Smith, Mike Lewis

Abstract

Increasing the input length has been a driver of progress in language modeling with transformers. We identify conditions where shorter inputs are not harmful, and achieve perplexity and efficiency improvements through two new methods that decrease input length. First, we show that initially training a model on short subsequences before moving on to longer ones both reduces overall training time and, surprisingly, substantially improves perplexity. Second, we show how to improve the efficiency of recurrence methods in transformers, which let models condition on previously processed tokens when generating sequences that exceed the maximal length the transformer can handle at once. Existing methods require computationally expensive relative position embeddings; we introduce a simple alternative of adding absolute position embeddings to queries and keys instead of to word embeddings, which efficiently produces superior results. We show that these recurrent models also benefit from short input lengths. Combining these techniques speeds up training by a factor of 1.65, reduces memory usage, and substantially improves perplexity on WikiText-103, without adding any parameters.
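To make the second idea concrete, below is a minimal, single-head sketch (in PyTorch) of attention where absolute position embeddings are added to the queries and keys instead of to the word embeddings. This is not the authors' released implementation: the names (`PositionInfusedAttention`, `sinusoidal_positions`), the choice of fixed sinusoidal embeddings, and the omission of multi-head projections and of the cache that enables recurrence are simplifications for illustration only.

```python
# Illustrative sketch of position-infused queries/keys; not the paper's code.
import math
import torch
import torch.nn as nn


def sinusoidal_positions(max_len: int, dim: int) -> torch.Tensor:
    """Fixed sinusoidal position embeddings, shape (max_len, dim); dim assumed even."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    inv_freq = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    pe = torch.zeros(max_len, dim)
    pe[:, 0::2] = torch.sin(pos * inv_freq)
    pe[:, 1::2] = torch.cos(pos * inv_freq)
    return pe


class PositionInfusedAttention(nn.Module):
    """Single-head causal self-attention; positions enter only the queries and keys."""

    def __init__(self, dim: int, max_len: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.register_buffer("pos", sinusoidal_positions(max_len, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) word representations WITHOUT positions added.
        seq_len, dim = x.size(1), x.size(2)
        pos = self.pos[:seq_len]                  # (seq_len, dim)
        q = self.q_proj(x + pos)                  # positions go into the queries...
        k = self.k_proj(x + pos)                  # ...and the keys,
        v = self.v_proj(x)                        # ...but not into the values.
        scores = q @ k.transpose(-2, -1) / math.sqrt(dim)
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), 1
        )
        scores = scores.masked_fill(causal, float("-inf"))
        return self.out_proj(scores.softmax(dim=-1) @ v)
```

Because the values, and hence the layer outputs, carry no positional information in this scheme, previously computed representations can be reused when conditioning on tokens beyond the current window. The first technique, training on short subsequences before switching to longer ones, changes only the training schedule, not the model.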
