Paper Title
Pay Attention when Required
Paper Authors
Paper Abstract
Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and relatively more expensive self-attention blocks, which capture context meaning. In this paper, we explore trade-offs and the ordering of these blocks to improve upon the current Transformer architecture, and we propose the PAR Transformer. It needs 35% lower compute time than Transformer-XL, achieved by replacing ~63% of the self-attention blocks with feed-forward blocks, while retaining perplexity on the WikiText-103 language modelling benchmark. We further validated our results on the text8 and enwiki8 datasets, as well as on the BERT model.
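
The core claim is that self-attention is not needed in every layer: most attention blocks can be swapped for cheaper feed-forward blocks without hurting perplexity. The PyTorch sketch below illustrates this block-ordering idea. It is a minimal illustration, not the authors' released implementation: the PARStack class, the layout string, the specific block placement, and all hyperparameters are assumptions chosen for demonstration, and causal masking is omitted for brevity.

# Illustrative sketch of a PAR-style block layout (assumptions, not the paper's code).
import torch
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    # Pre-norm residual feed-forward block: the cheap "content" block.
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return x + self.ff(self.norm(x))

class SelfAttentionBlock(nn.Module):
    # Pre-norm residual self-attention block: the expensive "context" block.
    # Causal masking is omitted here for brevity.
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class PARStack(nn.Module):
    # Builds a stack from a layout string: 's' = self-attention, 'f' = feed-forward.
    # A standard Transformer alternates strictly ('sfsf...'); a PAR-style stack
    # keeps only a subset of attention blocks.
    def __init__(self, layout, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.blocks = nn.ModuleList(
            SelfAttentionBlock(d_model, n_heads) if c == "s"
            else FeedForwardBlock(d_model, d_ff)
            for c in layout
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

# Hypothetical 16-block layout with 5 attention blocks (~31% of blocks),
# versus 8 attention blocks in a strictly interleaved baseline.
par = PARStack("ssfsffsffffsffff")
x = torch.randn(2, 128, 512)  # (batch, sequence, d_model)
print(par(x).shape)           # torch.Size([2, 128, 512])

Since a feed-forward block is much cheaper than self-attention at long sequence lengths, shifting the mix toward feed-forward blocks is where the reported 35% compute saving comes from.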