Paper Title

Pay Attention when Required

Authors

Swetha Mandava, Szymon Migacz, Alex Fit-Florea

Abstract

Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and relatively more expensive self-attention blocks, which capture context meaning. In this paper, we explored trade-offs and ordering of the blocks to improve upon the current Transformer architecture and proposed the PAR Transformer. It requires 35% lower compute time than Transformer-XL, achieved by replacing ~63% of the self-attention blocks with feed-forward blocks, and retains the perplexity on the WikiText-103 language modelling benchmark. We further validated our results on the text8 and enwiki8 datasets, as well as on the BERT model.
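
To make the block trade-off concrete, below is a minimal PyTorch sketch of the idea: keep the total number of blocks fixed, but let only a fraction of them be self-attention blocks and make the rest cheaper feed-forward blocks. The `FeedForwardBlock`, `SelfAttentionBlock`, and `build_stack` names and the pattern strings are hypothetical illustrations, not the authors' implementation; the block ordering actually used in the paper is determined separately rather than hand-written like this.

```python
import torch
import torch.nn as nn


class FeedForwardBlock(nn.Module):
    """Position-wise feed-forward block; the cheaper block that captures content meaning."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return x + self.ff(self.norm(x))


class SelfAttentionBlock(nn.Module):
    """Self-attention block; the relatively more expensive block that captures context meaning."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


def build_stack(pattern, d_model=512, d_ff=2048, n_heads=8):
    """Assemble a block stack from a pattern string: 's' = self-attention, 'f' = feed-forward."""
    blocks = [
        SelfAttentionBlock(d_model, n_heads) if c == "s" else FeedForwardBlock(d_model, d_ff)
        for c in pattern
    ]
    return nn.Sequential(*blocks)


# Baseline interleaving: attention and feed-forward blocks strictly alternate.
baseline = build_stack("sf" * 8)            # 16 blocks, 8 of them self-attention
# Hypothetical PAR-style pattern: same depth, but ~63% of the attention blocks replaced.
par_like = build_stack("sfffffsfffffsfff")  # 16 blocks, only 3 of them self-attention

x = torch.randn(2, 128, 512)  # (batch, sequence, d_model)
print(par_like(x).shape)      # torch.Size([2, 128, 512])
```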
