Paper Title


MaiT: Leverage Attention Masks for More Efficient Image Transformers

Authors

Ling Li, Ali Shafiee Ardestani, Joseph Hassoun

Abstract


Though image transformers have shown competitive results with convolutional neural networks in computer vision tasks, the lack of inductive biases such as locality still poses problems in terms of model efficiency, especially for embedded applications. In this work, we address this issue by introducing attention masks to incorporate spatial locality into self-attention heads. Local dependencies are captured efficiently by masked attention heads, while global dependencies are captured by unmasked attention heads. With the Masked attention image Transformer (MaiT), top-1 accuracy increases by up to 1.7% compared to CaiT with fewer parameters and FLOPs, and throughput improves by up to 1.5X compared to Swin. Encoding locality with attention masks is model agnostic, and thus it applies to monolithic, hierarchical, or other novel transformer architectures.
