Paper title
Active Token Mixer
Paper authors
Paper abstract
The three dominant network families, i.e., CNNs, Transformers, and MLPs, differ from each other mainly in how they fuse spatial contextual information, which places the design of more effective token-mixing mechanisms at the core of backbone architecture development. In this work, we propose an innovative token mixer, dubbed Active Token Mixer (ATM), that actively incorporates flexible contextual information from other tokens, distributed across different channels, into the given query token. This fundamental operator actively predicts where to capture useful contexts and learns how to fuse the captured contexts with the query token at the channel level. In this way, the spatial range of token mixing can be expanded to a global scope with limited computational complexity, reforming the way tokens are mixed. We take ATM as the primary operator and assemble ATMs into a cascade architecture, dubbed ATMNet. Extensive experiments demonstrate that ATMNet is generally applicable and comprehensively surpasses different families of SOTA vision backbones by a clear margin on a broad range of vision tasks, including visual recognition and dense prediction tasks. Code is available at https://github.com/microsoft/ActiveMLP.
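The two steps named in the abstract, predicting where to capture context and fusing it with the query token per channel, can be illustrated with a toy sketch. This is not the ATMNet implementation. It assumes a 1-D token sequence, integer offsets produced by a hypothetical linear predictor (`offset_weights`, `offset_bias`), and fixed per-channel fusion weights (`fuse_weights`) standing in for learned ones. All function and parameter names are illustrative.

```python
def clamp(index, low, high):
    """Keep a gathered index inside the valid token range."""
    return max(low, min(high, index))


def predict_offsets(query, offset_weights, offset_bias):
    """Hypothetical offset predictor: one integer offset per channel,
    computed as a linear map of the query token plus a bias."""
    offsets = []
    for row, bias in zip(offset_weights, offset_bias):
        score = sum(w * q for w, q in zip(row, query)) + bias
        offsets.append(round(score))
    return offsets


def active_token_mix(tokens, offset_weights, offset_bias, fuse_weights):
    """Toy channel-wise token mixing over a 1-D sequence.

    tokens:       list of N tokens, each a list of C channel values
    fuse_weights: per-channel blend factor in [0, 1] (assumed fixed here;
                  a trained model would learn how to fuse)
    """
    num_tokens = len(tokens)
    num_channels = len(tokens[0])
    mixed = []
    for i, query in enumerate(tokens):
        # Step 1: decide, per channel, which token to read context from.
        offsets = predict_offsets(query, offset_weights, offset_bias)
        out = []
        for c in range(num_channels):
            source = clamp(i + offsets[c], 0, num_tokens - 1)
            context = tokens[source][c]
            # Step 2: fuse the gathered context with the query, per channel.
            alpha = fuse_weights[c]
            out.append((1 - alpha) * query[c] + alpha * context)
        mixed.append(out)
    return mixed


if __name__ == "__main__":
    tokens = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    # Zero weights and a constant bias: channel 0 looks one token ahead,
    # channel 1 looks one token back.
    offset_weights = [[0.0, 0.0], [0.0, 0.0]]
    offset_bias = [1, -1]
    fuse_weights = [1.0, 0.5]
    print(active_token_mix(tokens, offset_weights, offset_bias, fuse_weights))
```

Because each channel gathers from exactly one predicted location, the cost per token grows with the number of channels rather than with the number of tokens, which is one way to read the abstract's claim of global reach at limited computational complexity.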