Paper Title
Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention
Paper Authors
Paper Abstract
Self-attention is a widely used building block in neural modeling to mix long-range data elements. Most self-attention neural networks employ pairwise dot-products to specify the attention coefficients. However, these methods require $O(N^2)$ computing cost for a sequence of length $N$. Even though some approximation methods have been introduced to alleviate the quadratic cost, the performance of the dot-product approach is still bottlenecked by the low-rank constraint in the attention matrix factorization. In this paper, we propose a novel scalable and effective mixing building block called Paramixer. Our method factorizes the interaction matrix into several sparse matrices, where we parameterize the non-zero entries by MLPs with the data elements as input. The overall computing cost of the new building block is as low as $O(N \log N)$. Moreover, all factorizing matrices in Paramixer are full-rank, so it does not suffer from the low-rank bottleneck. We have tested the new method on both synthetic and various real-world long sequential data sets and compared it with several state-of-the-art attention networks. The experimental results show that Paramixer has better performance in most learning tasks.
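To make the factorization idea concrete, below is a minimal sketch of mixing a length-$N$ sequence through a chain of sparse factors whose non-zero links are produced by small MLPs from the data elements. The specific sparsity pattern (circular power-of-two offsets), the two-link parameterization, and all names in the code are illustrative assumptions, not details taken from the paper; the sketch only shows how such a chain reaches every pair of positions at $O(N \log N)$ cost.

```python
# Minimal sketch of sparse-factor mixing with MLP-parameterized links.
# Assumptions (not from the abstract): a circular power-of-two offset pattern
# and one small MLP per factor producing two link weights per position.
import math
import torch
import torch.nn as nn

class SparseFactorMixer(nn.Module):
    """Mixes a length-N sequence with log2(N) sparse factors.

    Factor k links position i to position (i + 2^k) mod N. The chain of all
    factors connects every pair of positions, while each factor touches only
    O(N) entries, so the total cost scales as O(N log N).
    """
    def __init__(self, n: int, dim: int, hidden: int = 64):
        super().__init__()
        assert n & (n - 1) == 0, "sketch assumes N is a power of two"
        self.n, self.dim = n, dim
        self.num_factors = int(math.log2(n))
        # One MLP per factor: maps an element's features to its two link
        # weights (on itself and on its partner 2^k steps away).
        self.link_mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))
            for _ in range(self.num_factors)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim)
        out = x
        for k, mlp in enumerate(self.link_mlps):
            offset = 2 ** k
            partner = torch.roll(out, shifts=-offset, dims=1)  # element 2^k ahead (circular)
            w = mlp(out)                                       # (batch, N, 2) data-dependent link weights
            out = w[..., :1] * out + w[..., 1:] * partner      # apply one sparse factor
        return out

x = torch.randn(4, 1024, 32)
y = SparseFactorMixer(n=1024, dim=32)(x)
print(y.shape)  # torch.Size([4, 1024, 32])
```

Because every offset is a distinct power of two, any pairwise distance can be composed across the chain, which is why none of the factors needs to be dense and no low-rank approximation of the full interaction matrix is involved.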