Paper Title
$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Paper Authors
Paper Abstract
Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has prompted recent research into sparse Transformers that sparsify the connections in the attention layers. While empirically promising for long sequences, fundamental questions remain unanswered: Can sparse Transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How do the sparsity pattern and the sparsity level affect their performance? In this paper, we address these questions and provide a unifying framework that captures existing sparse attention models. We propose sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function. Surprisingly, our results show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections. Lastly, we present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
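As a rough illustration of the kind of sparsity the abstract refers to, the sketch below implements single-head attention in which each query attends only to keys within a fixed local window, so the layer uses $O(n \cdot \text{window}) = O(n)$ connections instead of $n^2$. The sliding-window pattern, the `window` parameter, and the function name are illustrative assumptions and are not the paper's specific construction; for clarity the dense score matrix is still computed and then masked, whereas an efficient implementation would only evaluate the allowed pairs.

```python
import numpy as np

def windowed_sparse_attention(X, W_q, W_k, W_v, window=2):
    """Single-head attention with a local-window sparsity pattern.

    Each position attends only to positions within `window` steps of it,
    so the number of attended (query, key) pairs grows as O(n), not n^2.
    Illustrative sketch only, not the paper's construction.
    """
    n, d = X.shape
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)                 # dense (n, n) scores, masked below

    # Sparsity pattern: keep only pairs (i, j) with |i - j| <= window.
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)

    # Row-wise softmax over the allowed connections only.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage with random weights (hypothetical shapes for illustration).
rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = windowed_sparse_attention(X, W_q, W_k, W_v, window=2)
print(out.shape)  # (8, 4)
```

With `window=2`, each of the $n$ positions attends to at most 5 others, which is the sense in which such a layer has only $O(n)$ connections; the paper's result concerns when patterns of this kind retain universal approximability.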