Paper title
A Probabilistic Interpretation of Transformers
Paper authors
Paper abstract
We propose a probabilistic interpretation, based on exponential families, of the exponential dot-product attention of transformers and of contrastive learning. The attention sublayer of transformers is equivalent to a gradient ascent step on the log normalizer, which is the log-sum-exp term in the Hopfield theory of attention. This ascent step induces a parallel expansion of points, which is counterbalanced by a contraction from layer normalization. We also state the theoretical limitations of our theory and of the Hopfield theory, and suggest directions for resolving them.
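The claimed equivalence between attention and a gradient ascent step can be checked numerically in a simplified setting. The sketch below is illustrative and not taken from the paper: it assumes a single query, values tied to the keys (as in the Hopfield view), no learned projection matrices, and a residual connection playing the role of a step of size 1. The function names `log_normalizer`, `attention`, and `numerical_grad`, and the toy keys and query, are hypothetical.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def log_normalizer(q, keys):
    """Log-sum-exp of the dot products between the query and every key."""
    scores = [dot(k, q) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in scores))

def attention(q, keys):
    """Exponential dot-product attention with values tied to keys (assumption)."""
    scores = [dot(k, q) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Softmax-weighted average of the keys
    return [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(len(q))]

def numerical_grad(f, q, eps=1e-6):
    """Central finite-difference gradient of f at q."""
    grad = []
    for i in range(len(q)):
        plus, minus = list(q), list(q)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad

# Toy data chosen only for illustration
keys = [[1.0, 0.5, -0.3], [-0.2, 0.8, 0.1], [0.4, -0.6, 0.9], [0.0, 0.3, -0.7]]
q = [0.2, -0.1, 0.5]

grad = numerical_grad(lambda x: log_normalizer(x, keys), q)
attn = attention(q, keys)
max_abs_diff = max(abs(g - a) for g, a in zip(grad, attn))

# Residual update q + attention(q): a gradient ascent step with step size 1
q_next = [qi + ai for qi, ai in zip(q, attn)]
increased = log_normalizer(q_next, keys) > log_normalizer(q, keys)
```

Under these assumptions, `max_abs_diff` should be close to zero, since the gradient of the log-sum-exp with respect to the query is the softmax-weighted average of the keys. Because log-sum-exp is convex, the residual update cannot decrease the log normalizer, which is what `increased` checks.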