Paper Title
Outliers Dimensions that Disrupt Transformers Are Driven by Frequency
Paper Authors
Paper Abstract
While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequency of encoded tokens in pre-training data, and it also contributes to the "vertical" self-attention pattern enabling the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that would better take into account the skewed token distributions.
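As a concrete illustration of the outlier-disabling setup the abstract refers to, below is a minimal sketch using PyTorch and Hugging Face Transformers. It assumes, following prior work on the outlier phenomenon, that "disabling" means zeroing the LayerNorm scaling factor and bias at one hidden-state index in every encoder layer of BERT-base (12 layers × 2 LayerNorms × 2 parameters = 48 parameters); the index `OUTLIER_DIM` is illustrative, not taken from the paper.

```python
# A minimal sketch (not the authors' code) of disabling one outlier dimension
# in BERT-base by zeroing the LayerNorm gain and bias at that index in every
# encoder layer. OUTLIER_DIM is a hypothetical, illustrative index.
import torch
from transformers import BertModel, BertTokenizer

OUTLIER_DIM = 308  # hypothetical outlier dimension index (assumption)

model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

with torch.no_grad():
    for layer in model.encoder.layer:
        # Each encoder layer has two LayerNorms: after self-attention and after the FFN.
        for ln in (layer.attention.output.LayerNorm, layer.output.LayerNorm):
            ln.weight[OUTLIER_DIM] = 0.0  # scaling factor (gamma)
            ln.bias[OUTLIER_DIM] = 0.0    # shift (beta)
        # 12 layers x 2 LayerNorms x 2 parameters = 48 disabled parameters in total.

# Quick check: the disabled coordinate of the final hidden states is forced to zero,
# removing the high-magnitude outlier activations discussed in the abstract.
inputs = tokenizer("Outlier dimensions are driven by token frequency.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
print(hidden[0, :, OUTLIER_DIM])
```

With the modified model, one could then run a downstream evaluation (e.g., an MNLI classifier built on this encoder) to observe the performance drop the abstract describes; that evaluation harness is not shown here.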