Paper Title

Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)

Paper Authors

Alex Warstadt, Yian Zhang, Haau-Sing Li, Haokun Liu, Samuel R. Bowman

Paper Abstract

One reason pretraining on self-supervised linguistic tasks is effective is that it teaches models features that are helpful for language understanding. However, we want pretrained models to learn not only to represent linguistic features, but also to use those features preferentially during fine-tuning. With this goal in mind, we introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set), which consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning. We pretrain RoBERTa models from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to the publicly available RoBERTa-base. We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones. Eventually, with about 30B words of pretraining data, RoBERTa-base does demonstrate a linguistic bias with some regularity. We conclude that while self-supervised pretraining is an effective way to learn helpful inductive biases, there is likely room to improve the rate at which models learn which features matter.
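
Below is a minimal sketch, not the authors' released code, of the kind of fine-tune-and-probe setup the abstract describes: fine-tune a pretrained RoBERTa model on a binary classification task where a surface cue and a linguistic cue both predict the label, then test on a disambiguating item where the two cues disagree. It assumes the standard Hugging Face `transformers` and PyTorch interfaces; the example sentences and cue pairing are hypothetical illustrations, not items from MSGS.

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Load the publicly available RoBERTa-base checkpoint with a binary classification head.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Hypothetical ambiguous training data: in every positive example a surface cue
# (the word "the") and a linguistic cue (a past-tense main verb) co-occur,
# so either feature alone would explain the labels.
train_sentences = ["The dog barked.", "The bird sang.", "A cat might sleep.", "A child will laugh."]
train_labels = torch.tensor([1, 1, 0, 0])

enc = tokenizer(train_sentences, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for _ in range(3):  # a few illustrative optimization steps, not a full training run
    out = model(**enc, labels=train_labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Disambiguating test item: the linguistic cue is present but the surface cue is not.
# The prediction reveals which generalization the fine-tuned model relies on.
model.eval()
with torch.no_grad():
    test = tokenizer(["A dog barked."], return_tensors="pt")
    prediction = model(**test).logits.argmax(dim=-1)
print(prediction)  # 1 suggests a linguistic generalization; 0 suggests a surface one
```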
