Paper Title
The Diminishing Returns of Masked Language Models to Science
Paper Authors
Paper Abstract
Transformer-based masked language models such as BERT, trained on general corpora, have shown impressive performance on downstream tasks. It has also been demonstrated that the downstream task performance of such models can be improved by pretraining larger models for longer on more data. In this work, we empirically evaluate the extent to which these results extend to tasks in science. We use 14 domain-specific transformer-based models (including ScholarBERT, a new 770M-parameter science-focused masked language model pretrained on up to 225B tokens) to evaluate the impact of training data, model size, and pretraining and finetuning time on 12 downstream scientific tasks. Interestingly, we find that increasing model size, training data, or compute time does not always lead to significant improvements (i.e., >1% F1), if any, in scientific information extraction tasks, and we offer possible explanations for the surprising performance differences.
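To make the evaluation setup concrete, below is a minimal sketch of finetuning a masked language model for a token-level scientific information-extraction task and reporting F1. This assumes a Hugging Face-style workflow; the checkpoint name, label count, and toy inputs are placeholders for illustration, not the authors' actual configuration.

    # A minimal sketch, assuming a Hugging Face-style finetuning setup.
    # The checkpoint and label set are placeholders, not the paper's configuration.
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    checkpoint = "bert-base-uncased"   # placeholder for a domain-specific masked LM
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=3)

    # Toy example: one sentence with all-"O" token labels, just to show the training signal.
    enc = tokenizer("Sodium chloride dissolves in water.", return_tensors="pt")
    labels = torch.zeros_like(enc["input_ids"])   # label tensor must match the tokenized shape
    loss = model(**enc, labels=labels).loss       # cross-entropy over per-token labels
    loss.backward()                               # an optimizer step would follow in real finetuning

Downstream quality is then scored as F1 over the predicted token labels on held-out test data (for example with seqeval or sklearn.metrics.f1_score), which is the metric the abstract refers to when noting improvements of less than 1% F1.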