Paper Title

Rethinking embedding coupling in pre-trained language models

Paper Authors

Hyung Won Chung, Thibault Févry, Henry Tsai, Melvin Johnson, Sebastian Ruder

Paper Abstract

We re-evaluate the standard practice of sharing weights between input and output embeddings in state-of-the-art pre-trained language models. We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation in the input embedding of multilingual models. By reallocating the input embedding parameters in the Transformer layers, we achieve dramatically better performance on standard natural language understanding tasks with the same number of parameters during fine-tuning. We also show that allocating additional capacity to the output embedding provides benefits to the model that persist through the fine-tuning stage even though the output embedding is discarded after pre-training. Our analysis shows that larger output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages. Harnessing these findings, we are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.
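The central architectural idea of the abstract, decoupling the input embedding from the output (softmax) embedding so that the two matrices can be sized independently, can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch illustration, not the authors' implementation; the class name `DecoupledEmbeddingLM` and all dimensions are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn


class DecoupledEmbeddingLM(nn.Module):
    """Sketch of an encoder LM with decoupled input/output embeddings.

    With tied embeddings, one (vocab_size, hidden_dim) matrix serves both as
    the input lookup table and as the output softmax projection. Decoupling
    them lets the two matrices be sized independently: a narrow input
    embedding frees parameters for the Transformer body, while a wide output
    embedding is used only during pre-training and discarded afterwards.
    """

    def __init__(self, vocab_size, input_dim, hidden_dim, output_dim,
                 num_layers=4, num_heads=8):
        super().__init__()
        # Narrow input embedding, projected up to the Transformer width.
        self.input_embedding = nn.Embedding(vocab_size, input_dim)
        self.input_proj = nn.Linear(input_dim, hidden_dim)

        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        # Wide output embedding: used only for the pre-training softmax and
        # dropped before fine-tuning (only the encoder above is kept).
        self.output_proj = nn.Linear(hidden_dim, output_dim)
        self.output_embedding = nn.Linear(output_dim, vocab_size, bias=False)

    def forward(self, token_ids):
        h = self.input_proj(self.input_embedding(token_ids))
        h = self.encoder(h)
        return self.output_embedding(self.output_proj(h))  # vocabulary logits


# Example: a narrow 128-dim input embedding and a wide 768-dim output
# embedding around a 512-dim Transformer body (all sizes are illustrative).
model = DecoupledEmbeddingLM(vocab_size=250_000, input_dim=128,
                             hidden_dim=512, output_dim=768)
logits = model(torch.randint(0, 250_000, (2, 16)))  # (batch, seq, vocab)
```

In this setup, a large multilingual vocabulary can be served by a narrow input embedding whose saved parameters are reallocated to the Transformer layers, while the wide output embedding exists only for the pre-training objective and does not add to the parameter count at fine-tuning time.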
