Pre-Training a Graph Recurrent Network for Language Representation

Paper Title

Pre-Training a Graph Recurrent Network for Language Representation

Authors

Yile Wang, Linyi Yang, Zhiyang Teng, Ming Zhou, Yue Zhang

Abstract

Transformer-based pre-trained models have advanced considerably in recent years and have become one of the most important backbones in natural language processing. Recent work suggests that the attention mechanism inside the Transformer may not be necessary; both convolutional neural networks and multi-layer perceptron based models have been investigated as Transformer alternatives. In this paper, we consider a graph recurrent network for language model pre-training, which builds a graph structure for each sequence with local token-level communications, together with a sentence-level representation decoupled from the other tokens. The original model performs well on domain-specific text classification under supervised training; however, its potential for learning transferable knowledge in a self-supervised way has not been fully exploited. We fill this gap by optimizing the architecture and verifying its effectiveness on more general language understanding tasks in both English and Chinese. In terms of efficiency, our model has linear complexity, in contrast to the quadratic complexity of Transformer-based models, and performs inference more efficiently. Moreover, we find that our model generates more diverse outputs with less contextualized feature redundancy than existing attention-based models.
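
The linear-versus-quadratic contrast in the abstract can be illustrated with a rough count of how many token-to-token messages are exchanged per update step. The sketch below is not the authors' implementation: the function names, the `window` parameter, and the assumption that each token talks only to its immediate neighbors plus a single sentence-level node are all hypothetical choices made for illustration.

```python
def local_graph_edges(n_tokens: int, window: int = 1) -> int:
    """Count directed message edges for one update step in a hypothetical
    local graph: each token reads from up to `window` neighbors on each
    side, and a single sentence-level node exchanges messages with every
    token. The window size is an assumption, not taken from the paper."""
    token_edges = 0
    for i in range(n_tokens):
        left = min(window, i)                   # neighbors available on the left
        right = min(window, n_tokens - 1 - i)   # neighbors available on the right
        token_edges += left + right
    sentence_edges = 2 * n_tokens  # token -> sentence node and sentence node -> token
    return token_edges + sentence_edges


def full_attention_edges(n_tokens: int) -> int:
    """Count directed edges when every token attends to every token
    (including itself), as in standard self-attention."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    for n in (4, 8, 1000):
        print(n, local_graph_edges(n), full_attention_edges(n))
```

Under these assumptions the local graph grows roughly as 4n messages per step, while full self-attention grows as n squared, which is the intuition behind the efficiency claim.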

Paper Title