Paper Title
E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation
Paper Authors
Paper Abstract
Sequence-to-sequence (seq2seq) learning is a popular paradigm for large-scale language model pretraining. However, prior seq2seq pretraining models generally focus on reconstruction objectives on the decoder side and neglect the effect of encoder-side supervision, which we argue may lead to sub-optimal performance. To verify our hypothesis, we first empirically study the functionalities of the encoder and decoder in seq2seq pretrained language models, and find that the encoder plays an important yet under-exploited role relative to the decoder in terms of downstream performance and neuron activation. We therefore propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2, which improves seq2seq models by integrating more efficient self-supervised information into the encoder. Specifically, E2S2 adopts two self-supervised objectives on the encoder side: 1) locally denoising the corrupted sentence (the denoising objective); and 2) globally learning better sentence representations (the contrastive objective). With the help of both objectives, the encoder can effectively distinguish noise tokens and capture high-level (i.e., syntactic and semantic) knowledge, thus strengthening the ability of the seq2seq model to perform accurate conditional generation. On a wide range of downstream natural language understanding and generation tasks, E2S2 consistently improves the performance of its powerful backbone models, e.g., BART and T5. For example, on the BART backbone, we achieve a +1.1% average gain on the General Language Understanding Evaluation (GLUE) benchmark and a +1.75% F_0.5 score improvement on the CoNLL2014 dataset. We also provide in-depth analyses showing that the improvements stem from better linguistic representations. We hope that our work will foster future research on self-supervision for seq2seq language model pretraining.
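To make the abstract's training recipe concrete, below is a minimal PyTorch sketch of how the two encoder-side objectives (token-level denoising and a sentence-level contrastive loss) could be combined with the usual decoder-side reconstruction loss of a BART/T5-style model. The function name, the weighting coefficients, the reuse of the model's LM head for encoder-side token prediction, and the InfoNCE-style in-batch contrastive formulation are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of an E2S2-style combined loss, assuming a Hugging Face
# BartForConditionalGeneration / T5ForConditionalGeneration backbone.
import torch
import torch.nn.functional as F


def e2s2_loss(model, batch, lambda_denoise=1.0, lambda_contrast=1.0, tau=0.1):
    # 1) Standard decoder-side reconstruction loss (BART/T5 style).
    out = model(
        input_ids=batch["corrupted_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["target_ids"],
        output_hidden_states=True,
    )
    seq2seq_loss = out.loss
    enc_hidden = out.encoder_hidden_states[-1]           # (B, L, H)

    # 2) Encoder-side denoising: predict the original token at each corrupted
    #    position directly from the encoder states (LM-head reuse is an assumption).
    logits = model.lm_head(enc_hidden)
    denoise_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        batch["original_ids"].view(-1),
        ignore_index=-100,                                # only corrupted positions carry labels
    )

    # 3) Encoder-side contrastive objective: pull two corrupted views of the
    #    same sentence together, push other in-batch sentences away (InfoNCE).
    z1 = enc_hidden.mean(dim=1)                           # sentence representation, view 1
    z2 = model.get_encoder()(
        input_ids=batch["corrupted_ids_view2"],
        attention_mask=batch["attention_mask_view2"],
    ).last_hidden_state.mean(dim=1)                       # sentence representation, view 2
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / tau                               # (B, B) similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)
    contrast_loss = F.cross_entropy(sim, targets)

    return seq2seq_loss + lambda_denoise * denoise_loss + lambda_contrast * contrast_loss
```

The key design point reflected here is that both auxiliary losses supervise the encoder only, so the decoder's reconstruction objective is left unchanged and the method can be layered on top of an existing BART or T5 pretraining setup.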