Paper Title

Leveraging Natural Supervision for Language Representation Learning and Generation

Author

Chen, Mingda

Abstract

Recent breakthroughs in Natural Language Processing (NLP) have been driven by language models trained on massive amounts of plain text. While these models are powerful, how best to derive supervision from textual resources remains an open question. For example, language model pretraining often neglects the rich, freely available structures in textual data. In this thesis, we describe three lines of work that seek to improve the training and evaluation of neural models using naturally occurring supervision. We first investigate self-supervised training losses that help enhance the performance of pretrained language models on various NLP tasks. Specifically, we alter the sentence prediction loss to make it better suited to other pretraining losses and more challenging to solve. We also design an intermediate finetuning step that uses self-supervised training to promote models' cross-task generalization. Then we describe methods to leverage the structures in Wikipedia and paraphrases. In particular, we propose training losses that exploit hyperlinks, article structures, and article category graphs to learn entity-, discourse-, and entailment-related knowledge. We also propose a framework that uses paraphrase pairs to disentangle semantics and syntax in sentence representations, and we extend the framework to a novel generation task that controls the syntax of output text with a sentential exemplar. Lastly, we discuss our work on tailoring textual resources to establish challenging evaluation tasks. We introduce three datasets by defining novel tasks over various fan-contributed websites: a long-form data-to-text generation dataset, a screenplay summarization dataset, and a long-form story generation dataset. These datasets have unique characteristics that pose challenges to future work in their respective task settings.
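To make one of these objectives concrete, below is a minimal, hypothetical sketch of a sentence-ordering style pretraining loss, in the spirit of the altered sentence prediction loss mentioned in the abstract. It assumes a PyTorch setup; the class name, tensor shapes, and labeling scheme are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentencePairOrderingHead(nn.Module):
    """Binary head over a sentence-pair representation: predict whether
    two consecutive sentences appear in their original order or were
    swapped. Distinguishing order is harder than classic next-sentence
    prediction, which can often be solved from topic cues alone."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, pair_repr: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # pair_repr: (batch, hidden_size), e.g., the encoder's [CLS] vector
        # labels:    (batch,), 0 = original order, 1 = swapped
        logits = self.classifier(pair_repr)
        return F.cross_entropy(logits, labels)

# Hypothetical usage with stand-in encoder outputs:
head = SentencePairOrderingHead(hidden_size=768)
pair_repr = torch.randn(8, 768)       # stand-in for encoded sentence pairs
labels = torch.randint(0, 2, (8,))    # random order/swapped labels
loss = head(pair_repr, labels)        # scalar cross-entropy loss
loss.backward()
```

In practice, such a term would be combined with masked language modeling during pretraining; the sketch shows only the classification head and loss computation.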
