Partially-Aligned Data-to-Text Generation with Distant Supervision

Paper Title

Partially-Aligned Data-to-Text Generation with Distant Supervision

Authors

Zihao Fu, Bei Shi, Wai Lam, Lidong Bing, Zhiyuan Liu

Abstract

The Data-to-Text task aims to generate human-readable text that describes given structured data, making the data more interpretable. However, the typical generation task is confined to a few particular domains because it requires well-aligned data, which is difficult and expensive to obtain. Using partially-aligned data is an alternative way of solving the dataset scarcity problem. This kind of data is much easier to obtain because it can be produced automatically. However, using it induces the over-generation problem, which poses difficulties for existing models: they tend to add unrelated excerpts during generation. To make effective use of automatically annotated partially-aligned datasets, we extend the traditional generation task to a refined task called Partially-Aligned Data-to-Text Generation (PADTG), which is more practical because it uses automatically annotated data for training and thus considerably expands the application domains. To tackle this new task, we propose a novel distant supervision generation framework. It first estimates the input data's supportiveness for each target word with an estimator, and then applies a supportiveness adaptor and a rebalanced beam search to mitigate the over-generation problem in the training and generation phases, respectively. We also contribute a partially-aligned dataset, built by sampling sentences from Wikipedia and automatically extracting the corresponding KB triples for each sentence from Wikidata. (The data and source code of this paper can be obtained from https://github.com/fuzihaofzh/distant_supervision_nlg.) The experimental results show that our framework outperforms all baseline models and verify the feasibility of utilizing partially-aligned data.
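To make the idea of partially-aligned data concrete, the following is a minimal toy sketch of distant-supervision alignment. It is not the authors' pipeline: the matching rule (keep a triple when both its subject and object strings occur in the sentence), the function names `align_triples` and `unsupported_words`, and the three example triples are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# A KB triple in (subject, relation, object) form. The triples below are
# made-up illustrative data, not taken from the paper's dataset.
Triple = Tuple[str, str, str]


def align_triples(sentence: str, kb: List[Triple]) -> List[Triple]:
    """Assumed heuristic: keep triples whose subject and object both occur
    as substrings of the sentence (case-insensitive)."""
    text = sentence.lower()
    return [t for t in kb if t[0].lower() in text and t[2].lower() in text]


def unsupported_words(sentence: str, triples: List[Triple]) -> List[str]:
    """Return sentence words that appear in no aligned subject or object.
    Function words are included; no stopword filtering is attempted."""
    covered = set()
    for subj, _, obj in triples:
        covered.update(subj.lower().split())
        covered.update(obj.lower().split())
    return [w for w in sentence.rstrip(".").split() if w.lower() not in covered]


kb = [
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Marie Curie", "award received", "Nobel Prize in Physics"),
    ("Warsaw", "country", "Poland"),
]
sentence = "Marie Curie was born in Warsaw and later moved to Paris."

aligned = align_triples(sentence, kb)
extra = unsupported_words(sentence, aligned)
print(aligned)
print(extra)
```

In this toy case only the place-of-birth triple is aligned, while the sentence also mentions Paris, which no aligned triple supports. A model trained on such pairs sees target words with no backing in the input, which is one way to picture the over-generation problem the abstract describes.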
