Paper Title
XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages
Paper Authors
Paper Abstract
Multiple business scenarios require automated generation of descriptive human-readable text from structured input data. Hence, fact-to-text generation systems have been developed for various downstream tasks like generating soccer reports, weather and financial reports, medical reports, person biographies, etc. Unfortunately, previous work on fact-to-text (F2T) generation has focused primarily on English, mainly due to the high availability of relevant datasets. Only recently was the problem of cross-lingual fact-to-text (XF2T) generation across multiple languages proposed, along with a dataset, XALIGN, covering eight languages. However, there has been no rigorous work on the actual XF2T generation problem. We extend the XALIGN dataset with annotated data for four more languages: Punjabi, Malayalam, Assamese and Oriya. We conduct an extensive study using popular Transformer-based text generation models on our extended multilingual dataset, which we call XALIGNV2. Further, we investigate the performance of different text generation strategies: multiple variations of pretraining, fact-aware embeddings and structure-aware input encoding. Our extensive experiments show that a multilingual mT5 model using fact-aware embeddings with structure-aware input encoding leads to the best results on average across the twelve languages. We make our code, dataset and model publicly available, and hope that this will help advance further research in this critical area.
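To make the idea of structure-aware input encoding concrete, the sketch below linearizes fact triples into a tagged sequence that a seq2seq model such as mT5 could consume. This is a minimal illustration under stated assumptions: the `<S>`/`<R>`/`<O>` delimiter tokens, the language-prefix format, and the `linearize_facts` helper are illustrative choices, not the exact scheme used in the paper.

```python
def linearize_facts(facts, target_lang):
    """Linearize (subject, relation, object) fact triples into one tagged
    input string for a text-to-text model.

    NOTE: the <S>/<R>/<O> markers and the "generate in <lang>:" prefix are
    hypothetical conventions chosen for this sketch; the paper's actual
    encoding may differ.
    """
    parts = [f"generate in {target_lang}:"]
    for subj, rel, obj in facts:
        parts.append(f"<S> {subj} <R> {rel} <O> {obj}")
    return " ".join(parts)


# Example: biography-style facts, targeting Hindi ("hi")
facts = [
    ("Narendra Modi", "position held", "Prime Minister of India"),
    ("Narendra Modi", "date of birth", "17 September 1950"),
]
encoded = linearize_facts(facts, "hi")
print(encoded)
```

In practice, such delimiter tokens would be registered as special tokens in the tokenizer so the model learns the boundary between subjects, relations and objects, rather than splitting them into subwords.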