Paper Title

Parallel Corpus Filtering via Pre-trained Language Models

Paper Authors

Boliang Zhang, Ajay Nagesh, Kevin Knight

Paper Abstract

Web-crawled data provides a good source of parallel corpora for training machine translation models. It is automatically obtained, but extremely noisy, and recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods. In this paper, we propose a novel approach to filter out noisy sentence pairs from web-crawled corpora via pre-trained language models. We measure sentence parallelism by leveraging the multilingual capability of BERT and use the Generative Pre-training (GPT) language model as a domain filter to balance data domains. We evaluate the proposed method on the WMT 2018 Parallel Corpus Filtering shared task, and on our own web-crawled Japanese-Chinese parallel corpus. Our method significantly outperforms baselines and achieves a new state-of-the-art. In an unsupervised setting, our method achieves comparable performance to the top-1 supervised method. We also evaluate on a web-crawled Japanese-Chinese parallel corpus that we make publicly available.
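To make the described approach concrete, below is a minimal sketch (not the paper's actual scoring functions) built on the HuggingFace transformers library: cosine similarity of mean-pooled multilingual BERT embeddings stands in for the sentence-parallelism measure, and GPT-2 perplexity on the target side stands in for the domain filter. The model names, the pooling scheme, and the thresholds are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch of BERT-based parallelism scoring plus a GPT-based domain filter.
# Assumes the HuggingFace `transformers` library; model choices and thresholds are
# illustrative, not the paper's exact method.
import torch
from transformers import AutoTokenizer, AutoModel, GPT2LMHeadModel, GPT2TokenizerFast

bert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()
gpt_tok = GPT2TokenizerFast.from_pretrained("gpt2")   # English LM as an example domain filter
gpt = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the last hidden layer of multilingual BERT."""
    enc = bert_tok(sentence, return_tensors="pt", truncation=True, max_length=128)
    hidden = bert(**enc).last_hidden_state           # (1, seq_len, hidden)
    mask = enc["attention_mask"].unsqueeze(-1)       # (1, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)      # (1, hidden)

@torch.no_grad()
def parallelism_score(src: str, tgt: str) -> float:
    """Cosine similarity of sentence embeddings as a rough parallelism proxy."""
    return torch.cosine_similarity(embed(src), embed(tgt)).item()

@torch.no_grad()
def domain_score(sentence: str) -> float:
    """Negative GPT-2 perplexity: higher means more fluent / in-domain target text."""
    enc = gpt_tok(sentence, return_tensors="pt", truncation=True, max_length=128)
    loss = gpt(**enc, labels=enc["input_ids"]).loss
    return -torch.exp(loss).item()

def keep(src: str, tgt: str, sim_threshold: float = 0.7) -> bool:
    # Hypothetical filtering rule combining both scores; thresholds are arbitrary.
    return parallelism_score(src, tgt) >= sim_threshold and domain_score(tgt) > -500.0
```

In a realistic pipeline, these scores would be computed over the entire web-crawled corpus and used to rank and select sentence pairs (as in the WMT shared-task setting), rather than applying a fixed per-pair threshold as in the toy `keep` function above.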
