Paper Title

OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval

Paper Authors

Tong Niu, Kazuma Hashimoto, Yingbo Zhou, Caiming Xiong

Paper Abstract

Aligning parallel sentences in multilingual corpora is essential to curating data for downstream applications such as Machine Translation. In this work, we present OneAligner, an alignment model specially designed for sentence retrieval tasks. This model is able to train on only one language pair and transfers, in a cross-lingual fashion, to low-resource language pairs with negligible degradation in performance. When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result on the Tatoeba dataset, outperforming an equally-sized previous model by 8.0 points in accuracy while using less than 0.6% of their parallel data. When finetuned on a single rich-resource language pair, be it English-centered or not, our model is able to match the performance of the ones finetuned on all language pairs under the same data budget with less than 2.0 points decrease in accuracy. Furthermore, with the same setup, scaling up the number of rich-resource language pairs monotonically improves the performance, reaching a minimum of 0.4 points discrepancy in accuracy, making it less mandatory to collect any low-resource parallel data. Finally, we conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size, up to a certain size threshold, rather than on what language pairs are used for training or evaluation.
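The sentence retrieval task evaluated here (e.g., on Tatoeba-style test sets) can be framed as nearest-neighbor search over sentence embeddings: each source sentence is encoded, the target sentence with the highest cosine similarity is taken as its match, and accuracy is the fraction of sentences whose nearest neighbor is the gold translation. The sketch below is a minimal, illustrative version of that evaluation; it assumes precomputed embeddings from some multilingual sentence encoder and is not the authors' implementation.

```python
import numpy as np

def retrieval_accuracy(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Cosine-similarity nearest-neighbor retrieval accuracy.

    src_emb, tgt_emb: (n, d) arrays where row i of each matrix is a
    gold-aligned sentence pair (e.g., a Tatoeba-style test set encoded
    by any multilingual sentence encoder). Returns the fraction of
    source sentences whose nearest target neighbor is the gold match.
    """
    # L2-normalize so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                 # (n, n) similarity matrix
    predictions = sims.argmax(axis=1)  # nearest target for each source
    return float((predictions == np.arange(len(src))).mean())

if __name__ == "__main__":
    # Toy example: random vectors stand in for real encoder output.
    rng = np.random.default_rng(0)
    src = rng.normal(size=(100, 768))
    tgt = src + 0.05 * rng.normal(size=(100, 768))  # noisy "translations"
    print(f"retrieval accuracy: {retrieval_accuracy(src, tgt):.3f}")
```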
