Paper Title

TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation

Authors

Nagoudi, El Moatez Billah; Elmadany, AbdelRahim; Abdul-Mageed, Muhammad

Abstract

We present TURJUMAN, a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). TURJUMAN exploits the recently-introduced text-to-text Transformer AraT5 model, endowing it with a powerful ability to decode into Arabic. The toolkit offers the possibility of employing a number of diverse decoding methods, making it suited for acquiring paraphrases for the MSA translations as an added value. To train TURJUMAN, we sample from publicly available parallel data employing a simple semantic similarity method to ensure data quality. This allows us to prepare and release AraOPUS-20, a new machine translation benchmark. We publicly release our translation toolkit (TURJUMAN) as well as our benchmark dataset (AraOPUS-20).
