Paper Title

IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation Metrics for Indian Languages

Paper Authors

Ananya B. Sai, Vignesh Nagarajan, Tanay Dixit, Raj Dabre, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

Paper Abstract

The rapid growth of machine translation (MT) systems has necessitated comprehensive studies to meta-evaluate evaluation metrics being used, which enables a better selection of metrics that best reflect MT quality. Unfortunately, most of the research focuses on high-resource languages, mainly English, the observations for which may not always apply to other languages. Indian languages, having over a billion speakers, are linguistically different from English, and to date, there has not been a systematic study of evaluating MT systems from English into Indian languages. In this paper, we fill this gap by creating an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems, and use it to establish correlations between annotator scores and scores obtained using existing automatic metrics. Our results show that pre-trained metrics, such as COMET, have the highest correlations with annotator scores. Additionally, we find that the metrics do not adequately capture fluency-based errors in Indian languages, and there is a need to develop metrics focused on Indian languages. We hope that our dataset and analysis will help promote further research in this area.
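As a concrete illustration of the meta-evaluation procedure the abstract describes, below is a minimal Python sketch that correlates human (MQM-derived) segment scores with scores from an automatic metric such as COMET. The score lists are hypothetical placeholders, not data from the paper; segment-level Kendall's tau and Pearson's r are shown as standard correlation choices, without implying these are the exact statistics the authors report.

```
# Minimal sketch: correlate human MQM scores with automatic metric scores
# for the same translated segments. All values below are illustrative.
from scipy.stats import kendalltau, pearsonr

# Hypothetical per-segment quality scores for one language / MT system.
human_mqm_scores = [0.92, 0.75, 0.60, 0.88, 0.41]  # derived from MQM error annotations
metric_scores    = [0.89, 0.70, 0.65, 0.90, 0.38]  # e.g., COMET outputs for the same segments

# Rank correlation (robust to scale differences between score ranges).
tau, tau_p = kendalltau(human_mqm_scores, metric_scores)
# Linear correlation (sensitive to the actual score magnitudes).
r, r_p = pearsonr(human_mqm_scores, metric_scores)

print(f"Kendall tau: {tau:.3f} (p={tau_p:.3f})")
print(f"Pearson r:   {r:.3f} (p={r_p:.3f})")
```

A metric whose scores track annotator judgments well yields correlations near 1; the paper's finding is that pre-trained metrics like COMET correlate best with annotators, while fluency-related errors in Indian languages remain poorly captured.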
