Paper Title

MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation

Authors

Dominik Macháček, Ondřej Bojar, Raj Dabre

Abstract

There have been several meta-evaluation studies on the correlation between human ratings and offline machine translation (MT) evaluation metrics such as BLEU, chrF2, BERTScore, and COMET. These metrics have been used to evaluate simultaneous speech translation (SST), but their correlations with human ratings of SST, which have recently been collected as Continuous Ratings (CR), are unclear. In this paper, we leverage the evaluations of candidate systems submitted to the English-German SST task at IWSLT 2022 and conduct an extensive correlation analysis of CR and the aforementioned metrics. Our study reveals that the offline metrics correlate well with CR and can be reliably used for evaluating machine translation in simultaneous mode, with some limitations on the test set size. We conclude that, given the current quality levels of SST, these metrics can be used as proxies for CR, alleviating the need for large-scale human evaluation. Additionally, we observe that the metrics' correlations with translation as a reference are significantly higher than with simultaneous interpreting, and thus we recommend the former for reliable evaluation.
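
To make the methodology concrete, below is a minimal sketch of the kind of system-level correlation analysis the abstract describes: each SST system's output is scored with offline metrics, and those scores are correlated with human Continuous Ratings. The sacrebleu (BLEU, CHRF) and scipy.stats calls are real APIs; the system names, example sentences, and CR values are hypothetical stand-ins, and BERTScore/COMET are omitted for brevity.

# Hypothetical sketch: correlate offline MT metric scores with human CR.
from sacrebleu.metrics import BLEU, CHRF      # real sacrebleu APIs
from scipy.stats import pearsonr, spearmanr   # real scipy.stats APIs

# Hypothetical per-system hypotheses and averaged human Continuous
# Ratings (names and values are illustrative, not the paper's data).
systems = {
    "system_A": (["Der Hund bellt sehr laut."], 4.1),
    "system_B": (["Ein Hund bellt laut."], 3.4),
    "system_C": (["Die Katze schläft leise."], 1.9),
}
references = [["Der Hund bellt sehr laut."]]  # one reference stream

bleu = BLEU()
chrf = CHRF()  # sacrebleu's default beta=2 yields chrF2

bleu_scores, chrf_scores, human_cr = [], [], []
for name, (hyps, cr) in systems.items():
    bleu_scores.append(bleu.corpus_score(hyps, references).score)
    chrf_scores.append(chrf.corpus_score(hyps, references).score)
    human_cr.append(cr)

# System-level correlations, as in the paper's meta-evaluation setup.
for label, scores in [("BLEU", bleu_scores), ("chrF2", chrf_scores)]:
    r, _ = pearsonr(scores, human_cr)
    rho, _ = spearmanr(scores, human_cr)
    print(f"{label}: Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")

In the paper itself, the analysis runs over full IWSLT 2022 English-German test-set outputs and compares reference conditions (human translation vs. simultaneous interpreting); the sketch above only shows the mechanics on toy data.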
