Paper Title

PETCI: A Parallel English Translation Dataset of Chinese Idioms

Author

Tang, Kenan

Abstract

Idioms are an important language phenomenon in Chinese, but idiom translation is notoriously hard. Current machine translation models perform poorly on idiom translation, and idioms are sparse in many translation datasets. We present PETCI, a parallel English translation dataset of Chinese idioms, aiming to improve idiom translation by both humans and machines. The dataset is built by leveraging human and machine effort. Baseline generation models show unsatisfactory abilities to improve translation, but structure-aware classification models show good performance on distinguishing good translations. Furthermore, the size of PETCI can be easily increased without expertise. Overall, PETCI can be helpful to language learners and machine translation systems.
