Paper Title

NEJM-enzh: A Parallel Corpus for English-Chinese Translation in the Biomedical Domain

Authors

Boxiang Liu, Liang Huang

Abstract

Machine translation requires large amounts of parallel text. While such datasets are abundant in domains such as newswire, they are less accessible in the biomedical domain. Chinese and English are two of the most widely spoken languages, yet to our knowledge a parallel corpus in the biomedical domain does not exist for this language pair. In this study, we develop an effective pipeline to acquire and process an English-Chinese parallel corpus, consisting of about 100,000 sentence pairs and 3,000,000 tokens on each side, from the New England Journal of Medicine (NEJM). We show that training on out-of-domain data and fine-tuning with as few as 4,000 NEJM sentence pairs improve translation quality by 25.3 (13.4) BLEU for en$\to$zh (zh$\to$en) directions. Translation quality continues to improve at a slower pace on larger in-domain datasets, with an increase of 33.0 (24.3) BLEU for en$\to$zh (zh$\to$en) directions on the full dataset.
