Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction

Paper Title

Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction

Authors

Ma, Shirong; Li, Yinghui; Sun, Rongyi; Zhou, Qingyu; Huang, Shulin; Zhang, Ding; Li, Yangning; Liu, Ruiyang; Li, Zhongli; Cao, Yunbo; Zheng, Haitao; Shen, Ying

Abstract

Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in daily life. Recently, many data-driven approaches have been proposed to advance CGEC research. However, the CGEC field has two major limitations. First, the lack of high-quality annotated training corpora prevents the performance of existing CGEC models from improving significantly. Second, the grammatical errors in widely used test sets are not made by native Chinese speakers, which leaves a significant gap between CGEC models and real-world applications. In this paper, we propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. Additionally, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive experiments and detailed analyses not only demonstrate that the training data constructed by our method effectively improves the performance of CGEC models, but also show that our benchmark is an excellent resource for further development of the CGEC field.
