Paper Title

Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses

Authors

Simon Flachs, Ophélie Lacroix, Helen Yannakoudakis, Marek Rei, Anders Søgaard

Abstract

Evaluation of grammatical error correction (GEC) systems has primarily focused on essays written by non-native learners of English, which, however, represent only part of the full spectrum of GEC applications. We aim to broaden the target domain of GEC and release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency. Website data is a common and important domain that contains far fewer grammatical errors than learner essays, which we show presents a challenge to state-of-the-art GEC systems. We demonstrate that a factor behind this is the inability of systems to rely on a strong internal language model in low error density domains. We hope this work will facilitate the development of open-domain GEC models that generalize to different topics and genres.
