Paper Title

Data Weighted Training Strategies for Grammatical Error Correction

Paper Authors

Jared Lichtarge, Chris Alberti, Shankar Kumar

Paper Abstract

Recent progress in the task of Grammatical Error Correction (GEC) has been driven by addressing data sparsity, both through new methods for generating large and noisy pretraining data and through the publication of small and higher-quality finetuning data in the BEA-2019 shared task. Building upon recent work in Neural Machine Translation (NMT), we make use of both kinds of data by deriving example-level scores on our large pretraining data based on a smaller, higher-quality dataset. In this work, we perform an empirical study to discover how to best incorporate delta-log-perplexity, a type of example scoring, into a training schedule for GEC. In doing so, we perform experiments that shed light on the function and applicability of delta-log-perplexity. Models trained on scored data achieve state-of-the-art results on common GEC test sets.
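
The abstract does not spell out how delta-log-perplexity is computed; in related NMT data-selection work it is the difference between an example's log-perplexity under a base model and under a model fine-tuned on the smaller, higher-quality dataset. The Python sketch below illustrates example-level scoring under that assumption; the function names and per-token log-probabilities are hypothetical and are not the authors' code.

    from typing import List

    def log_perplexity(token_logprobs: List[float]) -> float:
        """Mean negative log-probability of an example's target tokens."""
        return -sum(token_logprobs) / len(token_logprobs)

    def delta_log_perplexity(base_logprobs: List[float],
                             finetuned_logprobs: List[float]) -> float:
        """Score one pretraining example: positive when the model fine-tuned on
        high-quality data assigns it lower perplexity than the base model."""
        return log_perplexity(base_logprobs) - log_perplexity(finetuned_logprobs)

    # Toy per-token log-probabilities for a single example (illustrative values).
    base = [-2.3, -1.7, -3.1, -0.9]        # scored by the base (pretrained) model
    finetuned = [-1.9, -1.2, -2.8, -0.7]   # scored by the fine-tuned model
    print(f"delta-log-perplexity = {delta_log_perplexity(base, finetuned):.3f}")

Under this reading, examples with higher scores would be kept or up-weighted when constructing the pretraining schedule.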
