Paper Title
Mask the Correct Tokens: An Embarrassingly Simple Approach for Error Correction
Paper Authors
Paper Abstract
Text error correction aims to correct the errors in text sequences such as those typed by humans or generated by speech recognition models. Previous error correction methods usually take the source (incorrect) sentence as encoder input and generate the target (correct) sentence through the decoder. Since the error rate of the incorrect sentence is usually low (e.g., 10\%), the correction model can only learn to correct on limited error tokens but trivially copy on most tokens (correct tokens), which harms the effective training of error correction. In this paper, we argue that the correct tokens should be better utilized to facilitate effective training and then propose a simple yet effective masking strategy to achieve this goal. Specifically, we randomly mask out a part of the correct tokens in the source sentence and let the model learn to not only correct the original error tokens but also predict the masked tokens based on their context information. Our method enjoys several advantages: 1) it alleviates trivial copy; 2) it leverages effective training signals from correct tokens; 3) it is a plug-and-play module and can be applied to different models and tasks. Experiments on spelling error correction and speech recognition error correction on Mandarin datasets and grammar error correction on English datasets with both autoregressive and non-autoregressive generation models show that our method improves the correction accuracy consistently.
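Below is a minimal sketch of the masking idea described in the abstract, assuming a one-to-one token alignment between the source (incorrect) and target (correct) sentences (as in spelling correction) and an illustrative mask probability `p_mask`; the function name, mask symbol, and hyperparameter are hypothetical, not taken from the paper.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder mask symbol; the actual token is model-specific


def mask_correct_tokens(src_tokens, tgt_tokens, p_mask=0.2, rng=random):
    """Randomly replace a fraction of the *correct* source tokens with MASK_TOKEN.

    Assumes src_tokens and tgt_tokens are aligned position by position.
    Error tokens (src != tgt) are left intact so the model still learns to
    correct them; a portion of the correct tokens (src == tgt) is masked so
    the model must predict them from context instead of trivially copying.
    """
    masked_src = []
    for s, t in zip(src_tokens, tgt_tokens):
        if s == t and rng.random() < p_mask:
            masked_src.append(MASK_TOKEN)   # hide a correct token
        else:
            masked_src.append(s)            # keep error tokens and unmasked correct tokens
    return masked_src


# Example: only the correct tokens are candidates for masking.
src = ["i", "hvae", "a", "dream", "today"]
tgt = ["i", "have", "a", "dream", "today"]
print(mask_correct_tokens(src, tgt, p_mask=0.5, rng=random.Random(0)))
```

In training, the decoder target would remain the full correct sentence, so the masked positions supply additional non-trivial prediction signal on top of the original error corrections.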