Paper Title

NAT: Noise-Aware Training for Robust Neural Sequence Labeling

Paper Authors

Marcin Namysl, Sven Behnke, Joachim Köhler

Paper Abstract

Sequence labeling systems should perform reliably not only under ideal conditions but also with corrupted inputs, as these systems often process user-generated text or follow an error-prone upstream component. To this end, we formulate the noisy sequence labeling problem, where the input may undergo an unknown noising process, and propose two Noise-Aware Training (NAT) objectives that improve the robustness of sequence labeling performed on perturbed input: our data augmentation method trains a neural model using a mixture of clean and noisy samples, whereas our stability training algorithm encourages the model to create a noise-invariant latent representation. We employ a vanilla noise model at training time. For evaluation, we use both the original data and its variants perturbed with real OCR errors and misspellings. Extensive experiments on English and German named entity recognition benchmarks confirm that NAT consistently improves the robustness of popular sequence labeling models while preserving accuracy on the original input. We make our code and data publicly available for the research community.
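To make the two training objectives concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation: `vanilla_noise` is an assumed character-level noise model, and `nat_losses` illustrates the data augmentation and stability objectives for a hypothetical tagger `model` that returns per-token logits together with a latent representation. The uniform edit operations, the 0.5 mixing weight, and the MSE similarity term are illustrative assumptions; the paper's exact formulation may differ.

```python
import random
import string

import torch.nn.functional as F


def vanilla_noise(text: str, p: float = 0.1) -> str:
    """Character-level noise: each character is independently replaced,
    deleted, or followed by an inserted character with probability p.
    The operations and their uniform choice are illustrative assumptions."""
    out = []
    for ch in text:
        if random.random() < p:
            op = random.choice(("replace", "delete", "insert"))
            if op == "replace":
                out.append(random.choice(string.ascii_lowercase))
            elif op == "insert":
                out.extend((ch, random.choice(string.ascii_lowercase)))
            # "delete": drop the character entirely
        else:
            out.append(ch)
    return "".join(out)


def nat_losses(model, clean_batch, noisy_batch, gold_tags, alpha=0.5):
    """Sketch of the two NAT objectives for a hypothetical tagger `model`
    mapping a batch to (logits, latent), where logits has shape
    (batch, seq_len, num_tags) and gold_tags has shape (batch, seq_len)."""
    clean_logits, clean_latent = model(clean_batch)
    noisy_logits, noisy_latent = model(noisy_batch)

    # Per-token tagging loss on the clean and on the perturbed view.
    loss_clean = F.cross_entropy(clean_logits.flatten(0, 1), gold_tags.flatten())
    loss_noisy = F.cross_entropy(noisy_logits.flatten(0, 1), gold_tags.flatten())

    # Data augmentation objective: supervise a mixture of both views.
    loss_augment = 0.5 * (loss_clean + loss_noisy)

    # Stability objective: clean supervision plus a noise-invariance term
    # pulling the noisy latent representation toward the clean one
    # (MSE over latents is an assumption standing in for the paper's
    # similarity loss).
    loss_stability = loss_clean + alpha * F.mse_loss(noisy_latent, clean_latent)

    return loss_augment, loss_stability
```

In training, one would optimize either `loss_augment` or `loss_stability`, generating `noisy_batch` on the fly with the noise model at each step, which matches the abstract's description of employing a vanilla noise model at training time.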
