Controlled Hallucinations: Learning to Generate Faithfully from Noisy Data

Paper title

Controlled Hallucinations: Learning to Generate Faithfully from Noisy Data

Author

Filippova, Katja

Abstract

Neural text generation (data-to-text or text-to-text) demonstrates remarkable performance when training data is abundant, which for many applications is not the case. To collect a large corpus of parallel data, heuristic rules are often used, but they inevitably let noise into the data, such as phrases in the output that cannot be explained by the input. Consequently, models pick up on the noise and may hallucinate, that is, generate fluent but unsupported text. Our contribution is a simple but powerful technique to treat such hallucinations as a controllable aspect of the generated text, without dismissing any input and without modifying the model architecture. On the WikiBio corpus (Lebret et al., 2016), a particularly noisy dataset, we demonstrate the efficacy of the technique in both an automatic and a human evaluation.
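
The abstract does not spell out how hallucination is made controllable. As a rough illustration only, the sketch below assumes one plausible setup: each training pair is scored by the fraction of output tokens that never appear in the input, and that score is bucketed into a control tag prepended to the input, so the data and the model architecture stay untouched. The overlap-based score, the bucket count, the tag format `<hal_k>`, and all function names are hypothetical choices made for this example, not details taken from the abstract.

```python
from typing import List


def unsupported_fraction(source_tokens: List[str], target_tokens: List[str]) -> float:
    """Fraction of target tokens that never occur in the source.

    Assumption: simple case-insensitive token overlap stands in for
    "phrases in the output which cannot be explained by the input".
    """
    if not target_tokens:
        return 0.0
    source_vocab = {tok.lower() for tok in source_tokens}
    unsupported = sum(1 for tok in target_tokens if tok.lower() not in source_vocab)
    return unsupported / len(target_tokens)


def control_tag(fraction: float, num_buckets: int = 4) -> str:
    """Map a fraction in [0, 1] to a hypothetical control tag such as '<hal_0>'."""
    bucket = min(int(fraction * num_buckets), num_buckets - 1)
    return f"<hal_{bucket}>"


def tag_example(source_tokens: List[str], target_tokens: List[str],
                num_buckets: int = 4) -> List[str]:
    """Prepend the control tag to the source; no example is discarded."""
    fraction = unsupported_fraction(source_tokens, target_tokens)
    return [control_tag(fraction, num_buckets)] + list(source_tokens)


if __name__ == "__main__":
    # Toy record, invented for illustration.
    source = ["name", "katja", "born", "1980"]
    target = ["katja", "was", "born", "in", "1980"]
    print(tag_example(source, target))
```

Under this assumed setup, every noisy pair is kept for training with its tag, and at inference time the lowest bucket's tag would be prepended to request output that stays close to the input.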
