Paper Title
Dialect-robust Evaluation of Generated Text
Paper Authors
Paper Abstract
Evaluation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. However, there currently exists no way to quantify how metrics respond to changes in the dialect of a generated utterance. We therefore formalize dialect robustness and dialect awareness as goals for NLG evaluation metrics, and introduce a suite of methods and corresponding statistical tests that can be used to assess metrics in light of these two goals. Applying the suite to current state-of-the-art metrics, we demonstrate that they are not dialect-robust: semantic perturbations frequently lead to smaller decreases in a metric score than the introduction of dialect features does. As a first step toward overcoming this limitation, we propose a training schema, NANO, which introduces regional and language information into a metric's pretraining process. We demonstrate that NANO provides a size-efficient way for models to improve dialect robustness while simultaneously improving their performance on the standard metric benchmark.
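The assessment the abstract outlines can be pictured as a paired comparison: for each reference, score a dialect rewrite (same meaning, dialect features added) and a semantic perturbation (meaning changed), then test whether the metric penalizes the meaning change more. Below is a minimal Python sketch of this idea; the example triples and the stand-in surface-overlap `score` function are illustrative assumptions, not the paper's actual test suite, data, or metrics.

```python
# Minimal sketch of a dialect-robustness check via a paired one-sided test.
from scipy.stats import wilcoxon

def score(reference: str, hypothesis: str) -> float:
    """Stand-in metric (unigram Jaccard overlap); swap in the metric under test."""
    ref = set(reference.lower().split())
    hyp = set(hypothesis.lower().split())
    return len(ref & hyp) / max(len(ref | hyp), 1)

# Each triple: (reference, dialect rewrite with the same meaning,
# semantic perturbation that changes the meaning). Illustrative only.
triples = [
    ("She isn't going to the store.",
     "She ain't going to the store.",
     "She is driving to the office."),
    ("They were playing outside yesterday.",
     "They was playing outside yesterday.",
     "They were sleeping inside yesterday."),
    ("He has finished his homework already.",
     "He done finished his homework already.",
     "He has started his vacation already."),
]

dialect_scores = [score(ref, dia) for ref, dia, _ in triples]
semantic_scores = [score(ref, sem) for ref, _, sem in triples]

# A dialect-robust metric should penalize meaning changes more than dialect
# features, i.e. dialect rewrites should score higher than perturbations.
stat, p = wilcoxon(dialect_scores, semantic_scores, alternative="greater")
print(f"Wilcoxon statistic={stat:.1f}, one-sided p={p:.3f}")
```

In practice one would run this over many paired examples per dialect; a metric for which the semantic perturbations score as high as, or higher than, the dialect rewrites is not dialect-robust in the abstract's sense.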
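The abstract describes NANO only at a high level: regional and language information is introduced during the metric's pretraining. One minimal way to picture this, assuming a hypothetical tag format and data fields that the abstract does not specify, is to prefix each pretraining example with language and region control tags:

```python
# Hypothetical illustration of injecting language/region information into
# pretraining inputs. The tag format and field names are assumptions; the
# abstract does not specify NANO's exact mechanism.
def add_dialect_tags(example: dict) -> str:
    """Prefix a pretraining input with language and region control tags."""
    language = example.get("language", "unknown")  # e.g. "en"
    region = example.get("region", "unknown")      # e.g. "NG" for Nigeria
    return f"<lang={language}> <region={region}> {example['text']}"

print(add_dialect_tags(
    {"language": "en", "region": "NG", "text": "How far? I dey come."}
))
# -> <lang=en> <region=NG> How far? I dey come.
```

Conditioning pretraining on such signals is one plausible reading of how a metric could learn to distinguish dialect variation from semantic change without any change to model size.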