Paper Title
SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes
Paper Authors
Paper Abstract
Is it possible to train a general metric for evaluating text generation quality without human-annotated ratings? Existing learned metrics either perform unsatisfactorily across text generation tasks or require human ratings for training on specific tasks. In this paper, we propose SESCORE2, a self-supervised approach for training a model-based metric for text generation evaluation. The key concept is to synthesize realistic model mistakes by perturbing sentences retrieved from a corpus. The primary advantage of SESCORE2 is its ease of extension to many other languages while providing reliable severity estimation. We evaluate SESCORE2 and previous methods on four text generation tasks across three languages. SESCORE2 outperforms the unsupervised metric PRISM on four text generation evaluation benchmarks, with a Kendall correlation improvement of 0.078. Surprisingly, SESCORE2 even outperforms the supervised metrics BLEURT and COMET on multiple text generation tasks. The code and data are available at https://github.com/xu1998hz/SEScore2.
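The training recipe the abstract describes, perturbing sentences drawn from a corpus to create graded negative examples with severity labels, can be illustrated with a minimal sketch. Everything below is a hypothetical assumption for illustration only: the edit operations, severity weights, and function names are invented here and are not SESCORE2's actual implementation or API.

```python
# Minimal sketch of self-supervised data synthesis for a learned metric:
# perturb corpus sentences so each edit mimics a plausible model mistake
# and carries an assumed severity. All names and weights are illustrative
# assumptions, not SESCORE2's real pipeline.
import random

def perturb(tokens, rng):
    """Apply one random edit (delete / duplicate / swap) to a token list.

    Returns the perturbed tokens and an assumed severity weight
    (minor edits -> small penalty, content-destroying edits -> large one).
    """
    tokens = list(tokens)
    op = rng.choice(["delete", "duplicate", "swap"])
    i = rng.randrange(len(tokens))
    if op == "delete" and len(tokens) > 1:
        del tokens[i]
        severity = 5.0  # assume a dropped content token is a major error
    elif op == "duplicate":
        tokens.insert(i, tokens[i])
        severity = 1.0  # assume repetition is a minor error
    else:
        # Swap two adjacent tokens (a minor fluency error).
        j = min(i + 1, len(tokens) - 1)
        tokens[i], tokens[j] = tokens[j], tokens[i]
        severity = 1.0
    return tokens, severity

def synthesize_pairs(corpus, n_edits=2, seed=0):
    """Yield (reference, perturbed, pseudo_score) training triples."""
    rng = random.Random(seed)
    for sentence in corpus:
        tokens = sentence.split()
        total_penalty = 0.0
        for _ in range(n_edits):
            tokens, severity = perturb(tokens, rng)
            total_penalty += severity
        # Pseudo label: the more severe the accumulated edits, the lower
        # the score the metric should learn to assign to the perturbation.
        yield sentence, " ".join(tokens), -total_penalty

corpus = ["the cat sat on the mat", "machine translation is hard"]
for ref, hyp, score in synthesize_pairs(corpus):
    print(f"{score:+.1f}  {ref!r} -> {hyp!r}")
```

A metric trained on such triples is then judged, as in the evaluation above, by its Kendall correlation with human ratings; given paired lists of metric scores and human scores, that statistic can be computed with scipy.stats.kendalltau(metric_scores, human_ratings).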