Paper Title

A Survey of Evaluation Metrics Used for NLG Systems

Paper Authors

Sai, Ananya B., Mohankumar, Akash Kumar, Khapra, Mitesh M.

Paper Abstract

The success of Deep Learning has created a surge in interest in a wide range of Natural Language Generation (NLG) tasks. Deep Learning has not only pushed the state of the art in several existing NLG tasks but has also enabled researchers to explore various newer NLG tasks such as image captioning. Such rapid progress in NLG has necessitated the development of accurate automatic evaluation metrics that would allow us to track the progress in the field of NLG. However, unlike classification tasks, automatically evaluating NLG systems is in itself a huge challenge. Several works have shown that early heuristic-based metrics such as BLEU and ROUGE are inadequate for capturing the nuances in the different NLG tasks. The expanding number of NLG models and the shortcomings of the current metrics have led to a rapid surge in the number of evaluation metrics proposed since 2014. Moreover, various evaluation metrics have shifted from using pre-determined heuristic-based formulae to trained transformer models. This rapid change in a relatively short time has led to the need for a survey of the existing NLG metrics to help existing and new researchers quickly come up to speed with the developments that have happened in NLG evaluation in the last few years. Through this survey, we first wish to highlight the challenges and difficulties in automatically evaluating NLG systems. Then, we provide a coherent taxonomy of the evaluation metrics to organize the existing metrics and to better understand the developments in the field. We also describe the different metrics in detail and highlight their key contributions. Later, we discuss the main shortcomings identified in the existing metrics and describe the methodology used to evaluate evaluation metrics. Finally, we discuss our suggestions and recommendations on the next steps forward to improve the automatic evaluation metrics.
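As a quick illustration of the heuristic-based metrics that the survey contrasts with newer trained metrics, the snippet below scores a single generated sentence against one reference using BLEU and ROUGE-L. This is a minimal sketch for orientation only, not code from the paper; it assumes the third-party `nltk` and `rouge-score` packages are available.

```python
# Minimal sketch (not from the paper): scoring one generated sentence
# against a single reference with the heuristic metrics BLEU and ROUGE-L.
# Assumes the third-party packages `nltk` and `rouge-score` are installed.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "a dog is running across the green field"
hypothesis = "a dog runs across a field"

# BLEU: n-gram precision combined with a brevity penalty; smoothing
# avoids zero scores when higher-order n-grams have no overlap.
bleu = sentence_bleu(
    [reference.split()],
    hypothesis.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: overlap based on the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, hypothesis)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}, ROUGE-L F1: {rouge_l:.3f}")
```

Surface-overlap scores of this kind are exactly the setting in which, as the abstract notes, BLEU and ROUGE fail to capture task-specific nuances, which motivated the later shift toward embedding- and transformer-based evaluation metrics discussed in the survey.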
