Paper Title
Are metrics measuring what they should? An evaluation of image captioning task metrics
Paper Authors
Paper Abstract
Image Captioning is a current research task that aims to describe the content of an image using the objects in the scene and their relationships. To tackle this task, two important research areas converge: artificial vision and natural language processing. In Image Captioning, as in any computational intelligence task, performance metrics are crucial for knowing how well (or badly) a method performs. In recent years, it has been observed that classical metrics based on n-grams are insufficient to capture the semantics and the critical meaning needed to describe the content of an image. To assess how well the set of current and more recent metrics is doing, in this article we present an evaluation of several kinds of Image Captioning metrics and a comparison between them using the well-known MS COCO dataset. The metrics were selected from among the most used in prior work: those based on $n$-grams, such as BLEU, SacreBLEU, METEOR, ROUGE-L, CIDEr, and SPICE, and those based on embeddings, such as BERTScore and CLIPScore. For this, we designed two scenarios: 1) a set of artificially built captions of varying quality, and 2) a comparison of some state-of-the-art Image Captioning methods. Interesting findings emerged while trying to answer the questions: Are the current metrics helping to produce high-quality captions? How do these metrics compare to each other? What are the metrics really measuring?
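For illustration, below is a minimal sketch (not the paper's exact evaluation pipeline) of how a candidate caption can be scored against a human reference with two of the metric families compared in the article: an n-gram-based metric (BLEU, via the sacrebleu package) and an embedding-based metric (BERTScore, via the bert-score package). The example captions and package choices are assumptions made for this illustration only.

import sacrebleu
from bert_score import score as bert_score

# One candidate caption and one parallel stream of human reference captions
# (illustrative examples, not taken from the paper's experiments).
candidates = ["a man riding a bike down the street"]
references = [["a person rides a bicycle along a city street"]]  # one reference stream

# n-gram overlap: corpus-level BLEU rewards exact word-sequence matches.
bleu = sacrebleu.corpus_bleu(candidates, references)
print(f"BLEU: {bleu.score:.2f}")

# Embedding-based: BERTScore compares contextual token embeddings, so
# paraphrases that share meaning but not wording are still rewarded.
P, R, F1 = bert_score(candidates, ["a person rides a bicycle along a city street"], lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")

The contrast between the two scores on paraphrased captions like these is exactly the kind of behavior the paper's artificially built caption scenario is designed to probe.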