Paper Title
What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics
Paper Authors
Paper Abstract
While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most visual description methods are known to capture and exploit patterns in the training data, leading to increases in evaluation metrics, but what are those patterns? In this work, we examine several popular visual description datasets, and capture, analyze, and understand the dataset-specific linguistic patterns that models exploit but that do not generalize to new domains. At the token level, sample level, and dataset level, we find that caption diversity is a major driving factor behind the generation of generic and uninformative captions. We further show that state-of-the-art models even outperform held-out ground truth captions on modern metrics, and that this effect is an artifact of linguistic diversity in datasets. Understanding this linguistic diversity is key to building strong captioning models. We recommend several methods and approaches for maintaining diversity in the collection of new data, and for dealing with the consequences of limited diversity when using current models and metrics.