Paper Title
Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons Learned
Paper Authors
Paper Abstract
This paper focuses on enhancing the captions generated by image-caption generation systems. We propose an approach that improves caption generation by selecting the output most closely related to the image, rather than the most likely output produced by the model. Our model re-ranks the beam-search output of the language generator from a visual-context perspective. We employ visual-semantic measures at both the word and sentence levels to match the proper caption to the related information in the image. The proposed approach can be applied to any captioning system as a post-processing method.
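As a rough illustration of the post-processing step the abstract describes, the sketch below re-ranks beam-search candidates by combining each caption's model probability with a visual-semantic similarity score. Everything here is hypothetical, not the paper's actual implementation: `rerank_candidates`, the weight `alpha`, and the toy bag-of-words `embed_caption` (which stands in for a learned sentence encoder and an image-side visual-context encoder).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors, guarded against zero norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def embed_caption(caption, dim=64):
    """Toy bag-of-words embedding: hash each word into a fixed-size vector.
    A real system would use a learned sentence encoder instead."""
    vec = np.zeros(dim)
    for word in caption.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def rerank_candidates(candidates, visual_vec, alpha=0.5):
    """Re-rank beam-search captions by a convex combination of the model's
    probability and visual-semantic similarity to the image context.

    candidates: list of (caption, log_prob) pairs from beam search.
    visual_vec: embedding of the image's visual context (objects, scene).
    alpha:      weight between language-model confidence and similarity.
    """
    scored = []
    for caption, log_prob in candidates:
        sim = cosine_similarity(embed_caption(caption), visual_vec)
        score = alpha * np.exp(log_prob) + (1.0 - alpha) * sim
        scored.append((score, caption))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# Usage: the most probable beam is not necessarily the most visually grounded one.
beams = [("a man riding a horse", -1.2), ("a man riding a motorcycle", -1.5)]
visual_context = embed_caption("motorcycle road helmet")  # stand-in for an image encoder
for score, caption in rerank_candidates(beams, visual_context):
    print(f"{score:.3f}  {caption}")
```

In this toy run, the less probable "motorcycle" caption overtakes the "horse" caption because it matches the visual context, which is the behavior the abstract attributes to the proposed re-ranking.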