Paper Title

Dual Attention on Pyramid Feature Maps for Image Captioning

Paper Authors

Litao Yu, Jian Zhang, Qiang Wu

Paper Abstract

Generating natural sentences from images is a fundamental learning task for visual-semantic understanding in multimedia. In this paper, we propose to apply dual attention on pyramid image feature maps to fully explore the visual-semantic correlations and improve the quality of generated sentences. Specifically, with full consideration of the contextual information provided by the hidden state of the RNN controller, the pyramid attention can better localize the visually indicative and semantically consistent regions in images. On the other hand, the contextual information can help re-calibrate the importance of feature components by learning the channel-wise dependencies, improving the discriminative power of visual features for better content description. We conducted comprehensive experiments on three well-known datasets: Flickr8K, Flickr30K and MS COCO, achieving impressive results in generating descriptive and smooth natural sentences from images. Using either convolutional visual features or the more informative bottom-up attention features, our composite captioning model achieves very promising performance in a single-model mode. The proposed pyramid attention and dual attention methods are highly modular and can be inserted into various image captioning modules to further improve performance.
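
The abstract describes two complementary attention branches: a spatial ("pyramid") attention that scores image regions conditioned on the RNN controller's hidden state, and a channel-wise re-calibration driven by the same contextual information. Below is a minimal PyTorch-style sketch of one such dual-attention step, assuming the pyramid feature maps have already been flattened into a set of region features; the layer sizes, the module name `DualAttention`, and the way the two branches are combined are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only: dimensions, names and the fusion of the two
# attention branches are assumptions, not the published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        # Spatial attention: scores each region of the (flattened) pyramid
        # feature maps conditioned on the RNN hidden state.
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.attn_score = nn.Linear(attn_dim, 1)
        # Channel attention: re-calibrates feature channels from the same
        # context, in the spirit of squeeze-and-excitation gating.
        self.channel_gate = nn.Sequential(
            nn.Linear(feat_dim + hidden_dim, feat_dim // 16),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // 16, feat_dim),
            nn.Sigmoid(),
        )

    def forward(self, feats, hidden):
        # feats:  (batch, num_regions, feat_dim) -- flattened pyramid feature maps
        # hidden: (batch, hidden_dim)            -- RNN hidden state at this step
        # Spatial attention over regions, guided by the hidden state.
        e = torch.tanh(self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1))
        alpha = F.softmax(self.attn_score(e), dim=1)      # (batch, num_regions, 1)
        attended = (alpha * feats).sum(dim=1)             # (batch, feat_dim)
        # Channel-wise re-calibration using the same contextual information.
        gate = self.channel_gate(torch.cat([attended, hidden], dim=-1))
        return attended * gate                            # context vector for the decoder
```

In a full captioning model, the returned context vector would be fed to the language decoder at each time step, and the region features would be gathered from several scales of the feature pyramid rather than a single convolutional layer.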
