Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network

Paper title

Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network

Authors

Jiayi Ji, Yunpeng Luo, Xiaoshuai Sun, Fuhai Chen, Gen Luo, Yongjian Wu, Yue Gao, Rongrong Ji

Abstract

Transformer-based architectures have shown great success in image captioning, where object regions are encoded and then attended into vectorial representations that guide caption decoding. However, such vectorial representations contain only region-level information and do not consider the global information reflecting the entire image, which limits the capability for complex multi-modal reasoning in image captioning. In this paper, we introduce a Global Enhanced Transformer (termed GET) to enable the extraction of a more comprehensive global representation and then adaptively guide the decoder to generate high-quality captions. In GET, a Global Enhanced Encoder is designed for the embedding of the global feature, and a Global Adaptive Decoder is designed for the guidance of caption generation. The former models intra- and inter-layer global representation by taking advantage of the proposed Global Enhanced Attention and a layer-wise fusion module. The latter contains a Global Adaptive Controller that adaptively fuses the global information into the decoder to guide caption generation. Extensive experiments on the MS COCO dataset demonstrate the superiority of our GET over many state-of-the-art methods.
