Paper Title

Image Captioning via Compact Bidirectional Architecture

Paper Authors

Zijie Song, Yuanen Zhou, Zhenzhen Hu, Daqing Liu, Huixia Ben, Richang Hong, Meng Wang

Paper Abstract

Most current image captioning models generate captions from left to right. This unidirectional property means they can only leverage past context, not future context. Although refinement-based models can exploit both past and future context by generating a new caption in the second stage based on pre-retrieved or pre-generated captions from the first stage, the decoder of these models generally consists of two networks (i.e., a retriever or captioner in the first stage and a captioner in the second stage), which can only be executed sequentially. In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context implicitly and explicitly while its decoder can be executed in parallel. Specifically, it is implemented by tightly coupling left-to-right (L2R) and right-to-left (R2L) flows into a single compact model, which serves as a regularization for implicitly exploiting bidirectional context and optionally allows explicit interaction between the two flows, while the final caption is chosen from either the L2R or R2L flow in a sentence-level ensemble manner. We conduct extensive ablation studies on the MSCOCO benchmark and find that the compact bidirectional architecture and the sentence-level ensemble play more important roles than the explicit interaction mechanism. By combining seamlessly with word-level ensemble, the effect of the sentence-level ensemble is further enlarged. We further extend the conventional one-flow self-critical training to a two-flow version under this architecture and achieve new state-of-the-art results in comparison with non-vision-language-pretraining models. Finally, we verify the generality of this compact bidirectional architecture by extending it to an LSTM backbone. Source code is available at https://github.com/YuanEZhou/cbtic.
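The sentence-level ensemble highlighted in the abstract, where the final caption is chosen from either the L2R or the R2L flow, can be sketched as follows. This is a minimal Python illustration assuming each flow has already decoded one candidate caption together with its per-token log-probabilities; the function and variable names are hypothetical and not taken from the released code at https://github.com/YuanEZhou/cbtic.

```python
# Minimal sketch of a sentence-level ensemble between the two decoding flows.
# Names and shapes here are illustrative assumptions, not the authors' API.
from typing import List, Tuple


def sentence_level_ensemble(
    l2r_caption: List[str], l2r_token_logprobs: List[float],
    r2l_caption: List[str], r2l_token_logprobs: List[float],
) -> Tuple[List[str], float]:
    """Pick the caption whose flow has the higher total (sentence-level) log-probability."""
    l2r_score = sum(l2r_token_logprobs)
    r2l_score = sum(r2l_token_logprobs)
    if l2r_score >= r2l_score:
        return l2r_caption, l2r_score
    # The R2L flow emits words back-to-front, so reverse it before returning.
    return list(reversed(r2l_caption)), r2l_score


if __name__ == "__main__":
    # Toy captions and made-up log-probabilities for demonstration only.
    l2r_words = ["a", "dog", "runs", "on", "the", "grass"]
    l2r_lp = [-0.2, -0.5, -0.7, -0.3, -0.1, -0.6]
    r2l_words = ["grass", "the", "on", "plays", "dog", "a"]
    r2l_lp = [-0.3, -0.1, -0.2, -0.4, -0.5, -0.2]
    caption, score = sentence_level_ensemble(l2r_words, l2r_lp, r2l_words, r2l_lp)
    print(" ".join(caption), score)
```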
