Paper Title
Prophet Attention: Predicting Attention with Future Attention for Image Captioning
Paper Authors
Paper Abstract
Recently, attention-based models have been used extensively in many sequence-to-sequence learning systems. For image captioning in particular, attention-based models are expected to ground the correct image regions with the appropriate generated words. However, at each time step of decoding, attention-based models usually use the hidden state of the current input to attend to the image regions. Under this setting, these models suffer from a "deviated focus" problem: they calculate the attention weights based on the previous words rather than the word to be generated, which impairs both grounding and captioning performance. In this paper, we propose Prophet Attention, which resembles a form of self-supervision. During training, this module uses future information to calculate the "ideal" attention weights over image regions. These "ideal" weights are then used to regularize the "deviated" attention. In this manner, image regions are grounded with the correct words. Prophet Attention can be easily incorporated into existing image captioning models to improve both their grounding and captioning performance. Experiments on the Flickr30k Entities and MSCOCO datasets show that Prophet Attention consistently outperforms baselines in both automatic metrics and human evaluations. Notably, we set a new state of the art on both benchmark datasets and achieve 1st place on the leaderboard of the online MSCOCO benchmark in terms of the default ranking score, CIDEr-c40.
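The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the attention form or the regularizer, so the scaled dot-product attention, the L1 penalty, and every name here (`attention_weights`, `prophet_regularizer`, `prev_state`, `future_word`) are assumptions made for illustration. The "deviated" weights are computed from a state that only reflects previous words, the "ideal" weights are computed from a representation of the word to be generated (available only during training), and the penalty measures how far apart the two distributions are.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def attention_weights(query, regions):
    """Scaled dot-product attention of one query over image regions.

    query:   (d,) vector
    regions: (num_regions, d) matrix of region features
    Returns a (num_regions,) probability distribution.
    """
    scores = regions @ query / np.sqrt(query.shape[0])
    return softmax(scores)


def prophet_regularizer(prev_state, future_word, regions):
    """Hypothetical L1 penalty between "deviated" and "ideal" attention.

    prev_state:  query built from previously generated words (used at test time)
    future_word: query built from the word to be generated (training only)
    """
    deviated = attention_weights(prev_state, regions)
    ideal = attention_weights(future_word, regions)
    penalty = np.abs(deviated - ideal).sum()
    return penalty, deviated, ideal


# Toy data: 5 image regions with 8-dimensional features (illustrative only).
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))
prev_state = rng.normal(size=8)
future_word = rng.normal(size=8)

penalty, deviated, ideal = prophet_regularizer(prev_state, future_word, regions)
same_penalty, _, _ = prophet_regularizer(future_word, future_word, regions)
```

In a training loop, a term like `penalty` would be added to the usual captioning loss with some weight, so that the attention computed without future information is pulled toward the attention computed with it. At inference time only the `deviated` branch is available, since the future word is unknown.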