Paper Title

Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning

Authors

Longteng Guo, Jing Liu, Xinxin Zhu, Xingjian He, Jie Jiang, Hanqing Lu

Abstract

Most image captioning models are autoregressive, i.e. they generate each word by conditioning on previously generated words, which leads to heavy latency during inference. Recently, non-autoregressive decoding has been proposed in machine translation to speed up inference by generating all words in parallel. Typically, these models use a word-level cross-entropy loss to optimize each word independently. However, such a learning process fails to consider sentence-level consistency, resulting in inferior generation quality from these non-autoregressive models. In this paper, we propose a Non-Autoregressive Image Captioning (NAIC) model with a novel training paradigm: Counterfactuals-Critical Multi-Agent Learning (CMAL). CMAL formulates NAIC as a multi-agent reinforcement learning system in which positions in the target sequence are viewed as agents that learn to cooperatively maximize a sentence-level reward. In addition, we propose to utilize massive unlabeled images to boost captioning performance. Extensive experiments on the MSCOCO image captioning benchmark show that our NAIC model achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
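The idea in the abstract can be illustrated with a minimal sketch: each output position holds its own word distribution (an "agent"), all positions sample in parallel in a single step, and each agent's learning signal is a counterfactual advantage: the sentence-level reward minus the reward obtained when only that agent's word is swapped for a baseline (here, its greedy) choice. Everything below is an illustrative assumption rather than the paper's implementation: the tiny vocabulary, the bag-of-words overlap reward standing in for CIDEr, and the three-position caption length.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and reference caption (illustrative assumptions).
VOCAB = ["a", "dog", "runs", "cat", "sleeps"]
REFERENCE = {"a", "dog", "runs"}

def reward(caption):
    """Sentence-level reward: word overlap with the reference.
    A toy stand-in for the CIDEr-style reward used in the paper."""
    return len(set(caption) & REFERENCE) / len(REFERENCE)

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def counterfactual_advantages(logits, sampled):
    """Per-position (per-agent) advantage: R(full sampled caption) minus
    R(caption with only this agent's word replaced by its greedy word)."""
    caption = [VOCAB[w] for w in sampled]
    r = reward(caption)
    greedy = logits.argmax(axis=-1)
    adv = np.zeros(len(sampled))
    for i in range(len(sampled)):
        counterfactual = list(caption)
        counterfactual[i] = VOCAB[greedy[i]]  # swap one agent's action only
        adv[i] = r - reward(counterfactual)
    return adv

# Non-autoregressive decoding: one independent categorical distribution
# per position, all positions sampled jointly in a single parallel step.
logits = rng.normal(size=(3, len(VOCAB)))
probs = softmax(logits)
sampled = np.array([rng.choice(len(VOCAB), p=p) for p in probs])
adv = counterfactual_advantages(logits, sampled)
```

In a real training loop, `adv[i]` would scale the policy-gradient term for position `i` (REINFORCE-style), so each agent is credited only for its own marginal contribution to the shared sentence reward; note that an agent whose sampled word already equals its greedy word gets exactly zero advantage.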
