Title

Exploring Discrete Diffusion Models for Image Captioning

Authors

Zixin Zhu, Yixuan Wei, Jianfeng Wang, Zhe Gan, Zheng Zhang, Le Wang, Gang Hua, Lijuan Wang, Zicheng Liu, Han Hu

Abstract

The image captioning task is typically realized by an auto-regressive method that decodes the text tokens one by one. We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility. Unlike image generation, where the output is continuous and redundant with a fixed length, text in image captions is categorical and short, with varied lengths. Therefore, naively applying the discrete diffusion model to text decoding does not work well, as shown in our experiments. To address the performance gap, we propose several key techniques, including best-first inference, a concentrated attention mask, text length prediction, and image-free training. On COCO, without additional caption pre-training, it achieves a CIDEr score of 117.8, which is +5.0 higher than the auto-regressive baseline with the same architecture in a controlled setting. It also achieves a CIDEr score 26.8 points higher than the auto-regressive baseline (230.3 vs. 203.5) on a caption infilling task. With 4M vision-language pre-training images and a base-sized model, we reach a CIDEr score of 125.1 on COCO, which is competitive with the best well-developed auto-regressive frameworks. The code is available at https://github.com/buxiangzhiren/DDCap.
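The best-first inference mentioned in the abstract can be illustrated with a minimal sketch: starting from a fully masked caption, each step commits only the positions the model is most confident about and re-predicts the rest. This is an assumed reading of the technique, not DDCap's actual implementation; the model is stood in for by a generic `logits_fn`, and the `MASK` id and step schedule are hypothetical.

```python
import numpy as np

MASK = -1  # hypothetical id for the [MASK] token


def best_first_decode(logits_fn, length, steps):
    """Best-first inference sketch for a discrete diffusion captioner.

    `logits_fn(tokens) -> (length, vocab)` stands in for the caption
    model. Start fully masked; at each step, reveal the most confident
    masked positions first and leave the rest for later steps.
    """
    tokens = np.full(length, MASK)
    per_step = int(np.ceil(length / steps))
    for _ in range(steps):
        masked = np.where(tokens == MASK)[0]
        if masked.size == 0:
            break
        logits = logits_fn(tokens)
        # softmax over the vocabulary to get per-position confidence
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        conf = probs[masked].max(-1)
        # commit the highest-confidence masked positions first
        pick = masked[np.argsort(-conf)[:per_step]]
        tokens[pick] = probs[pick].argmax(-1)
    return tokens
```

Compared with left-to-right auto-regressive decoding, nothing forces a fixed order here, which is also why the same mechanism handles infilling: pre-filled positions simply start unmasked.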
