Paper Title

UNISON: Unpaired Cross-lingual Image Captioning

Paper Authors

Jiahui Gao, Yi Zhou, Philip L. H. Yu, Shafiq Joty, Jiuxiang Gu

Paper Abstract

Image captioning has emerged as an interesting research field in recent years due to its broad application scenarios. The traditional paradigm of image captioning relies on paired image-caption datasets to train the model in a supervised manner. However, creating such paired datasets for every target language is prohibitively expensive, which hinders the extensibility of captioning technology and deprives a large part of the world's population of its benefit. In this work, we present a novel unpaired cross-lingual method to generate image captions without relying on any caption corpus in the source or the target language. Specifically, our method consists of two phases: (i) a cross-lingual auto-encoding process, which utilizes a sentence-parallel (bitext) corpus to learn the mapping from the source to the target language in the scene graph encoding space and decodes sentences in the target language, and (ii) a cross-modal unsupervised feature mapping, which seeks to map the encoded scene graph features from the image modality to the language modality. We verify the effectiveness of our proposed method on the Chinese image caption generation task. Comparisons against several existing methods demonstrate the effectiveness of our approach.
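
To make the two-phase design concrete, below is a minimal PyTorch sketch of the training structure the abstract describes. Everything here is an illustrative assumption rather than the paper's implementation: the module names, dimensions, and loss functions are invented, and plain feature vectors stand in for the paper's scene graph encodings.

```python
# Minimal, hypothetical sketch of the two-phase structure described in the
# abstract. All names, dimensions, and losses are illustrative assumptions;
# plain feature vectors stand in for the paper's scene graph encodings.
import torch
import torch.nn as nn

FEAT_DIM = 512  # assumed size of a scene-graph feature vector


class CrossLingualMapper(nn.Module):
    """Phase (i): map source-language features to the target-language space."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, FEAT_DIM),
            nn.ReLU(),
            nn.Linear(FEAT_DIM, FEAT_DIM),
        )

    def forward(self, src_feat):
        return self.net(src_feat)


class CrossModalMapper(nn.Module):
    """Phase (ii): map image-modality features into the language modality."""

    def __init__(self):
        super().__init__()
        self.net = nn.Linear(FEAT_DIM, FEAT_DIM)

    def forward(self, img_feat):
        return self.net(img_feat)


def phase1_loss(mapper, src_feat, tgt_feat):
    # Supervised by a bitext corpus: features of parallel sentences should
    # map onto each other (a real system would also decode target sentences).
    return nn.functional.mse_loss(mapper(src_feat), tgt_feat)


def phase2_loss(mapper, img_feat, lang_feat):
    # Unsupervised alignment: match first moments of the two feature
    # distributions (a stand-in objective; the paper's may differ).
    return ((mapper(img_feat).mean(0) - lang_feat.mean(0)) ** 2).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    cl, cm = CrossLingualMapper(), CrossModalMapper()
    opt = torch.optim.Adam(list(cl.parameters()) + list(cm.parameters()), lr=1e-4)

    src, tgt = torch.randn(8, FEAT_DIM), torch.randn(8, FEAT_DIM)    # bitext pairs
    img, lang = torch.randn(8, FEAT_DIM), torch.randn(64, FEAT_DIM)  # unpaired

    opt.zero_grad()
    loss = phase1_loss(cl, src, tgt) + phase2_loss(cm, img, lang)
    loss.backward()
    opt.step()
    print(f"combined loss: {loss.item():.4f}")
```

The structural point this sketch mirrors is that phase (i) is supervised only by bitext sentence pairs, while phase (ii) aligns the two modalities without any image-caption pairs, which is what makes the setting "unpaired".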
