Paper Title


Aligning Source Visual and Target Language Domains for Unpaired Video Captioning

Authors

Fenglin Liu, Xian Wu, Chenyu You, Shen Ge, Yuexian Zou, Xu Sun

Abstract


Training a supervised video captioning model requires coupled video-caption pairs. However, for many target languages, sufficient paired data are not available. To this end, we introduce the unpaired video captioning task, which aims to train models without coupled video-caption pairs in the target language. To solve this task, a natural choice is to employ a two-step pipeline system: first use a video-to-pivot captioning model to generate captions in a pivot language, and then use a pivot-to-target translation model to translate the pivot captions into the target language. However, in such a pipeline system, 1) visual information cannot reach the translation model, which produces visually irrelevant target captions; 2) errors in the generated pivot captions propagate to the translation model, resulting in disfluent target captions. To address these problems, we propose the Unpaired Video Captioning with Visual Injection system (UVC-VI). UVC-VI first introduces the Visual Injection Module (VIM), which aligns the source visual and target language domains to inject source visual information into the target language domain. Meanwhile, VIM directly connects the encoder of the video-to-pivot model and the decoder of the pivot-to-target model, allowing end-to-end inference by completely skipping the generation of pivot captions. To enhance the cross-modality injection of VIM, UVC-VI further introduces a pluggable video encoder, i.e., the Multimodal Collaborative Encoder (MCE). Experiments show that UVC-VI outperforms pipeline systems and exceeds several supervised systems. Furthermore, equipping existing supervised systems with our MCE achieves 4% and 7% relative gains in CIDEr score over current state-of-the-art models on the benchmark MSVD and MSR-VTT datasets, respectively.
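The abstract contrasts a two-step pipeline (video-to-pivot captioning, then pivot-to-target translation) with UVC-VI's end-to-end path, where VIM bridges the video-to-pivot encoder directly to the pivot-to-target decoder. A minimal schematic of that data flow is sketched below; every module is a hypothetical stand-in (a random linear map), since the paper's actual components are trained neural networks, and all names and dimensions here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature dimensions (hypothetical, for illustration only).
D_VIS, D_PIV, D_TGT = 8, 6, 6

# Stand-ins for the trained modules described in the abstract:
W_enc = rng.normal(size=(D_VIS, D_PIV))  # video-to-pivot encoder
W_vim = rng.normal(size=(D_PIV, D_TGT))  # Visual Injection Module (VIM)
W_dec = rng.normal(size=(D_TGT, D_TGT))  # pivot-to-target decoder

def encode_video(video_feats):
    """Video-to-pivot encoder: maps video features into pivot-language space."""
    return video_feats @ W_enc

def visual_injection(pivot_space_feats):
    """VIM: aligns/injects source visual features into the target language domain."""
    return pivot_space_feats @ W_vim

def decode_target(target_space_feats):
    """Pivot-to-target decoder: produces target-caption states."""
    return target_space_feats @ W_dec

video = rng.normal(size=(1, D_VIS))

# End-to-end UVC-VI inference: encoder -> VIM -> decoder,
# skipping pivot caption generation entirely (no intermediate text).
target_states = decode_target(visual_injection(encode_video(video)))
print(target_states.shape)  # (1, 6)
```

The point of the sketch is the wiring, not the arithmetic: the pipeline system would decode a pivot caption between the first and second steps (so errors and the loss of visual information occur at that text bottleneck), whereas here the visual representation flows through VIM to the target decoder without ever being rendered as pivot text.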
