Paper Title

End-to-end Generative Pretraining for Multimodal Video Captioning

Authors

Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid

Abstract

Recent video and language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos which can be effectively used for generative tasks such as multimodal video captioning. Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly. To overcome the lack of captions in unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective -- we generate future utterances given the present multimodal context, and also the present utterance given future observations. With this objective, we train an encoder-decoder model end-to-end to generate a caption from raw pixels and transcribed speech directly. Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as VideoQA, video retrieval and action classification.
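The core idea is the bidirectional generation objective: predict the future utterance from the present multimodal context, and the present utterance from the future one. The sketch below illustrates one plausible way to compute such a loss in PyTorch. The toy encoder-decoder, feature dimensions, tokenization, and equal loss weighting are all illustrative assumptions, not the paper's actual architecture.

```python
# Minimal PyTorch sketch of a bidirectional generation objective in the
# spirit of MV-GPT. Every module, dimension, and name here is an
# illustrative assumption, not the authors' implementation.
import torch
import torch.nn as nn

class ToyEncoderDecoder(nn.Module):
    """Stand-in for the multimodal video encoder + sentence decoder."""
    def __init__(self, vocab_size=1000, dim=64, frame_feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.video_proj = nn.Linear(frame_feat_dim, dim)  # assumed frame features
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, frames, cond_utt, target_in):
        # Crude multimodal fusion: mean-pooled frame features plus the
        # mean-pooled conditioning utterance form the decoder's initial state.
        ctx = self.video_proj(frames).mean(1) + self.embed(cond_utt).mean(1)
        # Teacher-forced decoding of the target utterance.
        h, _ = self.decoder(self.embed(target_in), ctx.unsqueeze(0))
        return self.lm_head(h)

def bidirectional_loss(model, frames, present_utt, future_utt):
    """Forward: predict the future utterance from the present context.
    Backward: predict the present utterance from the future one."""
    ce = nn.CrossEntropyLoss()
    # Shift targets by one token for teacher forcing.
    fwd = model(frames, present_utt, future_utt[:, :-1])
    bwd = model(frames, future_utt, present_utt[:, :-1])
    loss_fwd = ce(fwd.flatten(0, 1), future_utt[:, 1:].flatten())
    loss_bwd = ce(bwd.flatten(0, 1), present_utt[:, 1:].flatten())
    return loss_fwd + loss_bwd  # equal weighting is an assumption

# Toy usage with random data.
model = ToyEncoderDecoder()
frames = torch.randn(2, 8, 512)            # 2 clips, 8 frame features each
present = torch.randint(0, 1000, (2, 12))  # tokenized present utterance
future = torch.randint(0, 1000, (2, 12))   # tokenized future utterance
bidirectional_loss(model, frames, present, future).backward()
```

Summing both directions lets a single encoder-decoder learn caption-like generation from unlabelled video, since each utterance serves as a pseudo-caption for the other's context.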
