Paper Title

Dual-Level Decoupled Transformer for Video Captioning

Paper Authors

Yiqi Gao, Xinglin Hou, Wei Suo, Mengyang Sun, Tiezheng Ge, Yuning Jiang, Peng Wang

Paper Abstract

Video captioning aims to understand the spatio-temporal semantic concepts of a video and generate descriptive sentences. The de-facto approach to this task dictates a text generator that learns from \textit{offline-extracted} motion or appearance features produced by \textit{pre-trained} vision models. However, these methods may suffer from so-called \textbf{\textit{"couple"}} drawbacks in both \textit{video spatio-temporal representation} and \textit{sentence generation}. For the former, \textbf{\textit{"couple"}} means learning the spatio-temporal representation in a single model (3D CNN), which leads to \emph{disconnection between the pre-training and task domains} and makes \emph{end-to-end training difficult}. For the latter, \textbf{\textit{"couple"}} means treating the generation of visual-semantic and syntax-related words equally. To this end, we present $\mathcal{D}^{2}$, a dual-level decoupled transformer pipeline that addresses these drawbacks: \emph{(i)} for video spatio-temporal representation, we decouple the process into a "first-spatial-then-temporal" paradigm, releasing the potential of dedicated models (\textit{e.g.}, image-text pre-training) to connect pre-training and downstream tasks and making the entire model end-to-end trainable; \emph{(ii)} for sentence generation, we propose a \emph{Syntax-Aware Decoder} that dynamically measures the contributions of visual-semantic and syntax-related words. Extensive experiments on three widely-used benchmarks (MSVD, MSR-VTT, and VATEX) demonstrate the great potential of the proposed $\mathcal{D}^{2}$, which surpasses previous methods by a large margin on the video captioning task.
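To make the two decoupling ideas in the abstract concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation. All module and parameter names (SpatialThenTemporalEncoder, SyntaxAwareGate, spatial_encoder, etc.) are hypothetical, and the spatial encoder is assumed to be any per-frame image encoder (e.g., one from image-text pre-training) mapping a frame to a feature vector.

```python
# Minimal sketch of the "first-spatial-then-temporal" encoder and a
# syntax-aware gating head, under the assumptions stated above.
import torch
import torch.nn as nn


class SpatialThenTemporalEncoder(nn.Module):
    """Each frame is embedded independently by a (possibly image-text
    pre-trained) spatial encoder, then a lightweight temporal Transformer
    models cross-frame relations, keeping the pipeline end-to-end trainable."""

    def __init__(self, spatial_encoder: nn.Module, dim: int = 512, depth: int = 2):
        super().__init__()
        self.spatial_encoder = spatial_encoder  # assumed: (N, C, H, W) -> (N, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, channels, height, width)
        b, t = frames.shape[:2]
        frame_feats = self.spatial_encoder(frames.flatten(0, 1))  # (b*t, dim)
        frame_feats = frame_feats.view(b, t, -1)                  # (b, t, dim)
        return self.temporal_encoder(frame_feats)                 # (b, t, dim)


class SyntaxAwareGate(nn.Module):
    """Dynamically weighs visual-semantic evidence against syntax-related
    (language-context) evidence per decoding step, instead of treating the
    generation of all words equally."""

    def __init__(self, dim: int = 512, vocab_size: int = 10000):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, visual_ctx: torch.Tensor, syntax_ctx: torch.Tensor) -> torch.Tensor:
        # visual_ctx, syntax_ctx: (batch, seq_len, dim)
        g = self.gate(torch.cat([visual_ctx, syntax_ctx], dim=-1))  # (b, s, 1)
        mixed = g * visual_ctx + (1.0 - g) * syntax_ctx
        return self.out(mixed)                                      # word logits
```

In this reading, the learned gate lets content words (objects, actions) lean on the visual context while function words lean on the language context; how the paper actually computes and supervises this weighting is detailed in the full text rather than the abstract.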
