Paper Title

Dual-Level Decoupled Transformer for Video Captioning

Paper Authors

Yiqi Gao, Xinglin Hou, Wei Suo, Mengyang Sun, Tiezheng Ge, Yuning Jiang, Peng Wang

Paper Abstract

Video captioning aims to understand the spatio-temporal semantic concepts of a video and generate descriptive sentences. The de-facto approach to this task dictates a text generator that learns from \textit{offline-extracted} motion or appearance features produced by \textit{pre-trained} vision models. However, these methods may suffer from so-called \textbf{\textit{"couple"}} drawbacks in both \textit{video spatio-temporal representation} and \textit{sentence generation}. For the former, \textbf{\textit{"couple"}} means learning the spatio-temporal representation in a single model (3D CNN), which leads to \emph{disconnection between the pre-training and task domains} and makes \emph{end-to-end training difficult}. For the latter, \textbf{\textit{"couple"}} means treating the generation of visual-semantic and syntax-related words equally. To this end, we present $\mathcal{D}^{2}$, a dual-level decoupled transformer pipeline that addresses these drawbacks: \emph{(i)} for video spatio-temporal representation, we decouple the process into a "first-spatial-then-temporal" paradigm, releasing the potential of dedicated models (\textit{e.g.}, image-text pre-training) to connect pre-training and downstream tasks and making the entire model end-to-end trainable; \emph{(ii)} for sentence generation, we propose a \emph{Syntax-Aware Decoder} that dynamically measures the contributions of visual-semantic and syntax-related words. Extensive experiments on three widely-used benchmarks (MSVD, MSR-VTT, and VATEX) demonstrate the great potential of the proposed $\mathcal{D}^{2}$, which surpasses previous methods by a large margin on the video captioning task.
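To make the two decoupling ideas in the abstract concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation. All module and parameter names (SpatialThenTemporalEncoder, SyntaxAwareGate, spatial_encoder, etc.) are hypothetical, and the spatial encoder is assumed to be any per-frame image encoder (e.g., one from image-text pre-training) mapping a frame to a feature vector.

```python
# Minimal sketch of the "first-spatial-then-temporal" encoder and a
# syntax-aware gating head, under the assumptions stated above.
import torch
import torch.nn as nn


class SpatialThenTemporalEncoder(nn.Module):
    """Each frame is embedded independently by a (possibly image-text
    pre-trained) spatial encoder, then a lightweight temporal Transformer
    models cross-frame relations, keeping the pipeline end-to-end trainable."""

    def __init__(self, spatial_encoder: nn.Module, dim: int = 512, depth: int = 2):
        super().__init__()
        self.spatial_encoder = spatial_encoder  # assumed: (N, C, H, W) -> (N, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, channels, height, width)
        b, t = frames.shape[:2]
        frame_feats = self.spatial_encoder(frames.flatten(0, 1))  # (b*t, dim)
        frame_feats = frame_feats.view(b, t, -1)                  # (b, t, dim)
        return self.temporal_encoder(frame_feats)                 # (b, t, dim)


class SyntaxAwareGate(nn.Module):
    """Dynamically weighs visual-semantic evidence against syntax-related
    (language-context) evidence per decoding step, instead of treating the
    generation of all words equally."""

    def __init__(self, dim: int = 512, vocab_size: int = 10000):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, visual_ctx: torch.Tensor, syntax_ctx: torch.Tensor) -> torch.Tensor:
        # visual_ctx, syntax_ctx: (batch, seq_len, dim)
        g = self.gate(torch.cat([visual_ctx, syntax_ctx], dim=-1))  # (b, s, 1)
        mixed = g * visual_ctx + (1.0 - g) * syntax_ctx
        return self.out(mixed)                                      # word logits
```

In this reading, the learned gate lets content words (objects, actions) lean on the visual context while function words lean on the language context; how the paper actually computes and supervises this weighting is detailed in the full text rather than the abstract.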
