Paper Title


Semantic Role Aware Correlation Transformer for Text to Video Retrieval

Authors

Burak Satar, Hongyuan Zhu, Xavier Bresson, Joo Hwee Lim

Abstract


With the emergence of social media, voluminous video clips are uploaded every day, and retrieving the most relevant visual content with a language query becomes critical. Most approaches aim to learn a joint embedding space for plain textual and visual contents without adequately exploiting their intra-modality structures and inter-modality correlations. This paper proposes a novel transformer that explicitly disentangles the text and video into semantic roles of objects, spatial contexts, and temporal contexts, with an attention scheme that learns the intra- and inter-role correlations among the three roles to discover discriminative features for matching at different levels. Preliminary results on the popular YouCook2 dataset indicate that our approach surpasses a current state-of-the-art method by a large margin on all metrics, and also outperforms two other SOTA methods on two of the metrics.
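
The abstract only describes the architecture at a high level. Below is a minimal, hypothetical PyTorch sketch of what intra- and inter-role attention over three role streams (objects, spatial context, temporal context) could look like, followed by cosine-similarity matching between the text and video embeddings. This is not the authors' implementation; the module structure, dimensions, pooling, and matching score are all illustrative assumptions.

```python
# Hedged sketch: disentangle each modality into three "semantic role" streams,
# model intra-role correlations with self-attention and inter-role correlations
# with cross-attention, then match text and video in a joint embedding space.
# All names and hyperparameters here are assumptions, not the paper's values.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RoleCorrelationEncoder(nn.Module):
    """Encodes three role streams of one modality with intra- and inter-role attention."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Intra-role: self-attention within each role sequence.
        self.intra = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)]
        )
        # Inter-role: each role attends to the concatenation of the other two roles.
        self.inter = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)]
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, roles):  # roles: list of 3 tensors, each (B, L_i, dim)
        # Intra-role correlations with residual connections.
        intra_out = [attn(r, r, r)[0] + r for attn, r in zip(self.intra, roles)]
        # Inter-role correlations: query one role, key/value the other two.
        fused = []
        for i, attn in enumerate(self.inter):
            others = torch.cat([intra_out[j] for j in range(3) if j != i], dim=1)
            fused.append(attn(intra_out[i], others, others)[0] + intra_out[i])
        # Pool each role to a single vector, then average into a modality embedding.
        pooled = torch.stack([self.norm(f).mean(dim=1) for f in fused], dim=1)  # (B, 3, dim)
        return pooled.mean(dim=1)  # (B, dim)


if __name__ == "__main__":
    B, dim = 2, 256
    text_enc, video_enc = RoleCorrelationEncoder(dim), RoleCorrelationEncoder(dim)
    # Toy role features: [objects, spatial context, temporal context] per modality.
    text_roles = [torch.randn(B, n, dim) for n in (5, 7, 9)]
    video_roles = [torch.randn(B, n, dim) for n in (10, 6, 12)]
    t, v = text_enc(text_roles), video_enc(video_roles)
    # Retrieval score: cosine similarity between text and video embeddings.
    sim = F.cosine_similarity(t, v, dim=-1)
    print(sim.shape)  # torch.Size([2])
```

In a retrieval setting, such per-pair similarities would typically be trained with a contrastive or ranking loss so that matching text-video pairs score higher than mismatched ones; the exact objective used by the paper is not specified in the abstract.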
