Paper Title


Semantic Role Aware Correlation Transformer for Text to Video Retrieval

Authors

Burak Satar, Hongyuan Zhu, Xavier Bresson, Joo Hwee Lim

Abstract


With the emergence of social media, voluminous video clips are uploaded every day, and retrieving the most relevant visual content with a language query becomes critical. Most approaches aim to learn a joint embedding space for plain textual and visual contents without adequately exploiting their intra-modality structures and inter-modality correlations. This paper proposes a novel transformer that explicitly disentangles the text and video into semantic roles of objects, spatial contexts, and temporal contexts, with an attention scheme that learns the intra- and inter-role correlations among the three roles to discover discriminative features for matching at different levels. Preliminary results on the popular YouCook2 dataset indicate that our approach surpasses a current state-of-the-art method by a large margin on all metrics, and also outperforms two other SOTA methods on two of the metrics.
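
The abstract only describes the architecture at a high level. Below is a minimal, hypothetical PyTorch sketch of what intra- and inter-role attention over three role streams (objects, spatial context, temporal context) could look like, followed by cosine-similarity matching between the text and video embeddings. This is not the authors' implementation; the module structure, dimensions, pooling, and matching score are all illustrative assumptions.

```python
# Hedged sketch: disentangle each modality into three "semantic role" streams,
# model intra-role correlations with self-attention and inter-role correlations
# with cross-attention, then match text and video in a joint embedding space.
# All names and hyperparameters here are assumptions, not the paper's values.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RoleCorrelationEncoder(nn.Module):
    """Encodes three role streams of one modality with intra- and inter-role attention."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Intra-role: self-attention within each role sequence.
        self.intra = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)]
        )
        # Inter-role: each role attends to the concatenation of the other two roles.
        self.inter = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)]
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, roles):  # roles: list of 3 tensors, each (B, L_i, dim)
        # Intra-role correlations with residual connections.
        intra_out = [attn(r, r, r)[0] + r for attn, r in zip(self.intra, roles)]
        # Inter-role correlations: query one role, key/value the other two.
        fused = []
        for i, attn in enumerate(self.inter):
            others = torch.cat([intra_out[j] for j in range(3) if j != i], dim=1)
            fused.append(attn(intra_out[i], others, others)[0] + intra_out[i])
        # Pool each role to a single vector, then average into a modality embedding.
        pooled = torch.stack([self.norm(f).mean(dim=1) for f in fused], dim=1)  # (B, 3, dim)
        return pooled.mean(dim=1)  # (B, dim)


if __name__ == "__main__":
    B, dim = 2, 256
    text_enc, video_enc = RoleCorrelationEncoder(dim), RoleCorrelationEncoder(dim)
    # Toy role features: [objects, spatial context, temporal context] per modality.
    text_roles = [torch.randn(B, n, dim) for n in (5, 7, 9)]
    video_roles = [torch.randn(B, n, dim) for n in (10, 6, 12)]
    t, v = text_enc(text_roles), video_enc(video_roles)
    # Retrieval score: cosine similarity between text and video embeddings.
    sim = F.cosine_similarity(t, v, dim=-1)
    print(sim.shape)  # torch.Size([2])
```

In a retrieval setting, such per-pair similarities would typically be trained with a contrastive or ranking loss so that matching text-video pairs score higher than mismatched ones; the exact objective used by the paper is not specified in the abstract.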
