Paper Title
Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos
Paper Authors
Paper Abstract
Natural Language Video Grounding (NLVG) aims to localize time segments in an untrimmed video according to sentence queries. In this work, we present a new paradigm for NLVG named Explore-And-Match that seamlessly unifies the strengths of the two streams of NLVG methods: proposal-free and proposal-based; the former explores the search space to find time segments directly, while the latter matches predefined time segments with ground truths. To achieve this, we formulate NLVG as a set prediction problem and design an end-to-end trainable Language Video Transformer (LVTR), which enjoys two favorable properties: rich contextualization power and parallel decoding. We train LVTR with two losses. First, a temporal localization loss allows the time segments of all queries to regress to targets (explore). Second, a set guidance loss couples each query with its respective target (match). To our surprise, we found that the training schedule shows a divide-and-conquer-like pattern: time segments are first diversified regardless of the targets, then coupled with each target, and finally fine-tuned to the target again. Moreover, LVTR is highly efficient and effective: it infers faster than previous baselines (by 2x or more) and sets competitive results on two NLVG benchmarks (ActivityNet Captions and Charades-STA). Code is available at https://github.com/sangminwoo/Explore-And-Match.
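The abstract does not spell out the two loss formulations, so the following is a minimal, hypothetical PyTorch sketch of a DETR-style set-prediction step consistent with the description: a pairwise cost between predicted and ground-truth segments, a Hungarian assignment that couples each query to a target (match), and a regression loss on the matched pairs (explore). The function names and the (center, width) segment parameterization are assumptions for illustration, not taken from the authors' code.

```python
# Hypothetical sketch of the explore/match signals described in the abstract,
# assuming a DETR-style set-prediction recipe; not the authors' implementation.
import torch
from scipy.optimize import linear_sum_assignment

def pairwise_segment_cost(pred, gt):
    """L1 distance between every predicted and ground-truth segment.

    Segments are assumed to be (center, width) pairs normalized to [0, 1].
    """
    return torch.cdist(pred, gt, p=1)  # (num_queries, num_targets)

def match_and_localize(pred, gt):
    """Couple queries to targets (match), then regress matched pairs (explore)."""
    cost = pairwise_segment_cost(pred, gt)
    # Hungarian matching on the detached cost assigns each target to one query.
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    # Temporal localization loss is computed on the matched pairs only.
    return cost[rows, cols].mean()

# Toy example: 4 parallel queries, 2 ground-truth segments for one sentence.
pred = torch.rand(4, 2, requires_grad=True)   # predicted (center, width)
gt = torch.tensor([[0.3, 0.2], [0.7, 0.1]])   # ground-truth (center, width)
loss = match_and_localize(pred, gt)
loss.backward()  # gradients flow only through the matched queries
```

In this reading, parallel decoding means all queries predict their segments simultaneously, and the Hungarian assignment stands in for the set guidance loss that couples each query with its target.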