Paper Title

Temporal Context Aggregation for Video Retrieval with Contrastive Learning

Paper Authors

Jie Shao, Xin Wen, Bingchen Zhao, Xiangyang Xue

Paper Abstract

The current research focus on Content-Based Video Retrieval requires higher-level video representations that describe the long-range semantic dependencies of relevant incidents, events, etc. However, existing methods commonly process the frames of a video as individual images or short clips, making the modeling of long-range semantic dependencies difficult. In this paper, we propose TCA (Temporal Context Aggregation for Video Retrieval), a video representation learning framework that incorporates long-range temporal information between frame-level features using the self-attention mechanism. To train it on video retrieval datasets, we propose a supervised contrastive learning method that performs automatic hard negative mining and utilizes the memory bank mechanism to increase the capacity of negative samples. Extensive experiments are conducted on multiple video retrieval tasks, such as CC_WEB_VIDEO, FIVR-200K, and EVVE. The proposed method shows a significant performance advantage (~17% mAP on FIVR-200K) over state-of-the-art methods with video-level features, and delivers competitive results with 22x faster inference time compared with frame-level features.
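To make the two ideas named in the abstract concrete, below is a minimal sketch (not the authors' released code) of self-attention aggregation over pre-extracted frame-level features and a contrastive loss whose negatives come from a memory bank. All module and parameter names are illustrative assumptions; the paper's actual architecture and hard-negative mining details may differ.

```python
# Minimal illustrative sketch, assuming pre-extracted frame descriptors of dim 512.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalContextAggregator(nn.Module):
    """Aggregate a sequence of frame features with self-attention (illustrative)."""

    def __init__(self, dim=512, num_heads=8, num_layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, dim) frame-level descriptors
        ctx = self.encoder(frame_feats)          # contextualized frame features
        video_feat = ctx.mean(dim=1)             # pool into one video-level vector
        return F.normalize(video_feat, dim=-1)   # unit norm for cosine similarity


def contrastive_loss_with_memory(query, positive, memory_bank, temperature=0.07):
    # query, positive: (batch, dim); memory_bank: (bank_size, dim) of stored negatives
    pos_logit = (query * positive).sum(dim=-1, keepdim=True)   # (batch, 1)
    neg_logits = query @ memory_bank.t()                        # (batch, bank_size)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(query.size(0), dtype=torch.long)       # positive sits at index 0
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    model = TemporalContextAggregator()
    frames = torch.randn(2, 64, 512)                          # 2 videos, 64 frames each
    q = model(frames)
    p = model(frames + 0.01 * torch.randn_like(frames))       # stand-in "positive" view
    bank = F.normalize(torch.randn(4096, 512), dim=-1)        # stand-in memory bank
    print(contrastive_loss_with_memory(q, p, bank))
```

In this sketch the memory bank is just a fixed tensor; in practice it would be refreshed with recent video embeddings so that the softmax over `logits` sees a large pool of negatives, which is the capacity benefit the abstract refers to.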
