Paper Title
Video Self-Stitching Graph Network for Temporal Action Localization
Paper Authors
Paper Abstract
Temporal action localization (TAL) in videos is a challenging task, especially due to the large variation in action temporal scales. Short actions usually occupy the major proportion in datasets, yet tend to have the lowest performance. In this paper, we confront the challenge of short actions and propose a multi-level cross-scale solution dubbed video self-stitching graph network (VSGN). VSGN has two key components: video self-stitching (VSS) and cross-scale graph pyramid network (xGPN). In VSS, we focus on a short period of a video and magnify it along the temporal dimension to obtain a larger scale. We stitch the original clip and its magnified counterpart into one input sequence to take advantage of the complementary properties of both scales. The xGPN component further exploits cross-scale correlations through a pyramid of cross-scale graph networks, each containing a hybrid module that aggregates features both across scales and within the same scale. VSGN not only enhances feature representations, but also generates more positive anchors for short actions and more short training samples. Experiments demonstrate that VSGN clearly improves the localization performance of short actions and achieves state-of-the-art overall performance on THUMOS-14 and ActivityNet-v1.3.
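To make the stitching step concrete, the following is a minimal sketch of the VSS idea, assuming clip features of shape (C, T) and temporal magnification by linear interpolation. The function name vss_stitch and the interpolation choice are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def vss_stitch(clip: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Up-sample a short clip along time and stitch it to the original.

    Illustrative sketch (not the paper's code).
    clip: feature tensor of shape (C, T).
    Returns a tensor of shape (C, T + scale * T): the original clip followed
    by its temporally magnified counterpart in one input sequence.
    """
    c, t = clip.shape
    # Temporal magnification: linear interpolation along the time axis.
    magnified = F.interpolate(clip.unsqueeze(0), size=scale * t,
                              mode="linear", align_corners=False).squeeze(0)
    # Stitch the original and magnified clips into one sequence along time.
    return torch.cat([clip, magnified], dim=-1)

# Example: a 64-step feature clip stitched with its 2x magnified version.
features = torch.randn(256, 64)           # (C=256, T=64)
stitched = vss_stitch(features, scale=2)  # shape (256, 192)
```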
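Similarly, a hybrid module that combines within-scale and cross-scale aggregation on the stitched sequence could look like the sketch below. This stand-in uses a temporal convolution for within-scale aggregation and a learned 1x1 fusion of temporally aligned locations across the two scales in place of explicit graph edges; the class name, layer choices, and alignment scheme are all assumptions, not the paper's xGPN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridModule(nn.Module):
    """Illustrative within-scale + cross-scale aggregation (not the paper's xGPN)."""

    def __init__(self, channels: int, t: int, scale: int = 2):
        super().__init__()
        self.t, self.scale = t, scale
        # Within-scale aggregation: temporal convolution over local neighbors.
        self.local = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # Cross-scale aggregation: fuse each location with its aligned
        # counterpart at the other scale (a stand-in for graph edges).
        self.cross = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T + scale*T) -- original clip followed by magnified clip.
        orig, mag = x[..., :self.t], x[..., self.t:]
        # Resample each branch to the other's length so corresponding
        # temporal locations across scales can be paired.
        mag_down = F.interpolate(mag, size=self.t, mode="linear",
                                 align_corners=False)
        orig_up = F.interpolate(orig, size=self.scale * self.t, mode="linear",
                                align_corners=False)
        orig_out = self.local(orig) + self.cross(torch.cat([orig, mag_down], dim=1))
        mag_out = self.local(mag) + self.cross(torch.cat([mag, orig_up], dim=1))
        # Re-stitch the two scales into one sequence.
        return torch.cat([orig_out, mag_out], dim=-1)

# Example: one hybrid step on a stitched sequence of 64 + 128 time steps.
module = HybridModule(channels=256, t=64, scale=2)
out = module(torch.randn(4, 256, 192))  # -> shape (4, 256, 192)
```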