Paper Title
OCSampler: Compressing Videos to One Clip with Single-step Sampling
Paper Authors
Paper Abstract
In this paper, we propose a framework named OCSampler to explore a compact yet effective video representation with one short clip for efficient video recognition. Recent works prefer to formulate frame sampling as a sequential decision task, selecting frames one by one according to their importance, whereas we present a new paradigm that learns instance-specific video condensation policies to select informative frames representing the entire video in a single step. Our basic motivation is that the key to efficient video recognition lies in processing the whole sequence at once rather than picking frames sequentially. Accordingly, these policies are derived in one step from a lightweight skim network together with a simple yet effective policy network. Moreover, we extend the proposed method with a frame-number budget, enabling the framework to produce correct predictions with high confidence using as few frames as possible. Experiments on four benchmarks, i.e., ActivityNet, Mini-Kinetics, FCVID, and Mini-Sports1M, demonstrate the effectiveness of OCSampler over previous methods in terms of accuracy, theoretical computational expense, and actual inference speed. We also evaluate its generalization power across different classifiers, numbers of sampled frames, and search spaces. In particular, we achieve 76.9% mAP at 21.7 GFLOPs on ActivityNet with an impressive throughput of 123.9 videos/s on a single TITAN Xp GPU.
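The single-step condensation idea described above (one cheap pass of a skim network over all candidate frames, followed by a policy network that scores and selects the informative ones all at once) can be illustrated with a short sketch. The following is a minimal PyTorch mock-up under stated assumptions, not the authors' released implementation: the module name SingleStepSampler, all layer sizes, and the hard top-k selection are illustrative choices, and a trainable version would further require a differentiable relaxation such as Gumbel top-k.

```python
import torch
import torch.nn as nn

class SingleStepSampler(nn.Module):
    """Hypothetical sketch: condense T candidate frames into k frames in one step."""

    def __init__(self, feat_dim=128, num_candidates=16, num_selected=6):
        super().__init__()
        self.num_selected = num_selected
        # Lightweight "skim" network: one cheap pass over every candidate frame.
        self.skim = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        # Policy network: sees all frame features jointly and scores every
        # candidate in a single step (no sequential, frame-by-frame decisions).
        self.policy = nn.Linear(feat_dim * num_candidates, num_candidates)

    def forward(self, frames):
        # frames: (B, T, 3, H, W) low-resolution candidate frames.
        b, t = frames.shape[:2]
        feats = self.skim(frames.flatten(0, 1)).view(b, t, -1)   # (B, T, D)
        scores = self.policy(feats.flatten(1))                   # (B, T)
        # Hard top-k keeps the k highest-scoring frames; training would need a
        # differentiable surrogate (e.g., Gumbel top-k), omitted in this sketch.
        idx = scores.topk(self.num_selected, dim=1).indices      # (B, k)
        idx = idx[..., None, None, None].expand(-1, -1, *frames.shape[2:])
        return torch.gather(frames, 1, idx)                      # (B, k, 3, H, W)


# Usage: condense a 16-frame video into a 6-frame clip for a downstream classifier.
sampler = SingleStepSampler()
clip = sampler(torch.randn(2, 16, 3, 64, 64))
print(clip.shape)  # torch.Size([2, 6, 3, 64, 64])
```

Because the policy scores all candidates jointly from the concatenated skim features, the whole selection costs a single forward pass, which is what makes the per-video overhead small compared with sequential frame-by-frame decision methods.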