Paper Title
Few-shot Action Recognition with Permutation-invariant Attention
Paper Authors
Paper Abstract
Many few-shot learning models focus on recognising images. In contrast, we tackle the challenging task of few-shot action recognition from videos. We build on a C3D encoder for spatio-temporal video blocks to capture short-range action patterns. Such encoded blocks are aggregated by permutation-invariant pooling to make our approach robust to varying action lengths and long-range temporal dependencies, whose patterns are unlikely to repeat even in clips of the same class. Subsequently, the pooled representations are combined into simple relation descriptors which encode so-called query and support clips. Finally, the relation descriptors are fed to a comparator with the goal of similarity learning between query and support clips. Importantly, to re-weight block contributions during pooling, we exploit spatial and temporal attention modules and self-supervision. In naturalistic clips (of the same class) there exists a temporal distribution shift: the locations of discriminative temporal action hotspots vary. Thus, we permute the blocks of a clip and align the resulting attention regions with the similarly permuted attention regions of the non-permuted clip, to train the attention mechanism to be invariant to block (and thus long-term hotspot) permutations. Our method outperforms the state of the art on the HMDB51, UCF101, and miniMIT datasets.
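To make the described pipeline concrete, below is a minimal PyTorch sketch of its main ingredients: a block encoder standing in for C3D, attention-re-weighted pooling over blocks, a concatenation-based relation descriptor fed to a comparator, and a self-supervised permutation-alignment loss on the attention weights. All names (`FewShotActionComparator`, `block_encoder`, `attend`, `permutation_alignment_loss`), layer sizes, and the exact attention and comparator designs are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FewShotActionComparator(nn.Module):
    """Minimal sketch of the pipeline in the abstract; layer choices are
    illustrative stand-ins, not the authors' exact architecture."""

    def __init__(self, feat_dim=512):
        super().__init__()
        # Stand-in for the C3D encoder of spatio-temporal video blocks.
        self.block_encoder = nn.Sequential(
            nn.Conv3d(3, feat_dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
        )
        # Temporal attention over the block sequence. It sees neighbouring
        # blocks, so by itself it is not permutation-invariant; the
        # self-supervised loss below trains it towards invariance.
        self.attention = nn.Conv1d(feat_dim, 1, kernel_size=3, padding=1)
        # Comparator scoring similarity from the relation descriptor.
        self.comparator = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 1),
        )

    def attend(self, blocks):
        # blocks: (num_blocks, C, T, H, W) -> features and attention weights.
        feats = self.block_encoder(blocks)                 # (N, D)
        logits = self.attention(feats.t().unsqueeze(0))    # (1, 1, N)
        weights = torch.softmax(logits.flatten(), dim=0)   # (N,)
        return feats, weights

    def pool(self, blocks):
        # Attention-weighted sum over blocks: an order-agnostic aggregation
        # once the weights behave consistently under permutation.
        feats, weights = self.attend(blocks)
        return (weights.unsqueeze(1) * feats).sum(dim=0)   # (D,)

    def forward(self, query_blocks, support_blocks):
        # Relation descriptor: here, simple concatenation of pooled clips.
        rel = torch.cat([self.pool(query_blocks), self.pool(support_blocks)])
        return self.comparator(rel)                        # similarity score

def permutation_alignment_loss(model, blocks):
    """Self-supervision from the abstract: attention computed on permuted
    blocks should match the identically permuted attention of the
    non-permuted clip."""
    perm = torch.randperm(blocks.size(0), device=blocks.device)
    _, w_orig = model.attend(blocks)
    _, w_perm = model.attend(blocks[perm])
    return ((w_perm - w_orig[perm]) ** 2).mean()
```

Note that the aggregation is a weighted sum over blocks, so once the attention weights transform consistently under block permutation (which the alignment loss encourages), the pooled clip representation becomes insensitive to the ordering of blocks, matching the robustness to varying action lengths and hotspot locations claimed in the abstract.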