Paper Title
Few-shot Action Recognition with Permutation-invariant Attention
Paper Authors
Paper Abstract
Many few-shot learning models focus on recognising images. In contrast, we tackle the challenging task of few-shot action recognition from videos. We build on a C3D encoder for spatio-temporal video blocks to capture short-range action patterns. Such encoded blocks are aggregated by permutation-invariant pooling to make our approach robust to varying action lengths and long-range temporal dependencies, whose patterns are unlikely to repeat even in clips of the same class. Subsequently, the pooled representations are combined into simple relation descriptors which encode so-called query and support clips. Finally, the relation descriptors are fed to a comparator with the goal of similarity learning between query and support clips. Importantly, to re-weight block contributions during pooling, we exploit spatial and temporal attention modules and self-supervision. In naturalistic clips (of the same class) there exists a temporal distribution shift: the locations of discriminative temporal action hotspots vary. Thus, we permute the blocks of a clip and align the resulting attention regions with the similarly permuted attention regions of the non-permuted clip, to train the attention mechanism to be invariant to block (and thus long-term hotspot) permutations. Our method outperforms the state of the art on the HMDB51, UCF101, and miniMIT datasets.
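To make the described pipeline concrete, below is a minimal PyTorch sketch of its main ingredients: a block encoder standing in for C3D, attention-re-weighted pooling over blocks, a concatenation-based relation descriptor fed to a comparator, and a self-supervised permutation-alignment loss on the attention weights. All names (`FewShotActionComparator`, `block_encoder`, `attend`, `permutation_alignment_loss`), layer sizes, and the exact attention and comparator designs are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FewShotActionComparator(nn.Module):
    """Minimal sketch of the pipeline in the abstract; layer choices are
    illustrative stand-ins, not the authors' exact architecture."""

    def __init__(self, feat_dim=512):
        super().__init__()
        # Stand-in for the C3D encoder of spatio-temporal video blocks.
        self.block_encoder = nn.Sequential(
            nn.Conv3d(3, feat_dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
        )
        # Temporal attention over the block sequence. It sees neighbouring
        # blocks, so by itself it is not permutation-invariant; the
        # self-supervised loss below trains it towards invariance.
        self.attention = nn.Conv1d(feat_dim, 1, kernel_size=3, padding=1)
        # Comparator scoring similarity from the relation descriptor.
        self.comparator = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 1),
        )

    def attend(self, blocks):
        # blocks: (num_blocks, C, T, H, W) -> features and attention weights.
        feats = self.block_encoder(blocks)                 # (N, D)
        logits = self.attention(feats.t().unsqueeze(0))    # (1, 1, N)
        weights = torch.softmax(logits.flatten(), dim=0)   # (N,)
        return feats, weights

    def pool(self, blocks):
        # Attention-weighted sum over blocks: an order-agnostic aggregation
        # once the weights behave consistently under permutation.
        feats, weights = self.attend(blocks)
        return (weights.unsqueeze(1) * feats).sum(dim=0)   # (D,)

    def forward(self, query_blocks, support_blocks):
        # Relation descriptor: here, simple concatenation of pooled clips.
        rel = torch.cat([self.pool(query_blocks), self.pool(support_blocks)])
        return self.comparator(rel)                        # similarity score

def permutation_alignment_loss(model, blocks):
    """Self-supervision from the abstract: attention computed on permuted
    blocks should match the identically permuted attention of the
    non-permuted clip."""
    perm = torch.randperm(blocks.size(0), device=blocks.device)
    _, w_orig = model.attend(blocks)
    _, w_perm = model.attend(blocks[perm])
    return ((w_perm - w_orig[perm]) ** 2).mean()
```

Note that the aggregation is a weighted sum over blocks, so once the attention weights transform consistently under block permutation (which the alignment loss encourages), the pooled clip representation becomes insensitive to the ordering of blocks, matching the robustness to varying action lengths and hotspot locations claimed in the abstract.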