Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with Hierarchical Atomic Actions

Paper Title

Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with Hierarchical Atomic Actions

Authors

Li, Zhi; He, Lu; Xu, Huijuan

Abstract

Action understanding has evolved into the era of fine granularity, as most human behaviors in real life have only minor differences. To detect these fine-grained actions accurately in a label-efficient way, we tackle the problem of weakly-supervised fine-grained temporal action detection in videos for the first time. Without careful design to capture the subtle differences between fine-grained actions, previous weakly-supervised models for general action detection cannot perform well in the fine-grained setting. We propose to model actions as combinations of reusable atomic actions, which are automatically discovered from data through self-supervised clustering, in order to capture the commonality and individuality of fine-grained actions. The learnt atomic actions, represented by visual concepts, are further mapped to fine and coarse action labels by leveraging the semantic label hierarchy. Our approach constructs a visual representation hierarchy of four levels: clip level, atomic action level, fine action class level, and coarse action class level, with supervision at each level. Extensive experiments on two large-scale fine-grained video datasets, FineAction and FineGym, show the benefit of our proposed weakly-supervised model for fine-grained action detection, which achieves state-of-the-art results.
