Paper Title

Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework

Paper Authors

Xiaodong Chen, Xinchen Liu, Wu Liu, Kun Liu, Dong Wu, Yongdong Zhang, Tao Mei

Paper Abstract

Action recognition from videos, i.e., classifying a video into one of the pre-defined action types, has been a popular topic in the communities of artificial intelligence, multimedia, and signal processing. However, existing methods usually consider an input video as a whole and learn models, e.g., Convolutional Neural Networks (CNNs), with coarse video-level class labels. These methods can only output an action class for the video, but cannot provide fine-grained and explainable cues to answer why the video shows a specific action. Therefore, researchers have started to focus on a new task, Part-level Action Parsing (PAP), which aims not only to predict the video-level action but also to recognize the frame-level fine-grained actions or interactions of body parts for each person in the video. To this end, we propose a coarse-to-fine framework for this challenging task. In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action. Moreover, to balance accuracy and computation in part-level action parsing, we propose to recognize part-level actions from segment-level features. Furthermore, to overcome the ambiguity of body parts, we propose a pose-guided positional embedding method to accurately localize body parts. Through comprehensive experiments on a large-scale dataset, i.e., Kinetics-TPS, our framework achieves state-of-the-art performance and outperforms existing methods over a 31.10% ROC score.
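
To make the two-stage design concrete, below is a minimal PyTorch-style sketch of the idea described in the abstract: a coarse head that predicts the video-level class, and a fine head that classifies part-level actions from segment-level features augmented with a pose-guided positional embedding. This assumes a person-level video encoder producing segment-level features and an off-the-shelf 2D pose estimator supplying keypoints; all class names, tensor shapes, and arguments (e.g., `CoarseToFineParser`, `num_part_classes`) are hypothetical illustrations, not the authors' released code.

```python
# Illustrative sketch only; module names, shapes, and arguments are assumptions.
import torch
import torch.nn as nn


class PoseGuidedPositionalEmbedding(nn.Module):
    """Adds projected 2D keypoint coordinates to segment-level features
    so the part branch knows where each body part is."""

    def __init__(self, num_joints: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * num_joints, embed_dim)

    def forward(self, features: torch.Tensor, keypoints: torch.Tensor) -> torch.Tensor:
        # features:  (B, T, D) segment-level features of one person
        # keypoints: (B, T, num_joints, 2) normalized pose coordinates
        return features + self.proj(keypoints.flatten(start_dim=2))


class CoarseToFineParser(nn.Module):
    """Coarse head: one action class per video.
    Fine head: part-level action per temporal segment."""

    def __init__(self, backbone: nn.Module, embed_dim: int, num_joints: int,
                 num_video_classes: int, num_part_classes: int):
        super().__init__()
        self.backbone = backbone  # any video encoder returning (B, T, embed_dim)
        self.video_head = nn.Linear(embed_dim, num_video_classes)
        self.pose_embed = PoseGuidedPositionalEmbedding(num_joints, embed_dim)
        self.part_head = nn.Linear(embed_dim, num_part_classes)

    def forward(self, clips: torch.Tensor, keypoints: torch.Tensor):
        feats = self.backbone(clips)                        # (B, T, D) segment-level features
        video_logits = self.video_head(feats.mean(dim=1))   # coarse: video-level prediction
        part_logits = self.part_head(self.pose_embed(feats, keypoints))  # fine: per segment
        return video_logits, part_logits
```

Here the pose-guided positional embedding is realized as a simple additive projection of keypoint coordinates; the paper's actual embedding and backbone choices may differ.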
