Title
Skeleton Focused Human Activity Recognition in RGB Video
Authors
Abstract
The data-driven approach that learns an optimal representation of visual features such as skeleton frames or RGB videos is currently the dominant paradigm for activity recognition. While existing single-modality approaches have achieved substantial improvements with increasingly large datasets, fusing multiple data modalities at the feature level has seldom been attempted. In this paper, we propose a multimodal feature fusion model that utilizes both skeleton and RGB modalities to infer human activity. The objective is to improve activity recognition accuracy by effectively exploiting the mutually complementary information across data modalities. For the skeleton modality, we use a graph convolutional subnetwork to learn a skeleton representation. For the RGB modality, we extract spatio-temporal regions of interest from the RGB video and use attention features derived from the skeleton modality to guide the learning process. The model can be trained either individually per modality or jointly, in an end-to-end manner, with back-propagation. Experimental results on the NTU-RGB+D and Northwestern-UCLA Multiview datasets show state-of-the-art performance, indicating that the proposed skeleton-driven attention mechanism for the RGB modality strengthens the communication between data modalities and yields more discriminative features for inferring human activities.
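To make the two-branch architecture sketched in the abstract concrete, below is a minimal PyTorch sketch: a graph-convolutional subnetwork encodes the skeleton joints, and a gate derived from the pooled skeleton features re-weights the RGB feature map before the two branches are fused for classification. The class names (`SkeletonGCNLayer`, `SkeletonGuidedFusion`), the layer sizes, and the channel-wise sigmoid form of the attention are illustrative assumptions under this reading of the abstract, not the paper's actual implementation.

```python
# Minimal sketch of skeleton-guided multimodal fusion (assumed design, not
# the paper's code): a GCN branch encodes joints, and a skeleton-derived
# attention gate re-weights RGB features before feature-level fusion.
import torch
import torch.nn as nn


class SkeletonGCNLayer(nn.Module):
    """One graph-convolution step: aggregate joint features over a fixed
    adjacency matrix, then apply a learned linear map with ReLU."""

    def __init__(self, in_dim: int, out_dim: int, adj: torch.Tensor):
        super().__init__()
        self.register_buffer("adj", adj)          # (J, J) joint adjacency
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, joints, in_dim) -> (batch, joints, out_dim)
        return torch.relu(self.linear(self.adj @ x))


class SkeletonGuidedFusion(nn.Module):
    """Fuse skeleton and RGB features; the skeleton branch also produces a
    channel-wise attention gate applied to the RGB feature map."""

    def __init__(self, adj: torch.Tensor, num_classes: int,
                 joint_dim: int = 3, hidden: int = 64, rgb_ch: int = 256):
        super().__init__()
        self.gcn1 = SkeletonGCNLayer(joint_dim, hidden, adj)
        self.gcn2 = SkeletonGCNLayer(hidden, hidden, adj)
        # Map pooled skeleton features to per-channel attention logits.
        self.attn = nn.Linear(hidden, rgb_ch)
        self.classifier = nn.Linear(hidden + rgb_ch, num_classes)

    def forward(self, joints: torch.Tensor, rgb_feat: torch.Tensor):
        # joints:   (batch, J, joint_dim)  skeleton joint coordinates
        # rgb_feat: (batch, rgb_ch, H, W)  features from any RGB backbone
        s = self.gcn2(self.gcn1(joints)).mean(dim=1)      # (batch, hidden)
        # Skeleton-driven attention gates the RGB feature channels.
        gate = torch.sigmoid(self.attn(s))                # (batch, rgb_ch)
        r = (rgb_feat * gate[:, :, None, None]).mean(dim=(2, 3))
        return self.classifier(torch.cat([s, r], dim=1))  # fused logits


if __name__ == "__main__":
    J = 25                              # e.g. NTU-RGB+D skeletons have 25 joints
    adj = torch.eye(J)                  # placeholder adjacency; a real model
                                        # would use the normalized skeleton graph
    model = SkeletonGuidedFusion(adj, num_classes=60)
    logits = model(torch.randn(2, J, 3), torch.randn(2, 256, 7, 7))
    print(logits.shape)                 # torch.Size([2, 60])
```

Because the fused logits come from a single differentiable graph, the whole model can be optimized end to end with back-propagation, or either branch can be pretrained on its own modality first, matching the individual-or-joint training described in the abstract.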