SkeletonMAE: Spatial-Temporal Masked Autoencoders for Self-supervised Skeleton Action Recognition

Paper Title

SkeletonMAE: Spatial-Temporal Masked Autoencoders for Self-supervised Skeleton Action Recognition

Authors

Wenhan Wu, Yilei Hua, Ce Zheng, Shiqian Wu, Chen Chen, Aidong Lu

Abstract

Fully supervised skeleton-based action recognition has made great progress with the rise of deep learning techniques. However, these methods require sufficient labeled data, which is not easy to obtain. In contrast, self-supervised skeleton-based action recognition has attracted more attention. By utilizing unlabeled data, more generalizable features can be learned to alleviate the overfitting problem and reduce the demand for massive labeled training data. Inspired by MAE, we propose a spatial-temporal masked autoencoder framework for self-supervised 3D skeleton-based action recognition (SkeletonMAE). Following MAE's masking and reconstruction pipeline, we utilize a skeleton-based encoder-decoder transformer architecture to reconstruct the masked skeleton sequences. A novel masking strategy, named Spatial-Temporal Masking, is introduced at both the joint level and the frame level of the skeleton sequence. This pre-training strategy makes the encoder output generalizable skeleton features with spatial and temporal dependencies. Given the unmasked skeleton sequence, the encoder is fine-tuned for the action recognition task. Extensive experiments show that our SkeletonMAE achieves remarkable performance and outperforms the state-of-the-art methods on both NTU RGB+D and NTU RGB+D 120 datasets.
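
The joint-level and frame-level masking described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the masking ratios, the uniform random selection, and the use of zeros as a placeholder for masked joints are all assumptions made for illustration. The abstract does not state the ratios or how masked positions are represented, and MAE-style models typically drop masked tokens from the encoder input instead of zeroing them.

```python
import numpy as np


def spatial_temporal_mask(seq, joint_ratio=0.4, frame_ratio=0.4, rng=None):
    """Mask a skeleton sequence at the joint level and the frame level.

    seq: array of shape (T, J, C) = (frames, joints, coordinates).
    Returns (masked_seq, mask), where mask has shape (T, J) and True
    marks a masked joint. Ratios are hypothetical, not from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_frames, num_joints, _ = seq.shape
    mask = np.zeros((num_frames, num_joints), dtype=bool)

    # Frame-level (temporal) masking: hide every joint in the chosen frames.
    n_frames = round(num_frames * frame_ratio)
    masked_frames = rng.choice(num_frames, size=n_frames, replace=False)
    mask[masked_frames, :] = True

    # Joint-level (spatial) masking: hide the chosen joints in every frame.
    n_joints = round(num_joints * joint_ratio)
    masked_joints = rng.choice(num_joints, size=n_joints, replace=False)
    mask[:, masked_joints] = True

    # Assumption: masked joints are replaced with zeros as a placeholder.
    masked_seq = seq.copy()
    masked_seq[mask] = 0.0
    return masked_seq, mask


def masked_mse(pred, target, mask):
    """Mean squared reconstruction error over masked joints only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))


# Usage: 10 frames, 25 joints (as in NTU RGB+D skeletons), 3D coordinates.
rng = np.random.default_rng(0)
seq = rng.normal(size=(10, 25, 3))
masked_seq, mask = spatial_temporal_mask(seq, rng=rng)
```

In a pre-training loop under these assumptions, an encoder-decoder model would take `masked_seq` as input, and `masked_mse` would score its reconstruction against `seq` only at the masked positions.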
