Learning Sequential Contexts using Transformer for 3D Hand Pose Estimation

Paper Title

Learning Sequential Contexts using Transformer for 3D Hand Pose Estimation

Authors

Leyla Khaleghi, Joshua Marshall, Ali Etemad

Abstract

3D hand pose estimation (HPE) is the process of locating the joints of the hand in 3D from any visual input. HPE has recently received increased attention due to its key role in a variety of human-computer interaction applications. Recent HPE methods have demonstrated the advantages of employing videos or multi-view images, allowing for more robust HPE systems. Accordingly, in this study, we propose a new method to perform Sequential learning with Transformer for Hand Pose (SeTHPose) estimation. Our SeTHPose pipeline begins by extracting visual embeddings from individual hand images. We then use a transformer encoder to learn the sequential context along time or viewing angles and generate accurate 2D hand joint locations. Then, a graph convolutional neural network with a U-Net configuration is used to convert the 2D hand joint locations to 3D poses. Our experiments show that SeTHPose performs well on both hand sequence varieties, temporal and angular. SeTHPose also outperforms other methods in the field, achieving new state-of-the-art results on two publicly available sequential datasets, STB and MuViHand.
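The three stages named in the abstract (per-image embeddings, a transformer encoder over the sequence, and a graph convolution that lifts 2D joints to 3D) can be illustrated at the level of tensor shapes. The sketch below is not the authors' implementation. It assumes a 21-keypoint hand skeleton, replaces the image backbone with random embeddings, uses one single-head self-attention step in place of a full transformer encoder, and uses one graph convolution in place of the graph U-Net. All weights are random and untrained, and every function name is hypothetical.

```python
import numpy as np

NUM_JOINTS = 21  # assumption: a common 21-keypoint hand model, not stated in the abstract


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(seq, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence of shape (T, D)."""
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # (T, T), each row sums to 1
    return seq + weights @ v, weights   # residual connection around attention


def hand_adjacency():
    """Row-normalized adjacency: wrist (index 0) plus five 4-joint finger chains."""
    a = np.eye(NUM_JOINTS)  # self-loops
    for finger in range(5):
        chain = [0] + [1 + 4 * finger + i for i in range(4)]
        for parent, child in zip(chain[:-1], chain[1:]):
            a[parent, child] = a[child, parent] = 1.0
    return a / a.sum(axis=1, keepdims=True)


def graph_conv(x, adj, w):
    """One graph convolution: average over neighbors, then project. x: (T, J, C_in)."""
    return adj @ x @ w


def sketch_pipeline(embeddings, rng):
    """Map (T, D) frame or view embeddings to 2D and 3D joints with random weights."""
    t, d = embeddings.shape
    w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    context, attn = self_attention(embeddings, w_q, w_k, w_v)

    # Linear head: one (x, y) pair per joint for every sequence element.
    w_2d = rng.normal(scale=d ** -0.5, size=(d, NUM_JOINTS * 2))
    joints_2d = (context @ w_2d).reshape(t, NUM_JOINTS, 2)

    # Stand-in for the graph U-Net: a single graph convolution from 2 to 3 channels.
    w_3d = rng.normal(size=(2, 3))
    joints_3d = graph_conv(joints_2d, hand_adjacency(), w_3d)
    return joints_2d, joints_3d, attn


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 64))  # 8 frames or views, 64-dim stand-in features
joints_2d, joints_3d, attn = sketch_pipeline(embeddings, rng)
print(joints_2d.shape, joints_3d.shape)
```

The same code handles a temporal sequence or a multi-view sequence, because attention only sees an ordered set of embeddings. That matches the abstract's claim that the context is learned "along time or viewing angles", although a real model would also need positional encoding and trained weights.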
