Paper Title
Exploring Severe Occlusion: Multi-Person 3D Pose Estimation with Gated Convolution
Paper Authors
Paper Abstract
3D human pose estimation (HPE) is crucial in many fields, such as human behavior analysis, augmented reality/virtual reality (AR/VR) applications, and the autonomous driving industry. Videos containing multiple, potentially occluded people captured by freely moving monocular cameras are very common in real-world scenarios, yet 3D HPE for such scenarios is quite challenging, partly because existing datasets lack such data with accurate 3D ground-truth labels. In this paper, we propose a temporal regression network with a gated convolution module that lifts 2D joints to 3D while simultaneously recovering the missing occluded joints. A simple yet effective localization approach is further applied to transform the normalized pose into a global trajectory. To verify the effectiveness of our approach, we also collect a new moving-camera multi-human (MMHuman) dataset that includes multiple people with heavy occlusion captured by moving cameras. The 3D ground-truth joints are provided by an accurate motion capture (MoCap) system. From experiments on the static-camera Human3.6M data and our own moving-camera data, we show that our proposed method outperforms most state-of-the-art 2D-to-3D pose estimation methods, especially in scenarios with heavy occlusion.
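The abstract does not spell out the implementation of the gated convolution module. As a rough illustration of the gating idea it refers to, here is a minimal PyTorch-style sketch of a gated temporal 1D convolution, where a sigmoid gate branch learns to down-weight unreliable inputs such as occluded (zero-filled) 2D joints; the class name `GatedTemporalConv` and all hyperparameters are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class GatedTemporalConv(nn.Module):
    """Hypothetical sketch of a gated temporal convolution block:
    a feature branch is modulated element-wise by a learned sigmoid
    gate, letting the network suppress frames/joints whose 2D input
    is missing (e.g. occluded joints filled with zeros)."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # Feature branch: ordinary 1D convolution over the time axis.
        self.feature = nn.Conv1d(in_channels, out_channels,
                                 kernel_size, padding=padding)
        # Gate branch: same shape, squashed to (0, 1) by a sigmoid.
        self.gate = nn.Conv1d(in_channels, out_channels,
                              kernel_size, padding=padding)

    def forward(self, x):
        # x: (batch, channels, frames), e.g. channels = 2 * num_joints
        # for flattened per-frame 2D joint coordinates.
        return torch.tanh(self.feature(x)) * torch.sigmoid(self.gate(x))

# Usage sketch: lift a sequence of 17 2D joints (34 channels) to a
# 256-dim hidden representation over an 81-frame clip.
block = GatedTemporalConv(in_channels=34, out_channels=256)
pose_2d = torch.randn(8, 34, 81)   # batch of 8 clips
hidden = block(pose_2d)            # -> (8, 256, 81)
```

A downstream regression head (not sketched here) would then map such hidden features to per-frame 3D joint positions, consistent with the 2D-to-3D lifting pipeline the abstract describes.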