InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction

Paper Title

InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction

Authors

Huang, Yinghao; Taheri, Omid; Black, Michael J.; Tzionas, Dimitrios

Abstract

Humans constantly interact with daily objects to accomplish tasks. To understand such interactions, computers need to reconstruct them from cameras observing whole-body interaction with scenes. This is challenging due to occlusion between the body and objects, motion blur, depth/scale ambiguities, and the low image resolution of hands and graspable object parts. To make the problem tractable, the community focuses either on interacting hands, ignoring the body, or on interacting bodies, ignoring hands. The GRAB dataset addresses dexterous whole-body interaction but uses marker-based MoCap and lacks images, while BEHAVE captures video of body-object interaction but lacks hand detail. We address the limitations of prior work with InterCap, a novel method that reconstructs interacting whole bodies and objects from multi-view RGB-D data, using the parametric whole-body model SMPL-X and known object meshes. To tackle the above challenges, InterCap uses two key observations: (i) Contact between the hand and object can be used to improve the pose estimation of both. (ii) Azure Kinect sensors allow us to set up a simple multi-view RGB-D capture system that minimizes the effect of occlusion while providing reasonable inter-camera synchronization. With this method we capture the InterCap dataset, which contains 10 subjects (5 males and 5 females) interacting with 10 objects of various sizes and affordances, including contact with the hands or feet. In total, InterCap has 223 RGB-D videos, resulting in 67,357 multi-view frames, each containing 6 RGB-D images. Our method provides pseudo ground-truth body meshes and objects for each video frame. Our InterCap method and dataset fill an important gap in the literature and support many research directions. Our data and code are available for research purposes.
