Paper Title

FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations

Authors

Christian Diller, Thomas Funkhouser, Angela Dai

Abstract

We present a generative approach to forecast long-term future human behavior in 3D, requiring only weak supervision from readily available 2D human action data. This is a fundamental task enabling many downstream applications. The required ground-truth data is hard to capture in 3D (mocap suits, expensive setups) but easy to acquire in 2D (simple RGB cameras). Thus, we design our method to only require 2D RGB data at inference time while being able to generate 3D human motion sequences. We use a differentiable 2D projection scheme in an autoregressive manner for weak supervision, and an adversarial loss for 3D regularization. Our method predicts long and complex human behavior sequences (e.g., cooking, assembly) consisting of multiple sub-actions. We tackle this in a semantically hierarchical manner, jointly predicting high-level coarse action labels together with their low-level fine-grained realizations as characteristic 3D human poses. We observe that these two action representations are coupled in nature, and joint prediction benefits both action and pose forecasting. Our experiments demonstrate the complementary nature of joint action and 3D pose prediction: our joint approach outperforms each task treated individually, enables robust longer-term sequence prediction, and improves over alternative approaches to forecast actions and characteristic 3D poses.
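
The weak-supervision idea described above — predicting 3D poses while supervising them only through their 2D projections — can be illustrated with a small sketch. The following is an illustrative assumption-laden sketch, not the paper's implementation: it assumes camera-space 3D joints, known pinhole intrinsics K, and PyTorch; the function names project_to_2d and weak_2d_reprojection_loss are hypothetical.

import torch
import torch.nn.functional as F

def project_to_2d(joints_3d, K):
    # joints_3d: (B, J, 3) camera-space 3D joints; K: (3, 3) pinhole intrinsics.
    # Differentiable perspective projection: gradients from a 2D keypoint loss
    # flow back into the predicted 3D poses.
    proj = torch.einsum('ij,bkj->bki', K, joints_3d)        # (B, J, 3)
    return proj[..., :2] / proj[..., 2:].clamp(min=1e-6)    # (B, J, 2)

def weak_2d_reprojection_loss(pred_joints_3d, gt_joints_2d, K):
    # Weak supervision: only 2D keypoint annotations (from RGB video) are needed.
    return F.l1_loss(project_to_2d(pred_joints_3d, K), gt_joints_2d)

Per the abstract, the method applies this kind of 2D projection loss autoregressively over the predicted sequence and adds an adversarial loss to regularize the 3D poses; neither of those components is shown in this sketch.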
