Paper Title

Reconstructing Action-Conditioned Human-Object Interactions Using Commonsense Knowledge Priors

Paper Authors

Xi Wang, Gen Li, Yen-Ling Kuo, Muhammed Kocabas, Emre Aksan, Otmar Hilliges

Abstract

We present a method for inferring diverse 3D models of human-object interactions from images. Reasoning about how humans interact with objects in complex scenes from a single 2D image is a challenging task given the ambiguities arising from the loss of information through projection. In addition, modeling 3D interactions requires generalization across diverse object categories and interaction types. We propose an action-conditioned modeling of interactions that allows us to infer diverse 3D arrangements of humans and objects without supervision on contact regions or 3D scene geometry. Our method extracts high-level commonsense knowledge from large language models (such as GPT-3) and applies it to perform 3D reasoning of human-object interactions. Our key insight is that priors extracted from large language models can help in reasoning about human-object contacts from textual prompts alone. We quantitatively evaluate the inferred 3D models on a large human-object interaction dataset and show how our method leads to better 3D reconstructions. We further qualitatively evaluate the effectiveness of our method on real images and demonstrate its generalizability across interaction types and object categories.
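The abstract describes querying a large language model with textual prompts to obtain commonsense priors on which body parts contact an object during a given action. Below is a minimal illustrative sketch of that idea, not the authors' actual pipeline: the `query_llm` callable is a hypothetical stand-in for any completion API (e.g., GPT-3), and the prompt wording, body-part vocabulary, and parsing are assumptions made for illustration only.

```python
# Illustrative sketch: extracting a commonsense contact prior from an LLM
# via a textual prompt. `query_llm` is a hypothetical stand-in for any
# text-completion backend; replace it with a real API call of your choice.

from typing import Callable, List

BODY_PARTS = ["hands", "feet", "hips", "back", "thighs", "forearms"]

def build_contact_prompt(action: str, obj: str) -> str:
    """Compose a textual prompt asking which body parts touch the object."""
    return (
        f"A person is {action} a {obj}. "
        f"Which of the following body parts are in contact with the {obj}? "
        f"Options: {', '.join(BODY_PARTS)}. Answer with a comma-separated list."
    )

def parse_contact_parts(llm_answer: str) -> List[str]:
    """Keep only answers that match the known body-part vocabulary."""
    mentioned = {p.strip().lower() for p in llm_answer.split(",")}
    return [p for p in BODY_PARTS if p in mentioned]

def contact_prior(action: str, obj: str, query_llm: Callable[[str], str]) -> List[str]:
    """Return the body parts the LLM predicts to be in contact; such a prior
    could then constrain the inferred 3D human-object arrangement."""
    answer = query_llm(build_contact_prompt(action, obj))
    return parse_contact_parts(answer)

if __name__ == "__main__":
    # A fake LLM backend so the sketch runs without network access.
    fake_llm = lambda prompt: "hips, thighs"
    print(contact_prior("sitting on", "chair", fake_llm))  # ['hips', 'thighs']
```

In this sketch the returned body-part list plays the role of the contact prior: downstream 3D fitting could, for example, encourage the listed parts to lie close to the object surface while leaving the rest unconstrained.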
