Permutation-Invariant Relational Network for Multi-person 3D Pose Estimation

Paper Title

Permutation-Invariant Relational Network for Multi-person 3D Pose Estimation

Authors

Nicolas Ugrinovic, Adria Ruiz, Antonio Agudo, Alberto Sanfeliu, Francesc Moreno-Noguer

Abstract

The recovery of multi-person 3D poses from a single RGB image is a severely ill-conditioned problem due to the inherent 2D-3D depth ambiguity, inter-person occlusions, and body truncations. To tackle these issues, recent works have shown promising results by reasoning about multiple people simultaneously. However, in most cases this is done by considering only pairwise person interactions, which hinders a holistic scene representation able to capture long-range interactions. Some approaches address this by jointly processing all people in the scene, but they require defining one individual as a reference and fixing a pre-defined person ordering, and they are sensitive to this choice. In this paper, we overcome both of these limitations and propose an approach for multi-person 3D pose estimation that captures long-range interactions independently of the input order. For this purpose, we build a residual-like permutation-invariant network that successfully refines potentially corrupted initial 3D poses estimated by an off-the-shelf detector. The residual function is learned via Set Transformer blocks, which model the interactions among all initial poses regardless of their ordering or number. A thorough evaluation demonstrates that our approach boosts the performance of the initially estimated 3D poses by large margins, achieving state-of-the-art results on standardized benchmarks. Additionally, the proposed module is computationally efficient and can potentially be used as a drop-in complement for any 3D pose detector in multi-person scenes.
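
To make the idea of an order-independent residual refinement more concrete, the following is a minimal sketch, not the authors' architecture. It assumes a single untrained self-attention layer over the set of people (no multi-head attention, layer normalization, or inducing points as in a full Set Transformer), and every name in it (`init_params`, `refine_poses`, the weight keys, the hidden size) is hypothetical. It only illustrates that a correction computed by attending over all initial poses can be added back to those poses, that the result does not depend on the order in which people are listed, and that the same weights handle any number of people.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(pose_dim, hidden_dim, seed=0):
    # Hypothetical, randomly initialised (untrained) weights for illustration.
    rng = np.random.default_rng(seed)
    shapes = {
        "W_in": (pose_dim, hidden_dim),
        "W_q": (hidden_dim, hidden_dim),
        "W_k": (hidden_dim, hidden_dim),
        "W_v": (hidden_dim, hidden_dim),
        "W_out": (hidden_dim, pose_dim),
    }
    return {name: rng.normal(0.0, 0.1, shape) for name, shape in shapes.items()}


def refine_poses(poses, params):
    """Refine initial 3D poses with a residual self-attention step.

    poses: array of shape (N, J, 3), N people with J joints each.
    Returns an array of the same shape: initial poses plus a learned-style
    correction computed from all people in the scene.
    """
    n, j, _ = poses.shape
    x = poses.reshape(n, j * 3)                 # one flat vector per person
    h = np.tanh(x @ params["W_in"])             # per-person embedding
    q = h @ params["W_q"]
    k = h @ params["W_k"]
    v = h @ params["W_v"]
    # Each person attends to every person in the scene, not only to pairs.
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)   # (N, N)
    h = h + attn @ v                            # residual inside the block
    delta = h @ params["W_out"]                 # per-person pose correction
    return poses + delta.reshape(n, j, 3)       # residual over the initial poses


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    params = init_params(pose_dim=15 * 3, hidden_dim=32)
    poses = rng.normal(size=(4, 15, 3))         # stand-in for detector output
    refined = refine_poses(poses, params)
    perm = np.array([2, 0, 3, 1])
    # Reordering the people only reorders the refined poses.
    print(np.allclose(refine_poses(poses[perm], params), refined[perm]))
```

In this sketch, shuffling the input people shuffles the refined poses in the same way, so each person's refined pose is unaffected by the ordering, and no reference person has to be chosen. The joint count of 15 is an arbitrary placeholder, and the random arrays stand in for the output of an off-the-shelf detector.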
