Paper Title
3D Human Pose Estimation in Multi-View Operating Room Videos Using Differentiable Camera Projections
Paper Authors
Paper Abstract
3D human pose estimation in multi-view operating room (OR) videos is a relevant asset for person tracking and action recognition. However, the surgical environment makes pose estimation challenging due to sterile clothing, frequent occlusions, and limited public data. Methods specifically designed for the OR are generally based on the fusion of poses detected in multiple camera views. Typically, a 2D pose estimator such as a convolutional neural network (CNN) detects joint locations. Then, the detected joint locations are projected to 3D and fused over all camera views. However, accurate detection in 2D does not guarantee accurate localisation in 3D space. In this work, we propose to directly optimise for localisation in 3D by training 2D CNNs end-to-end based on a 3D loss that is backpropagated through each camera's projection parameters. Using videos from the MVOR dataset, we show that this end-to-end approach outperforms optimisation in 2D space.
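The core idea of the abstract, a 3D loss whose gradients reach the 2D CNN through the camera projection parameters, can be illustrated with differentiable triangulation. Below is a minimal PyTorch sketch, not the authors' exact implementation: the names `preds_2d`, `proj_mats`, and `gt_3d` are hypothetical, and DLT triangulation via SVD is one possible differentiable fusion; the paper's fusion scheme may differ.

```python
# Hedged sketch of a 3D loss backpropagated through camera projections.
# Assumed (hypothetical) inputs:
#   preds_2d:  (V, J, 2) per-view 2D joint predictions from a 2D pose CNN
#   proj_mats: (V, 3, 4) camera projection matrices for V views
#   gt_3d:     (J, 3)    ground-truth 3D joint positions
import torch

def triangulate_dlt(preds_2d: torch.Tensor, proj_mats: torch.Tensor) -> torch.Tensor:
    """Differentiable DLT triangulation: fuse 2D detections from all views
    into 3D points. Gradients flow back to preds_2d through the SVD."""
    V, J, _ = preds_2d.shape
    points_3d = []
    for j in range(J):
        rows = []
        for v in range(V):
            u, w = preds_2d[v, j]          # pixel coordinates (u, v)
            P = proj_mats[v]               # 3x4 projection matrix
            rows.append(u * P[2] - P[0])   # DLT constraint from x-coordinate
            rows.append(w * P[2] - P[1])   # DLT constraint from y-coordinate
        A = torch.stack(rows)              # (2V, 4) homogeneous linear system
        # Least-squares solution is the right-singular vector of the smallest
        # singular value; torch.linalg.svd supports autograd.
        _, _, Vh = torch.linalg.svd(A)
        X = Vh[-1]
        points_3d.append(X[:3] / X[3])     # dehomogenise
    return torch.stack(points_3d)          # (J, 3)

def loss_3d(preds_2d: torch.Tensor, proj_mats: torch.Tensor, gt_3d: torch.Tensor) -> torch.Tensor:
    """Mean per-joint 3D position error; backpropagation reaches the 2D CNN
    because triangulation is differentiable in the 2D predictions."""
    est_3d = triangulate_dlt(preds_2d, proj_mats)
    return (est_3d - gt_3d).norm(dim=-1).mean()
```

Because the loss is computed in 3D rather than on 2D heatmaps, the network is pushed to trade small 2D errors across views against each other so that the fused 3D estimate improves, which is the optimisation target the abstract describes.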