Paper Title


MonoIndoor++:Towards Better Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments

Authors

Runze Li, Pan Ji, Yi Xu, Bir Bhanu

Abstract


Self-supervised monocular depth estimation has seen significant progress in recent years, especially in outdoor environments. However, depth prediction results are not satisfactory in indoor scenes, where most of the existing data are captured with hand-held devices. Compared to outdoor environments, estimating depth from monocular videos of indoor environments with self-supervised methods poses two additional challenges: (i) the depth range of indoor video sequences varies greatly across frames, making it difficult for the depth network to induce consistent depth cues for training; (ii) indoor sequences recorded with hand-held devices often contain far more rotational motion, which makes it difficult for the pose network to predict accurate relative camera poses. In this work, we propose a novel framework, MonoIndoor++, which gives special consideration to these challenges and consolidates a set of good practices for improving the performance of self-supervised monocular depth estimation in indoor environments. First, a depth factorization module with a transformer-based scale regression network is proposed to explicitly estimate a global depth scale factor; the predicted scale factor indicates the maximum depth value. Second, rather than using a single-stage pose estimation strategy as in previous methods, we propose a residual pose estimation module that estimates relative camera poses across consecutive frames iteratively. Third, to incorporate extensive coordinate guidance into our residual pose estimation module, we propose to perform coordinate convolutional encoding directly over the inputs to the pose networks. The proposed method is validated on a variety of benchmark indoor datasets, i.e., EuRoC MAV, NYUv2, ScanNet, and 7-Scenes, demonstrating state-of-the-art performance.
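The coordinate convolutional encoding mentioned in the abstract can be illustrated by appending normalized pixel-coordinate channels to the pose-network input, in the spirit of CoordConv-style layers. The sketch below is a minimal, framework-free illustration under assumptions: the function name `add_coord_channels` and the channels-first nested-list layout are hypothetical, not the authors' actual implementation.

```python
def add_coord_channels(image):
    """Append two normalized coordinate channels to an image.

    `image` is a channels-first nested list: a list of channels,
    each channel a list of rows of pixel values. Two extra channels
    are appended, holding the x and y pixel coordinates rescaled to
    [-1, 1], so a downstream convolution can condition on location.
    """
    height = len(image[0])
    width = len(image[0][0])
    # x-channel: varies along columns, constant along rows.
    x_chan = [[2.0 * x / (width - 1) - 1.0 for x in range(width)]
              for _ in range(height)]
    # y-channel: varies along rows, constant along columns.
    y_chan = [[2.0 * y / (height - 1) - 1.0 for _ in range(width)]
              for y in range(height)]
    return image + [x_chan, y_chan]
```

In a real pipeline the same idea is typically realized with tensor operations (e.g., building the coordinate grids once and concatenating them along the channel dimension before the first convolution of the pose network).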
