Paper Title
M-FUSE: Multi-frame Fusion for Scene Flow Estimation
Paper Authors
Paper Abstract
Recently, neural networks for scene flow estimation have shown impressive results on automotive data such as the KITTI benchmark. However, despite using sophisticated rigidity assumptions and parametrizations, such networks are typically limited to only two frame pairs, which prevents them from exploiting temporal information. In our paper, we address this shortcoming by proposing a novel multi-frame approach that considers an additional preceding stereo pair. To this end, we proceed in two steps: Firstly, building upon the recent RAFT-3D approach, we develop an improved two-frame baseline by incorporating an advanced stereo method. Secondly, and even more importantly, exploiting the specific modeling concepts of RAFT-3D, we propose a U-Net architecture that performs a fusion of forward and backward flow estimates and hence allows temporal information to be integrated on demand. Experiments on the KITTI benchmark not only show that the advantages of the improved baseline and the temporal fusion approach complement each other, they also demonstrate that the computed scene flow is highly accurate. More precisely, our approach ranks second overall and first for the even more challenging foreground objects, in total outperforming the original RAFT-3D method by more than 16%. Code is available at https://github.com/cv-stuttgart/M-FUSE.
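To make the fusion idea more concrete, the sketch below shows a minimal U-Net-style module that takes per-pixel features of a forward estimate and of a temporally propagated backward estimate and predicts a fused output. It is only an illustrative sketch under assumed inputs and channel sizes; the class name FusionUNet and all dimensions are hypothetical and do not correspond to the released M-FUSE architecture (see the linked repository for the actual code).

import torch
import torch.nn as nn

class FusionUNet(nn.Module):
    # Minimal one-level U-Net sketch: encode the concatenated forward/backward
    # features, downsample once, upsample, and predict a fused per-pixel output.
    def __init__(self, in_ch=16, base_ch=32, out_ch=8):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(base_ch, 2 * base_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.up = nn.ConvTranspose2d(2 * base_ch, base_ch, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(2 * base_ch, base_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(base_ch, out_ch, 1)  # e.g. per-pixel fused correction or weights

    def forward(self, fwd_feat, bwd_feat):
        # Concatenating both estimates lets the network decide, per pixel,
        # how much temporal (backward) information to integrate.
        x = torch.cat([fwd_feat, bwd_feat], dim=1)  # channels must sum to in_ch
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d = self.dec(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        return self.head(d)

# Usage example with hypothetical shapes (8 feature channels per estimate):
fwd = torch.randn(1, 8, 64, 128)
bwd = torch.randn(1, 8, 64, 128)
fused = FusionUNet()(fwd, bwd)  # -> (1, 8, 64, 128)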