Paper Title
Event-based Monocular Dense Depth Estimation with Recurrent Transformers
Paper Authors
Paper Abstract
Event cameras, offering high temporal resolution and high dynamic range, bring a new perspective to common challenges in monocular depth estimation (e.g., motion blur and low light). However, how to effectively exploit the sparse spatial information and rich temporal cues of asynchronous events remains a challenging endeavor. To this end, we propose EReFormer, a novel event-based monocular depth estimator with recurrent transformers, which is the first pure transformer architecture with a recursive mechanism for processing continuous event streams. Technically, for spatial modeling, we present a transformer-based encoder-decoder with a spatial transformer fusion module, which models global context better than CNN-based methods. For temporal modeling, we design a gated recurrent vision transformer unit that introduces a recursive mechanism into transformers, improving temporal modeling while alleviating the high GPU memory cost. Experimental results show that EReFormer outperforms state-of-the-art methods on both synthetic and real-world datasets. We hope our work will attract further research on transformers in the event-based vision community. Our open-source code can be found in the supplemental material.
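To illustrate the idea of introducing a GRU-style recursive mechanism into a transformer, below is a minimal PyTorch sketch. The module name, gate design, and shapes are assumptions made for illustration only, not the EReFormer reference implementation: a hidden state carried across event slices is blended with the current slice's transformer features through update and reset gates.

```python
import torch
import torch.nn as nn


class GatedRecurrentTransformerUnit(nn.Module):
    """Minimal sketch: GRU-style recurrence over transformer token features.

    Hypothetical illustration only; names, shapes, and the attention block
    are assumptions, not the authors' implementation.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Self-attention block producing features for the current event slice.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # GRU-style gates computed from [current features, previous hidden state].
        self.update_gate = nn.Linear(2 * dim, dim)
        self.reset_gate = nn.Linear(2 * dim, dim)
        self.candidate = nn.Linear(2 * dim, dim)

    def forward(self, tokens: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # tokens, hidden: (batch, num_tokens, dim)
        x = self.norm(tokens)
        x, _ = self.attn(x, x, x)  # transformer features of the current slice
        z = torch.sigmoid(self.update_gate(torch.cat([x, hidden], dim=-1)))
        r = torch.sigmoid(self.reset_gate(torch.cat([x, hidden], dim=-1)))
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * hidden], dim=-1)))
        # The blended hidden state is reused for the next event slice,
        # giving the recurrence over the continuous event stream.
        return (1 - z) * hidden + z * h_tilde


if __name__ == "__main__":
    # Usage: iterate over an event stream split into consecutive slices.
    unit = GatedRecurrentTransformerUnit(dim=64)
    hidden = torch.zeros(1, 196, 64)          # initial hidden state
    for _ in range(5):                        # 5 consecutive event slices
        tokens = torch.randn(1, 196, 64)      # tokenized event features
        hidden = unit(tokens, hidden)
    print(hidden.shape)                       # torch.Size([1, 196, 64])
```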