Paper Title
Pay Self-Attention to Audio-Visual Navigation
Paper Authors
Paper Abstract
Audio-visual embodied navigation, a popular research topic, aims to train a robot to reach an audio target using egocentric visual input (from sensors mounted on the robot) and audio input (emitted from the target). The audio-visual information fusion strategy is naturally important to navigation performance, yet state-of-the-art methods still simply concatenate the visual and audio features, potentially ignoring the direct impact of context. Moreover, existing approaches require either phase-wise training or additional aids (e.g., topology graphs and sound semantics). To date, work that addresses the more challenging setup with moving target(s) remains rare. We therefore propose an end-to-end framework, FSAAVN (feature self-attention audio-visual navigation), that learns to chase a moving audio target using a context-aware audio-visual fusion strategy implemented as a self-attention module. Our thorough experiments validate the superior performance (both quantitative and qualitative) of FSAAVN over the state of the art, and also provide unique insights into the choice of visual modalities, visual/audio encoder backbones, and fusion patterns.
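To make the fusion idea concrete, below is a minimal sketch of the kind of context-aware audio-visual fusion the abstract describes: instead of concatenating the two modality embeddings, they are treated as a short token sequence and fused with self-attention, so each modality's feature is re-weighted in the context of the other. The module name, feature dimensions, head count, and mean-pooling readout are illustrative assumptions, not the authors' actual FSAAVN architecture.

```python
# Hypothetical sketch of self-attention-based audio-visual fusion
# (dimensions, pooling, and naming are assumptions for illustration).
import torch
import torch.nn as nn


class SelfAttentionFusion(nn.Module):
    """Fuse per-step visual and audio embeddings via self-attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual, audio: (batch, dim) embeddings from the modality encoders.
        tokens = torch.stack([visual, audio], dim=1)     # (batch, 2, dim)
        attended, _ = self.attn(tokens, tokens, tokens)  # each token attends to both modalities
        tokens = self.norm(tokens + attended)            # residual connection + layer norm
        return tokens.mean(dim=1)                        # (batch, dim) fused feature for the policy


if __name__ == "__main__":
    fusion = SelfAttentionFusion()
    v = torch.randn(4, 512)    # e.g., output of a visual CNN encoder
    a = torch.randn(4, 512)    # e.g., output of an audio spectrogram encoder
    print(fusion(v, a).shape)  # torch.Size([4, 512])
```

Compared with plain concatenation, the attention weights here let the fused feature emphasize whichever modality is more informative at the current step, which is one plausible reading of the "context-aware" fusion claimed in the abstract.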