Towards Generalisable Audio Representations for Audio-Visual Navigation

Paper Title

Towards Generalisable Audio Representations for Audio-Visual Navigation

Authors

Shunqi Mao, Chaoyi Zhang, Heng Wang, Weidong Cai

Abstract

In audio-visual navigation (AVN), an intelligent agent must navigate to a continuously sounding object in a complex 3D environment using its audio and visual perception. While existing methods try to improve navigation performance through carefully designed path planning or intricate task settings, none has improved model generalisation to unheard sounds with the task settings unchanged. We therefore propose a contrastive learning-based method that tackles this challenge by regularising the audio encoder, so that sound-agnostic, goal-driven latent representations can be learnt from audio signals of different classes. In addition, we consider two data augmentation strategies to enrich the training sounds. We demonstrate that our designs can be easily integrated into existing AVN frameworks to obtain an immediate performance gain (13.4%$\uparrow$ in SPL on Replica and 12.2%$\uparrow$ in SPL on MP3D). Our project is available at https://AV-GeN.github.io/.
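
The abstract reports its gains in SPL without defining the metric. As a rough illustration, the sketch below computes SPL (Success weighted by Path Length) using the definition commonly used in embodied navigation benchmarks: the mean over episodes of `S_i * l_i / max(p_i, l_i)`, where `S_i` is the success flag, `l_i` the shortest-path length, and `p_i` the length of the path the agent actually took. Assuming this is the definition used in the paper is our assumption, and the function name `spl` and its parameter names are hypothetical, not taken from the authors' code.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Illustrative helper (hypothetical name and signature), assuming the
    commonly used definition: mean of S_i * l_i / max(p_i, l_i).
    """
    n = len(successes)
    if n == 0:
        raise ValueError("at least one episode is required")
    if len(shortest_lengths) != n or len(path_lengths) != n:
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if not success:
            continue  # failed episodes contribute 0
        denom = max(taken, shortest)
        # If both lengths are 0 the agent started at the goal; count it as 1.
        total += shortest / denom if denom > 0 else 1.0
    return total / n


if __name__ == "__main__":
    # Three episodes: optimal success, success with a 2x detour, failure.
    print(spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0]))
```

Under this definition a higher SPL means the agent both reaches the goal more often and takes paths closer to the shortest one, so SPL rewards efficiency as well as success.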
