Paper Title
Estimating Visual Information From Audio Through Manifold Learning
Paper Authors
Paper Abstract
We propose a new framework for extracting visual information about a scene using only audio signals. Audio-based methods can overcome some of the limitations of vision-based methods, i.e., they do not require "line-of-sight", are robust to occlusions and changes in illumination, and can function as a backup in case vision/lidar sensors fail. Audio-based methods can therefore be useful even for applications in which only visual information is of interest. Our framework is based on Manifold Learning and consists of two steps. First, we train a Vector-Quantized Variational Auto-Encoder (VQ-VAE) to learn the data manifold of the particular visual modality we are interested in. Second, we train an Audio Transformation network to map multi-channel audio signals to the latent representation of the corresponding visual sample. We show that our method is able to produce meaningful images from audio using a publicly available audio/visual dataset. In particular, we consider the prediction of the following visual modalities from audio: depth and semantic segmentation. We hope the findings of our work can facilitate further research in visual information extraction from audio. Code is available at: https://github.com/ubc-vision/audio_manifold.
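To make the two-step pipeline described in the abstract concrete, below is a minimal PyTorch sketch. All module names (VectorQuantizer, VQVAE, AudioTransformNet), layer configurations, and hyperparameters are illustrative assumptions, not the authors' implementation; the actual code is at https://github.com/ubc-vision/audio_manifold.

```python
# Illustrative sketch of the two-step framework. Step 1 learns a discrete
# latent manifold of the visual modality (e.g. depth maps); Step 2 maps
# multi-channel audio to the latent codes of the paired visual sample.
# Shapes and hyperparameters below are assumptions for exposition only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Snaps encoder features to their nearest codebook entries (VQ-VAE)."""
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                                  # z: (B, D, H, W)
        b, d, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, d)        # (B*H*W, D)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        zq = self.codebook(idx).view(b, h, w, d).permute(0, 3, 1, 2)
        # Straight-through estimator: gradients bypass the quantization.
        return z + (zq - z).detach(), idx.view(b, h, w)

class VQVAE(nn.Module):
    """Step 1: learn the data manifold of one visual modality."""
    def __init__(self, in_ch=1, code_dim=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, code_dim, 4, 2, 1))              # downsample 4x
        self.vq = VectorQuantizer(code_dim=code_dim)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(code_dim, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, in_ch, 4, 2, 1))        # upsample 4x

    def forward(self, x):
        z = self.enc(x)
        zq, idx = self.vq(z)
        return self.dec(zq), z, zq, idx

class AudioTransformNet(nn.Module):
    """Step 2: map multi-channel audio (e.g. spectrograms) to per-cell
    logits over the VQ-VAE codebook. latent_hw must match the spatial
    resolution of the VQ-VAE latent grid."""
    def __init__(self, audio_ch=8, num_codes=512, latent_hw=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(audio_ch, 64, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(latent_hw),
            nn.Conv2d(128, num_codes, 1))                  # code logits

    def forward(self, audio):                              # (B, C, F, T)
        return self.net(audio)                             # (B, K, H, W)

def step2_loss(audio_net, vqvae, audio, image):
    """Freeze the trained VQ-VAE and supervise the audio network with
    the code indices of the paired visual sample."""
    with torch.no_grad():
        _, _, _, idx = vqvae(image)                        # (B, H, W)
    logits = audio_net(audio)                              # (B, K, H, W)
    return F.cross_entropy(logits, idx)
```

At inference time, under these assumptions, the predicted code indices (the argmax of the audio network's logits) would be looked up in the frozen codebook and passed through the VQ-VAE decoder to produce the estimated depth or segmentation image.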