Paper Title

Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

Paper Authors

Hao Jiang, Calvin Murdock, Vamsi Krishna Ithapu

Paper Abstract

Augmented reality devices have the potential to enhance human perception and enable other assistive functionalities in complex conversational environments. Effectively capturing the audio-visual context necessary for understanding these social interactions first requires detecting and localizing the voice activities of the device wearer and the surrounding people. These tasks are challenging due to their egocentric nature: the wearer's head motion may cause motion blur, surrounding people may appear in difficult viewing angles, and there may be occlusions, visual clutter, audio noise, and bad lighting. Under these conditions, previous state-of-the-art active speaker detection methods do not give satisfactory results. Instead, we tackle the problem from a new setting using both video and multi-channel microphone array audio. We propose a novel end-to-end deep learning approach that is able to give robust voice activity detection and localization results. In contrast to previous methods, our method localizes active speakers from all possible directions on the sphere, even outside the camera's field of view, while simultaneously detecting the device wearer's own voice activity. Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.
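
Only the abstract is reproduced here, so the sketch below is an illustrative guess at the kind of model it describes, not the authors' architecture: a network that fuses multi-channel microphone-array audio with an egocentric video frame and jointly predicts an activity map over a discretized sphere (covering directions outside the camera's field of view) plus the wearer's own voice activity. All names, layer sizes, and parameters (AVSpeakerLocalizer, n_mics, az_bins, el_bins) are hypothetical.

```python
import torch
import torch.nn as nn


class AVSpeakerLocalizer(nn.Module):
    """Hypothetical sketch of a multi-channel audio-visual localizer."""

    def __init__(self, n_mics: int = 6, az_bins: int = 72, el_bins: int = 36):
        super().__init__()
        # Audio branch: the n_mics channels of a log-mel spectrogram are fed
        # as input channels so convolutions can see inter-channel level cues.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(n_mics, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Visual branch: a small CNN over one egocentric RGB frame.
        self.video_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        fused = 64 + 64
        # Head 1: speaker-activity logits on an azimuth x elevation grid,
        # i.e. all directions on the sphere, not just the camera FOV.
        self.sphere_head = nn.Linear(fused, az_bins * el_bins)
        # Head 2: single logit for the device wearer's own voice activity.
        self.wearer_head = nn.Linear(fused, 1)
        self.az_bins, self.el_bins = az_bins, el_bins

    def forward(self, audio: torch.Tensor, frame: torch.Tensor):
        # audio: (B, n_mics, mel, time); frame: (B, 3, H, W)
        feat = torch.cat([self.audio_enc(audio), self.video_enc(frame)], dim=1)
        sphere = self.sphere_head(feat).view(-1, self.el_bins, self.az_bins)
        wearer = self.wearer_head(feat).squeeze(1)
        return sphere, wearer  # raw logits; apply sigmoid for probabilities


# Usage with dummy inputs (shapes are arbitrary assumptions):
model = AVSpeakerLocalizer()
sphere, wearer = model(torch.randn(2, 6, 64, 100), torch.randn(2, 3, 128, 128))
print(sphere.shape, wearer.shape)  # torch.Size([2, 36, 72]) torch.Size([2])
```

Framing localization as dense classification over a spherical grid, as in this sketch, is one common way to output directions beyond the image plane; the paper's actual network, losses, and grid resolution may differ.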
