Paper Title

Audio-video fusion strategies for active speaker detection in meetings

Authors

Lionel Pibre, Francisco Madrigal, Cyrille Equoy, Frédéric Lerasle, Thomas Pellegrini, Julien Pinquier, Isabelle Ferrané

Abstract

Meetings are a common activity in professional contexts, and it remains challenging to endow vocal assistants with advanced functionalities to facilitate meeting management. In this context, a task like active speaker detection can provide useful insights to model the interaction between meeting participants. Motivated by our application context, an advanced meeting assistant, we want to combine audio and visual information to achieve the best possible performance. In this paper, we propose two different types of fusion for detecting the active speaker, combining two visual modalities and an audio modality through neural networks. For comparison purposes, classical unsupervised approaches for audio feature extraction are also used. We expect visual data centered on the face of each participant to be very appropriate for detecting voice activity, based on the detection of lip and facial gestures. Our baseline system therefore uses visual data, and we chose a 3D Convolutional Neural Network architecture, which is effective for simultaneously encoding appearance and motion. To improve this system, we supplemented the visual information by processing the audio stream with a CNN or an unsupervised speaker diarization system. We further improved this system by adding a second visual modality: motion captured through optical flow. We evaluated our proposal on a public, state-of-the-art benchmark: the AMI corpus. We analysed the contribution of each system to the fusion carried out in order to determine whether a given participant is currently speaking, and we discuss the results obtained. Moreover, we show that, for our application context, adding motion information greatly improves performance. Finally, we show that attention-based fusion improves performance while reducing the standard deviation.
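
As a rough illustration of the attention-based fusion mentioned in the abstract, the sketch below weights per-modality embeddings (visual appearance, optical flow, audio) with learned attention scores before a binary active-speaker classifier. The embedding size, layer widths, the `AttentionFusion` module name, and the classifier head are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of attention-based fusion over modality embeddings
# (visual appearance, optical flow, audio). Dimensions, layer sizes and
# the final classifier are illustrative assumptions, not the authors'
# exact architecture.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # One scalar attention score per modality embedding.
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.Tanh(),
            nn.Linear(64, 1),
        )
        # Binary head: active speaker vs. not speaking.
        self.classifier = nn.Linear(embed_dim, 2)

    def forward(self, embeddings):
        # embeddings: (batch, num_modalities, embed_dim)
        scores = self.score(embeddings)            # (batch, M, 1)
        weights = torch.softmax(scores, dim=1)     # normalise over modalities
        fused = (weights * embeddings).sum(dim=1)  # (batch, embed_dim)
        return self.classifier(fused), weights.squeeze(-1)


if __name__ == "__main__":
    # Toy example: 4 samples, 3 modality embeddings of size 128 each,
    # e.g. produced by a 3D CNN (appearance), a 3D CNN on optical flow,
    # and an audio CNN.
    model = AttentionFusion()
    x = torch.randn(4, 3, 128)
    logits, attn = model(x)
    print(logits.shape, attn.shape)  # torch.Size([4, 2]) torch.Size([4, 3])
```

The attention weights give a per-sample indication of how much each modality contributes to the decision, which is one plausible way such a fusion could reduce variance across runs compared with simple concatenation.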
