Rethinking Audio-visual Synchronization for Active Speaker Detection

Paper Title

Rethinking Audio-visual Synchronization for Active Speaker Detection

Authors

Wuerkaixi, Abudukelimu; Zhang, You; Duan, Zhiyao; Zhang, Changshui

Abstract

Active speaker detection (ASD) systems are important modules for analyzing multi-talker conversations. They aim to detect which speakers or none are talking in a visual scene at any given time. Existing research on ASD does not agree on the definition of active speakers. We clarify the definition in this work and require synchronization between the audio and visual speaking activities. This clarification of definition is motivated by our extensive experiments, through which we discover that existing ASD methods fail in modeling the audio-visual synchronization and often classify unsynchronized videos as active speaking. To address this problem, we propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue. Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
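
The abstract does not spell out the form of the cross-modal contrastive objective, so the following is only a minimal sketch under stated assumptions: an InfoNCE-style loss in which the audio and visual embeddings from the same time step form the positive pair and the visual embeddings from other time steps serve as negatives. The function name `sync_contrastive_loss`, the `temperature` parameter, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def sync_contrastive_loss(audio, visual, temperature=0.1):
    """InfoNCE-style loss (an assumption, not the paper's exact objective).

    audio[i] and visual[i] are treated as the synchronized (positive) pair;
    every visual[j] with j != i is treated as a negative.
    """
    audio = [normalize(a) for a in audio]
    visual = [normalize(v) for v in visual]
    total = 0.0
    for i, a in enumerate(audio):
        # Cosine similarity of this audio step with every visual step.
        logits = [dot(a, v) / temperature for v in visual]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the synchronized step as the target.
        total += log_denominator - logits[i]
    return total / len(audio)


# Toy embeddings: three time steps, two dimensions (illustrative only).
visual = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
audio_synced = [list(v) for v in visual]
# Shift the audio by one step to simulate an unsynchronized clip.
audio_shifted = audio_synced[1:] + audio_synced[:1]

loss_synced = sync_contrastive_loss(audio_synced, visual)
loss_shifted = sync_contrastive_loss(audio_shifted, visual)
```

In this toy setup the synchronized pairing yields a much lower loss than the shifted one, which is the kind of signal a model would need in order to treat unsynchronized speech as not speaking.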
