Paper Title
Visually Guided Self Supervised Learning of Speech Representations
Paper Authors
Paper Abstract
Self-supervised representation learning has recently attracted considerable research interest in both the audio and visual modalities. However, most works typically focus on a single modality or feature alone, and there has been very limited work studying the interaction between the two modalities for learning self-supervised representations. We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment. Through this process, the audio encoder network learns useful speech representations, which we evaluate on emotion recognition and speech recognition. We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition. This demonstrates the potential of visual supervision for learning audio representations, a novel approach to self-supervised learning that has not been explored in the past. The proposed unsupervised audio features can leverage a virtually unlimited amount of unlabelled audiovisual speech as training data and have a large number of potentially promising applications.
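As an illustration of the training scheme described in the abstract, below is a minimal PyTorch sketch of the generative audio-to-video objective: an audio encoder maps the speech signal to per-frame features, a generator animates a still identity image conditioned on those features, and the generated frames are optimized against the real video with a reconstruction loss, so the encoder is trained without any labels. All module names, architectures, tensor shapes, and the choice of L1 loss here are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (hypothetical modules, not the authors' architecture).
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Maps a log-mel spectrogram to a sequence of speech representations."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_mels, hidden_size=dim, batch_first=True)

    def forward(self, spec):               # spec: (B, T, n_mels)
        feats, _ = self.rnn(spec)          # feats: (B, T, dim)
        return feats

class VideoGenerator(nn.Module):
    """Animates a still face image conditioned on the audio features."""
    def __init__(self, dim=256, img=64):
        super().__init__()
        self.img = img
        # Toy per-frame decoder from [audio feature; flattened still image];
        # a real system would use a convolutional generator instead.
        self.decode = nn.Linear(dim + 3 * img * img, 3 * img * img)

    def forward(self, feats, still_image):  # still_image: (B, 3, img, img)
        B, T, _ = feats.shape
        identity = still_image.flatten(1).unsqueeze(1).expand(B, T, -1)
        frames = self.decode(torch.cat([feats, identity], dim=-1))
        return frames.view(B, T, 3, self.img, self.img)  # generated video

encoder, generator = AudioEncoder(), VideoGenerator()
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(generator.parameters()), lr=1e-4)

def train_step(spec, still_image, real_video):
    """One self-supervised step: the real video is the only supervision."""
    feats = encoder(spec)
    fake_video = generator(feats, still_image)
    loss = nn.functional.l1_loss(fake_video, real_video)  # reconstruction
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

After training, the generator would be discarded and the frozen encoder's features reused for the downstream tasks named in the abstract (emotion recognition and speech recognition), typically by training a lightweight classifier on top of them.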