Paper Title

Contrastive Audio-Visual Masked Autoencoder

Paper Authors

Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

Paper Abstract

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet in the audio-visual event classification task. Code and pretrained models are at https://github.com/yuangongnd/cav-mae.
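To make the "combining contrastive learning and masked data modeling" idea concrete, below is a minimal sketch of what a combined training objective could look like: a masked-reconstruction term for each modality plus an InfoNCE-style audio-visual correspondence term. The function name, tensor shapes, loss weight lambda_c, and temperature are illustrative assumptions, not the paper's exact formulation; see the official code at https://github.com/yuangongnd/cav-mae for the actual implementation.

import torch
import torch.nn.functional as F

def combined_cav_loss(audio_recon, audio_target, visual_recon, visual_target,
                      audio_emb, visual_emb, lambda_c=0.01, temperature=0.05):
    # Hypothetical sketch: MAE reconstruction loss + audio-visual contrastive loss.
    # Masked-patch reconstruction (mean squared error) for each modality.
    recon_loss = F.mse_loss(audio_recon, audio_target) + F.mse_loss(visual_recon, visual_target)

    # Audio-visual correspondence: paired clips (same batch index) are positives,
    # all other pairs in the batch serve as negatives (InfoNCE-style).
    a = F.normalize(audio_emb, dim=-1)    # (batch, dim)
    v = F.normalize(visual_emb, dim=-1)   # (batch, dim)
    logits = a @ v.t() / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    contrastive_loss = 0.5 * (F.cross_entropy(logits, labels) +
                              F.cross_entropy(logits.t(), labels))

    # Weighted sum of the two self-supervised objectives (weight is an assumption).
    return recon_loss + lambda_c * contrastive_loss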
