Paper Title
Audiovisual Masked Autoencoders
Paper Authors
Paper Abstract
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
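To make the masked autoencoding setup described in the abstract concrete, below is a minimal sketch of joint audiovisual masked autoencoding in PyTorch. All module names, patch sizes, dimensions, and the masking ratio are illustrative assumptions, not the paper's actual architecture or hyperparameters; positional embeddings and other details are omitted for brevity.

```python
# Hypothetical sketch: a shared Transformer encodes the visible subset of video and
# audio-spectrogram tokens, and a light decoder reconstructs the raw patches.
import torch
import torch.nn as nn


class AudiovisualMAE(nn.Module):
    def __init__(self, embed_dim=256, depth=4, num_heads=4, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Separate linear "tokenizers" for RGB video patches and spectrogram patches.
        self.video_proj = nn.Linear(3 * 16 * 16, embed_dim)
        self.audio_proj = nn.Linear(16 * 16, embed_dim)
        # Modality embeddings let the shared encoder tell audio and video tokens apart.
        self.modality_embed = nn.Parameter(torch.zeros(2, embed_dim))
        enc_layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        dec_layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, 2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Per-modality heads predict the raw patch values.
        self.video_head = nn.Linear(embed_dim, 3 * 16 * 16)
        self.audio_head = nn.Linear(embed_dim, 16 * 16)

    def random_mask(self, tokens):
        """Keep a random subset of tokens; return kept tokens and restore indices."""
        b, n, d = tokens.shape
        n_keep = int(n * (1 - self.mask_ratio))
        noise = torch.rand(b, n, device=tokens.device)
        ids_shuffle = noise.argsort(dim=1)
        ids_restore = ids_shuffle.argsort(dim=1)
        ids_keep = ids_shuffle[:, :n_keep]
        kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
        return kept, ids_restore, n_keep

    def forward(self, video_patches, audio_patches):
        # video_patches: (B, Nv, 3*16*16), audio_patches: (B, Na, 16*16)
        v = self.video_proj(video_patches) + self.modality_embed[0]
        a = self.audio_proj(audio_patches) + self.modality_embed[1]
        tokens = torch.cat([v, a], dim=1)
        kept, ids_restore, n_keep = self.random_mask(tokens)
        latent = self.encoder(kept)  # encode only the visible tokens
        # Re-insert mask tokens and unshuffle back to the original order before decoding.
        b, n = ids_restore.shape
        mask_tokens = self.mask_token.expand(b, n - n_keep, -1)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, full.size(-1)))
        decoded = self.decoder(full)
        nv = video_patches.size(1)
        v_pred = self.video_head(decoded[:, :nv])
        a_pred = self.audio_head(decoded[:, nv:])
        # Reconstruction loss on raw patches (computed over all patches for brevity;
        # MAE-style training typically restricts it to the masked patches).
        return nn.functional.mse_loss(v_pred, video_patches) + \
               nn.functional.mse_loss(a_pred, audio_patches)


# Toy usage with random inputs standing in for tokenized video frames and a log-mel spectrogram.
model = AudiovisualMAE()
video = torch.randn(2, 196, 3 * 16 * 16)  # e.g. 14x14 patches from one frame
audio = torch.randn(2, 64, 16 * 16)       # e.g. patches from a spectrogram
print(model(video, audio).item())
```

One design point this sketch illustrates: because both modalities are projected into a common token space and encoded jointly, the same pretrained encoder can later be fine-tuned on audiovisual or unimodal downstream tasks, which is the property the abstract highlights.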