Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

Paper Title

Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

Authors

Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui

Abstract

Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging the motion and audio that naturally exist in video. We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs, with minimal modifications, to encode video, optical flow, and audio spectrograms. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing the flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves accuracy on base classes while generalizing better to novel classes. MOV achieves state-of-the-art results on the UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.
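
To make the open-vocabulary setup concrete, the sketch below shows one way per-modality embeddings could be combined and matched against class-name text embeddings by cosine similarity. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the cross-modal fusion mechanism, so plain averaging of normalized embeddings stands in for it, and the function names (`l2_normalize`, `fuse_by_averaging`, `classify_open_vocabulary`) and the toy three-dimensional vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse_by_averaging(modality_embeddings):
    """Stand-in fusion: average the unit-length embeddings, then renormalize.

    Assumption: the paper uses a learned cross-modal fusion mechanism;
    averaging is only a placeholder so the example stays self-contained.
    """
    normalized = [l2_normalize(e) for e in modality_embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def classify_open_vocabulary(fused, class_text_embeddings):
    """Score each class name by cosine similarity and return the best match."""
    scores = {}
    for name, text_emb in class_text_embeddings.items():
        text_unit = l2_normalize(text_emb)
        scores[name] = sum(a * b for a, b in zip(fused, text_unit))
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (hypothetical values, not real encoder outputs).
video_emb = [0.9, 0.1, 0.0]
flow_emb = [0.8, 0.2, 0.1]
audio_emb = [0.1, 0.9, 0.2]

# Class names can be changed freely at test time: this is what makes the
# classifier "open-vocabulary".
text_embs = {
    "playing guitar": [0.2, 1.0, 0.1],
    "running": [1.0, 0.0, 0.0],
}

fused = fuse_by_averaging([video_emb, flow_emb, audio_emb])
label, scores = classify_open_vocabulary(fused, text_embs)
print(label, scores)
```

Because classes are represented only by text embeddings, adding a novel class amounts to adding one more entry to `text_embs`, with no retraining of the classifier head in this simplified picture.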

Paper Title