Paper Title
Cross-Modal Adapter for Text-Video Retrieval
Paper Authors
Paper Abstract
Text-video retrieval is an important multi-modal learning task, where the goal is to retrieve the most relevant video for a given text query. Recently, pre-trained models, e.g., CLIP, have shown great potential on this task. However, as pre-trained models scale up, fully fine-tuning them on text-video retrieval datasets carries a high risk of overfitting. Moreover, in practice, it would be costly to train and store a separate large model for each task. To overcome these issues, we present a novel $\textbf{Cross-Modal Adapter}$ for parameter-efficient fine-tuning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. However, there are two notable differences. First, our method is designed for the multi-modal domain. Second, it allows early cross-modal interactions between CLIP's two encoders. Although surprisingly simple, our approach has three notable benefits: (1) it reduces the number of fine-tuned parameters by $\textbf{99.6}\%$ and alleviates the problem of overfitting, (2) it saves approximately 30% of training time, and (3) it keeps all pre-trained parameters fixed, allowing the pre-trained model to be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, our approach achieves superior or comparable performance compared to fully fine-tuned methods on the MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets. The code will be available at \url{https://github.com/LeapLabTHU/Cross-Modal-Adapter}.
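The abstract does not spell out the adapter architecture, so below is a minimal PyTorch sketch of one plausible design: a bottleneck adapter whose middle layer is shared between the text and video branches, which is one way to realize the early cross-modal interaction the abstract describes while keeping the CLIP backbone frozen. The class name, dimensions, and the shared-layer placement are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    """Hypothetical bottleneck adapter with a weight-shared cross-modal layer.

    Each modality down-projects its features, passes through a linear layer
    shared across modalities (the assumed point of cross-modal interaction),
    and up-projects back, with a residual connection. Only these small
    layers are trained; the CLIP backbone stays frozen.
    """

    def __init__(self, d_text: int, d_video: int, bottleneck: int = 64):
        super().__init__()
        self.down_text = nn.Linear(d_text, bottleneck)
        self.down_video = nn.Linear(d_video, bottleneck)
        self.shared = nn.Linear(bottleneck, bottleneck)  # shared between modalities
        self.up_text = nn.Linear(bottleneck, d_text)
        self.up_video = nn.Linear(bottleneck, d_video)
        self.act = nn.GELU()

    def forward(self, text_h: torch.Tensor, video_h: torch.Tensor):
        # Residual bottleneck update per modality through the shared layer.
        t = self.up_text(self.shared(self.act(self.down_text(text_h))))
        v = self.up_video(self.shared(self.act(self.down_video(video_h))))
        return text_h + t, video_h + v

# Usage sketch (hypothetical shapes): freeze the backbone, train only the adapter.
adapter = CrossModalAdapter(d_text=512, d_video=512)
text_h = torch.randn(8, 512)   # pooled text features from a frozen encoder
video_h = torch.randn(8, 512)  # pooled video features from a frozen encoder
text_h, video_h = adapter(text_h, video_h)
```

With a bottleneck of 64 and 512-dimensional features, this sketch adds on the order of $10^5$ trainable parameters, which is consistent in spirit with the reported 99.6% reduction relative to fully fine-tuning a frozen CLIP backbone.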