Paper Title
Cross-Modal Adapter for Text-Video Retrieval
Paper Authors
Paper Abstract
Text-video retrieval is an important multi-modal learning task, where the goal is to retrieve the most relevant video for a given text query. Recently, pre-trained models, e.g., CLIP, have shown great potential on this task. However, as pre-trained models scale up, fully fine-tuning them on text-video retrieval datasets carries a high risk of overfitting. Moreover, in practice, it would be costly to train and store a separate large model for each task. To overcome these issues, we present a novel $\textbf{Cross-Modal Adapter}$ for parameter-efficient fine-tuning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. However, there are two notable differences. First, our method is designed for the multi-modal domain. Second, it allows early cross-modal interactions between CLIP's two encoders. Although surprisingly simple, our approach has three notable benefits: (1) it reduces the number of fine-tuned parameters by $\textbf{99.6}\%$ and alleviates the problem of overfitting, (2) it saves approximately 30% of training time, and (3) it keeps all pre-trained parameters fixed, allowing the pre-trained model to be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, our approach achieves superior or comparable performance compared to fully fine-tuned methods on the MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets. The code will be available at \url{https://github.com/LeapLabTHU/Cross-Modal-Adapter}.
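The abstract does not spell out the adapter architecture, so below is a minimal PyTorch sketch of one plausible design: a bottleneck adapter whose middle layer is shared between the text and video branches, which is one way to realize the early cross-modal interaction the abstract describes while keeping the CLIP backbone frozen. The class name, dimensions, and the shared-layer placement are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    """Hypothetical bottleneck adapter with a weight-shared cross-modal layer.

    Each modality down-projects its features, passes through a linear layer
    shared across modalities (the assumed point of cross-modal interaction),
    and up-projects back, with a residual connection. Only these small
    layers are trained; the CLIP backbone stays frozen.
    """

    def __init__(self, d_text: int, d_video: int, bottleneck: int = 64):
        super().__init__()
        self.down_text = nn.Linear(d_text, bottleneck)
        self.down_video = nn.Linear(d_video, bottleneck)
        self.shared = nn.Linear(bottleneck, bottleneck)  # shared between modalities
        self.up_text = nn.Linear(bottleneck, d_text)
        self.up_video = nn.Linear(bottleneck, d_video)
        self.act = nn.GELU()

    def forward(self, text_h: torch.Tensor, video_h: torch.Tensor):
        # Residual bottleneck update per modality through the shared layer.
        t = self.up_text(self.shared(self.act(self.down_text(text_h))))
        v = self.up_video(self.shared(self.act(self.down_video(video_h))))
        return text_h + t, video_h + v

# Usage sketch (hypothetical shapes): freeze the backbone, train only the adapter.
adapter = CrossModalAdapter(d_text=512, d_video=512)
text_h = torch.randn(8, 512)   # pooled text features from a frozen encoder
video_h = torch.randn(8, 512)  # pooled video features from a frozen encoder
text_h, video_h = adapter(text_h, video_h)
```

With a bottleneck of 64 and 512-dimensional features, this sketch adds on the order of $10^5$ trainable parameters, which is consistent in spirit with the reported 99.6% reduction relative to fully fine-tuning a frozen CLIP backbone.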