Paper Title

A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval

Authors

Zhixiong Zeng, Wenji Mao

Abstract

Cross-Modal Retrieval (CMR) is an important research topic in multimodal computing and information retrieval, which takes one type of data as the query to retrieve relevant data of another type. It has been widely used in many real-world applications. Recently, vision-language pre-trained models represented by CLIP have demonstrated their superiority in learning visual and textual representations and achieved impressive performance on various vision-and-language tasks. Although CLIP, as well as previous pre-trained models, has shown great performance improvements on unsupervised CMR, the performance and impact of these pre-trained models on supervised CMR have rarely been explored, due to the lack of common representations for multimodal class-level associations. In this paper, we take CLIP as the current representative vision-language pre-trained model and conduct a comprehensive empirical study. We evaluate its performance and impact on supervised CMR, and attempt to answer several key research questions. To this end, we first propose a novel model, CLIP4CMR (CLIP enhanced network for Cross-Modal Retrieval), which employs the pre-trained CLIP as the backbone network to perform supervised CMR. Then, by means of the CLIP4CMR framework, we revisit the design of different learning objectives in current CMR methods to provide new insights on model design. Moreover, we investigate the aspects of most concern in applying CMR, including robustness to modality imbalance and sensitivity to hyper-parameters, to provide new perspectives for practical applications. Through extensive experiments, we show that CLIP4CMR achieves state-of-the-art results with prominent improvements on the benchmark datasets, and can serve as a fundamental framework for empirically studying the key research issues of supervised CMR, with significant implications for model design and practical considerations.
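The abstract describes a general recipe: pre-trained encoder features from each modality are projected into a common representation space, a class-level (supervised) objective aligns both modalities with the shared labels, and retrieval then ranks items by similarity in that space. The following is a minimal NumPy sketch of this recipe under stated assumptions, not the authors' implementation: the 512-d input vectors stand in for CLIP image/text features (here just random placeholders), and the projection heads, shared classifier, and cross-entropy objective are illustrative choices of the class-level learning objective the paper revisits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for CLIP backbone outputs (CLIP ViT-B/32 yields 512-d
# features); in the real model these would come from the pre-trained encoders.
D_CLIP, D_COMMON, N_CLASSES, BATCH = 512, 256, 10, 8
img_feats = rng.standard_normal((BATCH, D_CLIP))
txt_feats = rng.standard_normal((BATCH, D_CLIP))
labels = rng.integers(0, N_CLASSES, BATCH)

# Modality-specific projection heads mapping into a shared common space,
# plus a shared classifier supplying the class-level supervised signal.
W_img = rng.standard_normal((D_CLIP, D_COMMON)) * 0.02
W_txt = rng.standard_normal((D_CLIP, D_COMMON)) * 0.02
W_cls = rng.standard_normal((D_COMMON, N_CLASSES)) * 0.02

def project(x, W):
    """Project features into the common space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def cross_entropy(z, y, W):
    """Softmax cross-entropy of common-space features against class labels."""
    logits = z @ W
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(y)), y]).mean()

z_img = project(img_feats, W_img)
z_txt = project(txt_feats, W_txt)

# Class-level objective applied to both modalities in the common space, so
# that items sharing a label align regardless of modality.
loss = cross_entropy(z_img, labels, W_cls) + cross_entropy(z_txt, labels, W_cls)

# Retrieval: rank text items by cosine similarity to an image query
# (features are L2-normalized, so a dot product is cosine similarity).
sims = z_img @ z_txt.T
ranking = np.argsort(-sims[0])  # text indices, best match first
```

In a trained model the projection and classifier weights would be learned by minimizing `loss` (e.g. with gradient descent), and cross-modal retrieval in either direction reduces to the same similarity ranking in the common space.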
