Paper Title

Dissecting Deep Metric Learning Losses for Image-Text Retrieval

Paper Authors

Hong Xuan, Xi Chen

Abstract

Visual-Semantic Embedding (VSE) is a prevalent approach in image-text retrieval that learns a joint embedding space between the image and language modalities in which semantic similarities are preserved. The triplet loss with hard-negative mining has become the de-facto objective for most VSE methods. Inspired by recent progress in deep metric learning (DML) in the image domain, which has given rise to new loss functions that outperform the triplet loss, we revisit the problem of finding better objectives for VSE in image-text matching. Despite some attempts to design losses based on gradient movement, most DML losses are defined empirically in the embedding space. Instead of directly applying these loss functions, which may lead to sub-optimal gradient updates to the model parameters, we present a novel Gradient-based Objective AnaLysis framework, or \textit{GOAL}, to systematically analyze the combination and reweighting of the gradients in existing DML functions. With the help of this analysis framework, we further propose a new family of objectives in the gradient space that explores different gradient combinations. When the gradients are not integrable into a valid loss function, we implement our proposed objectives so that they operate directly in the gradient space rather than on losses in the embedding space. Comprehensive experiments demonstrate that our new objectives consistently improve performance over baselines across different visual/text features and model frameworks. We also show the generalizability of the GOAL framework by extending it to other models that use triplet-family losses, including vision-language models with heavy cross-modal interactions, achieving state-of-the-art results on the image-text retrieval tasks on COCO and Flickr30K.
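The mechanism the abstract hinges on is that a training signal can be specified directly as gradients on the embeddings, even when those gradients are not the derivative of any valid scalar loss. Below is a minimal PyTorch sketch of that idea, assuming L2-normalized embeddings and in-batch negatives: first the de-facto baseline, triplet loss with hard-negative mining, then a hypothetical gradient-space objective applied via `torch.autograd.backward(tensors, grad_tensors)`. The `gradient_space_step` function and its `reweight` scheme are illustrative assumptions for exposition, not the paper's actual GOAL objectives.

```python
import torch
import torch.nn.functional as F


def triplet_loss_hard_negative(img, txt, margin=0.2):
    """Baseline: triplet loss with in-batch hard-negative mining."""
    img = F.normalize(img, dim=1)
    txt = F.normalize(txt, dim=1)
    sim = img @ txt.t()                       # (B, B) cosine similarities
    pos = sim.diag()                          # matched pairs sit on the diagonal
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # hardest negative per image (rows) and per text (columns)
    neg_i2t = sim.masked_fill(mask, float("-inf")).max(dim=1).values
    neg_t2i = sim.masked_fill(mask, float("-inf")).max(dim=0).values
    return ((margin + neg_i2t - pos).clamp(min=0).mean()
            + (margin + neg_t2i - pos).clamp(min=0).mean())


def gradient_space_step(img, txt, reweight=lambda s: torch.sigmoid(5 * s)):
    """Hypothetical gradient-space objective (an assumption, not GOAL itself):
    hand-crafted, reweighted gradients are backpropagated directly through the
    embeddings, so no scalar loss in the embedding space is ever formed."""
    img_n = F.normalize(img, dim=1)
    txt_n = F.normalize(txt, dim=1)
    sim = img_n @ txt_n.t()
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    with torch.no_grad():
        # illustrative reweighting: weight each negative pair by its similarity
        w = reweight(sim).masked_fill(mask, 0.0)
        # pull each positive pair together, push weighted negatives apart;
        # this combination need not integrate to any valid loss function
        grad_img = -txt_n + (w @ txt_n) / w.sum(dim=1, keepdim=True).clamp(min=1e-8)
        grad_txt = -img_n + (w.t() @ img_n) / w.sum(dim=0).unsqueeze(1).clamp(min=1e-8)
    # inject the custom gradients in place of a loss.backward() call
    torch.autograd.backward([img_n, txt_n], [grad_img, grad_txt])
```

Under these assumptions, training replaces `loss.backward()` with `gradient_space_step(img_emb, txt_emb)` while the optimizer step is unchanged, which is what allows non-integrable gradient combinations to serve as objectives.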
