Paper Title


BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval

Authors

Zhang, Wenqiao, Guo, Jiannan, Li, Mengze, Shi, Haochen, Zhang, Shengyu, Li, Juncheng, Tang, Siliang, Zhuang, Yueting

Abstract


Content-Based Image Retrieval (CIR) aims to search for a target image by jointly comprehending the composition of an example image and a complementary text, which potentially impacts a wide variety of real-world applications, such as internet search and fashion retrieval. In this scenario, the input image serves as an intuitive context and background for the search, while the corresponding language explicitly specifies how particular characteristics of the query image should be modified to obtain the intended target image. This task is challenging since it necessitates learning and understanding a composite image-text representation by incorporating cross-granular semantic updates. In this paper, we tackle this task with a novel \underline{\textbf{B}}ottom-up cr\underline{\textbf{O}}ss-modal \underline{\textbf{S}}emantic compo\underline{\textbf{S}}ition (\textbf{BOSS}) with Hybrid Counterfactual Training framework, which sheds new light on the CIR task by studying it from two previously overlooked perspectives: \emph{implicit bottom-up composition of the visiolinguistic representation} and \emph{explicit fine-grained correspondence of query-target construction}. On the one hand, we leverage the implicit interaction and composition of cross-modal embeddings, from bottom local characteristics to top global semantics, preserving and transforming the visual representation conditioned on language semantics over several continuous steps for effective target image search. On the other hand, we devise a hybrid counterfactual training strategy that reduces the model's ambiguity for similar queries.
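The CIR setting described above can be illustrated with a minimal sketch: an image embedding is fused with a text embedding through a gated residual composition, and the composed query is matched against gallery embeddings by cosine similarity. This is a generic composition operator in the spirit of gated fusion used by composed-retrieval models, not the paper's actual BOSS architecture; all function names, weight shapes, and the specific gating formula here are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def compose(img_emb, txt_emb, W_gate, W_res):
    """Gated residual composition of image and text embeddings.

    A gate decides how much of the original visual representation to
    preserve versus how much text-conditioned update to apply.
    (Hypothetical operator for illustration, not the BOSS model.)
    """
    joint = np.concatenate([img_emb, txt_emb])
    gate = 1.0 / (1.0 + np.exp(-(W_gate @ joint)))  # sigmoid gate in (0, 1)
    update = np.tanh(W_res @ joint)                 # text-conditioned residual
    return gate * img_emb + (1.0 - gate) * update

def retrieve(query_emb, gallery_embs):
    """Rank gallery images by cosine similarity to the composed query."""
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_embs)
    scores = g @ q                                  # cosine similarities
    return np.argsort(-scores), scores              # best match first
```

A retrieval call then reduces to `compose(img, txt, W_gate, W_res)` followed by `retrieve(...)` over the candidate gallery; in a trained system the embeddings would come from learned image and text encoders rather than random vectors.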
