Paper Title

Asymmetric Cross-Scale Alignment for Text-Based Person Search

Authors

Zhong Ji, Junhua Hu, Deyin Liu, Lin Yuanbo Wu, Ye Zhao

Abstract

Text-based person search (TBPS) is of significant importance in intelligent surveillance; it aims to retrieve pedestrian images with high semantic relevance to a given text description. This retrieval task is characterized by both modal heterogeneity and fine-grained matching. To accomplish it, one needs to extract multi-scale features from both the image and text domains and then perform cross-modal alignment. However, most existing approaches only consider alignment confined to a single scale, e.g., the image-sentence or region-phrase scale. Such a strategy presumes a fixed alignment scale during feature extraction while overlooking cross-scale alignment, e.g., image-phrase. In this paper, we present a transformer-based model that extracts multi-scale representations and performs Asymmetric Cross-Scale Alignment (ACSA) to precisely align the two modalities. Specifically, ACSA consists of a global-level alignment module and an asymmetric cross-attention module: the former aligns images and texts at the global scale, while the latter applies a cross-attention mechanism to dynamically align cross-modal entities at the region-phrase and image-phrase scales. Extensive experiments on two benchmark datasets, CUHK-PEDES and RSTPReid, demonstrate the effectiveness of our approach. Code is available at https://github.com/mul-hjh/ACSA.
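To make the two modules named in the abstract concrete, here is a minimal PyTorch sketch. It is not the authors' implementation (the official code is at https://github.com/mul-hjh/ACSA); the class names GlobalAlignment and AsymmetricCrossAttention, the tensor shapes, and the use of nn.MultiheadAttention are illustrative assumptions, presuming that transformer encoders have already produced global embeddings and region/phrase token features.

```python
# Minimal sketch of the two components the abstract names. NOT the authors'
# implementation (see https://github.com/mul-hjh/ACSA for the official code);
# names, shapes, and the use of nn.MultiheadAttention are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalAlignment(nn.Module):
    """Image-sentence alignment: cosine similarity of global embeddings."""

    def forward(self, img_global: torch.Tensor, txt_global: torch.Tensor) -> torch.Tensor:
        # img_global: (B, D), txt_global: (B, D)
        img_global = F.normalize(img_global, dim=-1)
        txt_global = F.normalize(txt_global, dim=-1)
        return img_global @ txt_global.t()  # (B, B) pairwise similarities


class AsymmetricCrossAttention(nn.Module):
    """Queries come from one modality/scale, keys/values from another
    (e.g., phrase tokens attending over image regions), hence 'asymmetric'."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, phrases: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # phrases: (B, P, D) phrase-scale text features
        # regions: (B, R, D) region-scale (or whole-image) visual features
        out, _ = self.attn(query=phrases, key=regions, value=regions)
        return out  # (B, P, D) visually grounded phrase features


# Hypothetical usage with made-up sizes:
B, P, R, D = 4, 6, 49, 256
sim = GlobalAlignment()(torch.randn(B, D), torch.randn(B, D))  # (4, 4)
fused = AsymmetricCrossAttention(D)(torch.randn(B, P, D), torch.randn(B, R, D))  # (4, 6, 256)
```

The asymmetry in this sketch lies in the query/key roles: phrase tokens query visual tokens (regions or the whole image) rather than both directions being attended symmetrically, which matches the cross-scale pairings the abstract describes.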
