Paper Title

See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval

Authors

Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, Xiao Wang

Abstract

Text-based person retrieval aims to find the query person based on a textual description. The key is to learn a common latent space mapping between the visual and textual modalities. To achieve this goal, existing works employ segmentation to obtain explicit cross-modal alignments or utilize attention to explore salient alignments. These methods have two shortcomings: 1) Labeling cross-modal alignments is time-consuming. 2) Attention methods can explore salient cross-modal alignments but may ignore some subtle and valuable pairs. To relieve these issues, we introduce an Implicit Visual-Textual (IVT) framework for text-based person retrieval. Different from previous models, IVT utilizes a single network to learn representations for both modalities, which contributes to the visual-textual interaction. To explore fine-grained alignment, we further propose two implicit semantic alignment paradigms: multi-level alignment (MLA) and bidirectional mask modeling (BMM). The MLA module explores finer matching at the sentence, phrase, and word levels, while the BMM module aims to mine more semantic alignments between the visual and textual modalities. Extensive experiments are carried out to evaluate the proposed IVT on public datasets, i.e., CUHK-PEDES, RSTPReID, and ICFG-PEDES. Even without explicit body part alignment, our approach still achieves state-of-the-art performance. Code is available at: https://github.com/TencentYoutuResearch/PersonRetrieval-IVT.
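
The abstract describes retrieval as matching in a common latent space. As a minimal sketch of that general idea only (not the IVT model or its released code), the snippet below ranks gallery images by cosine similarity to a text query. The function names `cosine` and `rank_gallery` and the toy three-dimensional embeddings are hypothetical; in practice the embeddings would come from trained encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_gallery(text_emb, image_embs):
    """Return gallery indices sorted from best to worst match, plus raw scores."""
    scores = [cosine(text_emb, img) for img in image_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy embeddings (assumed values, for illustration only).
text_emb = [1.0, 0.0, 0.0]
image_embs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # partially aligned
]
order, scores = rank_gallery(text_emb, image_embs)
```

Here `order` lists the gallery indices from the most to the least similar image, which is the ranking a retrieval benchmark would then score.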
