Paper Title
ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language
Paper Authors
Paper Abstract
Person search by natural language aims at retrieving, from a large-scale image pool, the specific person who matches a given textual description. While most current methods treat the task as holistic matching between visual and textual features, we approach it from an attribute-aligning perspective that allows grounding specific attribute phrases to their corresponding visual regions. We achieve this, along with a performance boost, through robust feature learning in which the referred identity can be accurately bundled by multiple attribute visual cues. Concretely, our Visual-Textual Attribute Alignment model (dubbed ViTAA) learns to disentangle the feature space of a person into subspaces corresponding to attributes using a lightweight auxiliary attribute segmentation branch. It then aligns these visual features with the textual attributes parsed from the sentences using a novel contrastive learning loss. We validate our ViTAA framework through extensive experiments on the tasks of person search by natural language and by attribute-phrase queries, on which our system achieves state-of-the-art performance. Code will be publicly available upon publication.
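To make the alignment idea in the abstract concrete, below is a minimal, illustrative sketch (not the authors' released code, and not the paper's exact loss formulation): per-attribute visual features and parsed textual attribute features are projected into a shared embedding space, and matched image-phrase pairs are pulled together with a symmetric InfoNCE-style contrastive objective. All module names, dimensions, and the loss form here are hypothetical stand-ins.

```python
# Hypothetical sketch of visual-textual attribute alignment with a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttributeAlignHead(nn.Module):
    """Projects one attribute's visual and textual features into a shared subspace."""

    def __init__(self, vis_dim=2048, txt_dim=768, embed_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, vis_feat, txt_feat):
        # L2-normalize so that dot products below are cosine similarities.
        v = F.normalize(self.vis_proj(vis_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v, t


def contrastive_alignment_loss(v, t, temperature=0.07):
    """Symmetric contrastive loss over a batch: matched (region, phrase) pairs sit on
    the diagonal of the similarity matrix; all other pairs act as negatives."""
    logits = v @ t.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # diagonal = positives
    loss_v2t = F.cross_entropy(logits, targets)         # visual -> textual direction
    loss_t2v = F.cross_entropy(logits.t(), targets)     # textual -> visual direction
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    head = AttributeAlignHead()
    vis = torch.randn(8, 2048)  # e.g. pooled features from one attribute region
    txt = torch.randn(8, 768)   # e.g. encoded attribute phrase for the same identities
    v, t = head(vis, txt)
    print(contrastive_alignment_loss(v, t).item())
```

In the full model, one such alignment would be computed per attribute subspace (e.g. head, upper body, lower body), with the auxiliary segmentation branch supplying the region-level visual features; the sketch above shows only the shared-space contrastive step under those assumptions.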