Paper Title
Disentangling Visual Embeddings for Attributes and Objects
Paper Authors
Paper Abstract
We study the problem of compositional zero-shot learning for object-attribute recognition. Prior works use visual features extracted with a backbone network pre-trained for object classification, and thus do not capture the subtly distinct features associated with attributes. To overcome this challenge, these studies employ supervision from the linguistic space and use pre-trained word embeddings to better separate and compose attribute-object pairs for recognition. Analogous to the linguistic embedding space, which already has unique and agnostic embeddings for objects and attributes, we shift the focus back to the visual space and propose a novel architecture that disentangles attribute and object features in the visual space. We use the visually decomposed features to hallucinate embeddings that are representative of seen and novel compositions, to better regularize the learning of our model. Extensive experiments show that our method outperforms existing work by a significant margin on three datasets: MIT-States, UT-Zappos, and a new benchmark built on VAW. The code, models, and dataset splits are publicly available at https://github.com/nirat1606/OADis.
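To make the core idea concrete, below is a minimal PyTorch-style sketch of disentangling a backbone feature into separate attribute and object embeddings and composing them into a pair embedding, including "hallucinating" a novel composition by mixing the attribute of one image with the object of another. This is an illustrative assumption of the general recipe, not the authors' OADis implementation; all module names, dimensions, and the composition function are hypothetical.

import torch
import torch.nn as nn

class DisentangleCompose(nn.Module):
    """Illustrative sketch: split a backbone feature into attribute and
    object embeddings, then compose them into a pair embedding.
    Names and sizes are assumptions, not the OADis architecture."""
    def __init__(self, feat_dim=2048, emb_dim=300):
        super().__init__()
        self.attr_head = nn.Linear(feat_dim, emb_dim)  # attribute branch
        self.obj_head = nn.Linear(feat_dim, emb_dim)   # object branch
        self.compose = nn.Sequential(                  # pair composer
            nn.Linear(2 * emb_dim, emb_dim),
            nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )

    def forward(self, feat):
        attr = self.attr_head(feat)
        obj = self.obj_head(feat)
        pair = self.compose(torch.cat([attr, obj], dim=-1))
        return attr, obj, pair

model = DisentangleCompose()
feats = torch.randn(4, 2048)  # features from a pre-trained backbone (e.g., ResNet)
attr_emb, obj_emb, pair_emb = model(feats)

# Hallucinate an unseen composition by pairing the attribute embedding of
# image 0 with the object embedding of image 1, then composing them; such
# embeddings can serve as extra regularization targets during training.
novel_pair = model.compose(torch.cat([attr_emb[0], obj_emb[1]], dim=-1))

At recognition time, a composed pair embedding of this kind would typically be scored against candidate attribute-object pair embeddings, so that unseen combinations of seen attributes and objects can still be classified.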