Paper Title
Semantically Grounded Visual Embeddings for Zero-Shot Learning
Paper Authors
Paper Abstract
Zero-shot learning methods rely on fixed visual and semantic embeddings, extracted from independent vision and language models, both pre-trained for other large-scale tasks. This is a weakness of current zero-shot learning frameworks, as such disjoint embeddings fail to adequately associate visual and textual information with their shared semantic content. Therefore, we propose to learn semantically grounded and enriched visual information by computing a joint image and text model with a two-stream network on a proxy task. To improve the alignment between image representations and the textual representations provided by attributes, we leverage ancillary captions to provide grounded semantic information. Our method, dubbed joint embeddings for zero-shot learning, is evaluated on several benchmark datasets, improving the performance of existing state-of-the-art methods in both standard ($+1.6\%$ on aPY, $+2.6\%$ on FLO) and generalized ($+2.1\%$ on AWA2, $+2.2\%$ on CUB) zero-shot recognition.
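To make the two-stream idea concrete, below is a minimal sketch in PyTorch of a joint image-text embedding trained on an alignment proxy task. The layer sizes, the symmetric contrastive loss, and the names `TwoStreamEmbedding` and `alignment_loss` are illustrative assumptions, not the paper's exact architecture or objective.

```python
# Minimal sketch of a two-stream joint image-text embedding.
# Assumptions: PyTorch; pre-extracted image features (e.g. CNN pooling layer)
# and textual features (attribute/caption embeddings); dimensions and the
# contrastive proxy loss are illustrative, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamEmbedding(nn.Module):
    """Projects visual and textual features into a shared semantic space."""
    def __init__(self, img_dim=2048, txt_dim=300, joint_dim=512):
        super().__init__()
        # Visual stream: maps pre-extracted image features.
        self.visual = nn.Sequential(
            nn.Linear(img_dim, 1024), nn.ReLU(), nn.Linear(1024, joint_dim)
        )
        # Textual stream: maps attribute/caption embeddings.
        self.textual = nn.Sequential(
            nn.Linear(txt_dim, 1024), nn.ReLU(), nn.Linear(1024, joint_dim)
        )

    def forward(self, img_feat, txt_feat):
        v = F.normalize(self.visual(img_feat), dim=-1)
        t = F.normalize(self.textual(txt_feat), dim=-1)
        return v, t

def alignment_loss(v, t, temperature=0.1):
    """Symmetric contrastive loss pulling matched image/text pairs together,
    a common choice of proxy objective (hypothetical here)."""
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Usage: a batch of image features paired with their caption/attribute embeddings.
imgs = torch.randn(32, 2048)
txts = torch.randn(32, 300)
model = TwoStreamEmbedding()
v, t = model(imgs, txts)
loss = alignment_loss(v, t)
loss.backward()
```

At test time, zero-shot recognition would score an image embedding against the embedded attribute descriptions of unseen classes in this shared space; the sketch only covers the training-side alignment.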