Paper Title
I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification
Paper Authors
Paper Abstract
Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name. However, word embeddings extracted from pre-trained language models do not necessarily capture visual similarities, resulting in poor zero-shot performance. In this work, we argue that online textual documents, e.g., Wikipedia, contain rich visual descriptions of object classes and can therefore be used as powerful unsupervised side information for ZSL. To this end, we propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents by aligning both modalities in a shared embedding space. To distill discriminative visual words from noisy documents, we introduce a new cross-modal attention module that learns fine-grained interactions between image patches and document words. Consequently, our I2DFormer not only learns highly discriminative document embeddings that capture visual similarities but also gains the ability to localize visually relevant words in image regions. Quantitatively, we demonstrate that our I2DFormer significantly outperforms previous unsupervised semantic embeddings under both the zero-shot and generalized zero-shot learning settings on three public datasets. Qualitatively, we show that our method leads to highly interpretable results where document words can be grounded in image regions.
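The cross-modal attention described in the abstract can be sketched as scaled dot-product attention from image patch embeddings to document word embeddings: each patch attends over the words of a class document, so the highest-weighted word per patch indicates which word is grounded in that image region. This is a minimal illustrative sketch, not the authors' implementation; the function name, dimensions, and random inputs are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(patches, words, d_k):
    """Illustrative patch-to-word attention (hypothetical, not I2DFormer's exact module).

    patches: (num_patches, d) image patch embeddings
    words:   (num_words, d)   document word embeddings
    Returns the attention map and the word context vector per patch.
    """
    # scores[i, j]: relevance of word j to patch i
    scores = patches @ words.T / np.sqrt(d_k)
    attn = softmax(scores, axis=-1)   # each patch distributes weight over words
    attended = attn @ words           # word context aggregated per patch
    return attn, attended

rng = np.random.default_rng(0)
d = 16
patches = rng.normal(size=(9, d))    # e.g. a 3x3 grid of patch embeddings
words = rng.normal(size=(20, d))     # embeddings of 20 document words
attn, ctx = cross_modal_attention(patches, words, d)
# Each row of attn sums to 1; the argmax per row marks the document word
# most strongly associated with that image region, which is the kind of
# word-to-region grounding the abstract highlights as interpretable.
```

Pooling the attended features into a single image-document compatibility score would then allow ranking candidate class documents for a test image, which is the basic recipe for zero-shot classification from documents.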