Paper Title
I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification
Paper Authors
Paper Abstract
Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and are limited to a single source of information. Large Language Models (LLMs) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks. In this work, we provide a novel perspective on using an LLM to provide text supervision for a zero-shot image classification model. The LLM is provided with a few text descriptions from different annotators as examples. The LLM is conditioned on these examples to generate multiple text descriptions for each class (referred to as views). Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views. We show that each text view of a class provides complementary information, allowing a model to learn a highly discriminative class embedding. Moreover, we show that I2MVFormer is better at consuming the multi-view text supervision from the LLM compared to baseline models. I2MVFormer establishes a new state of the art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.
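The abstract describes conditioning an LLM on a few annotator-written class descriptions and sampling several generated descriptions (views) per class. Below is a minimal sketch of that prompting idea under stated assumptions; it is not the authors' code, and the names `llm_generate`, `build_prompt`, and `generate_views` are hypothetical stand-ins for whatever text-generation interface and prompt format are actually used in the paper.

```python
# Minimal sketch (not the authors' implementation) of few-shot prompting an LLM
# to produce multiple text descriptions ("views") for a class, as described above.

from typing import Callable, List


def build_prompt(example_descriptions: List[str], class_name: str) -> str:
    """Concatenate a few annotator-written descriptions as in-context examples,
    then ask the LLM to describe the target class."""
    shots = "\n\n".join(
        f"Describe the class.\nDescription: {d}" for d in example_descriptions
    )
    return f'{shots}\n\nDescribe the class "{class_name}".\nDescription:'


def generate_views(
    llm_generate: Callable[[str], str],   # hypothetical LLM text-generation call
    example_descriptions: List[str],      # descriptions from different annotators
    class_name: str,
    num_views: int = 3,
) -> List[str]:
    """Sample multiple generated descriptions (views) of one class from the LLM;
    the downstream model would consume these views as class-level supervision."""
    prompt = build_prompt(example_descriptions, class_name)
    return [llm_generate(prompt) for _ in range(num_views)]
```

In this sketch, each call to `llm_generate` returns one view; sampling several times per class yields the complementary descriptions that a multi-view model could then embed.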