Paper Title
Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer
Paper Authors
Paper Abstract
Real-world recognition systems often encounter the challenge of unseen labels. To identify such unseen labels, multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge via a pre-trained textual label embedding (e.g., GloVe). However, such methods exploit only single-modal knowledge from a language model and ignore the rich semantic information inherent in image-text pairs. In contrast, recently developed open-vocabulary (OV) methods successfully exploit this image-text pair information for object detection and achieve impressive performance. Inspired by the success of OV-based methods, we propose a novel open-vocabulary framework, named Multi-modal Knowledge Transfer (MKT), for multi-label classification. Specifically, our method exploits the multi-modal knowledge of image-text pairs based on a vision-and-language pre-training (VLP) model. To facilitate transferring the image-text matching ability of the VLP model, knowledge distillation is employed to guarantee the consistency of image and label embeddings, along with prompt tuning to further update the label embeddings. To further enable the recognition of multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets. The source code is available at https://github.com/sunanhe/MKT.
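To make the two ideas sketched in the abstract concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation; see the GitHub link for that): (1) an embedding-consistency distillation loss that keeps the trainable image embedding close to a frozen VLP (e.g., CLIP) image embedding, and (2) a simple two-stream head that scores label embeddings against both local patch features and a global image feature. All class, function, and variable names here are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamHead(nn.Module):
    """Illustrative two-stream head: scores labels with both a global image
    embedding and local patch embeddings (names/shapes are assumptions)."""

    def __init__(self, dim: int):
        super().__init__()
        self.local_proj = nn.Linear(dim, dim)   # projects local patch tokens
        self.global_proj = nn.Linear(dim, dim)  # projects the pooled/global token

    def forward(self, patch_tokens, global_token, label_embeds):
        # patch_tokens: (B, N, D), global_token: (B, D), label_embeds: (C, D)
        local = F.normalize(self.local_proj(patch_tokens), dim=-1)
        glob = F.normalize(self.global_proj(global_token), dim=-1)
        labels = F.normalize(label_embeds, dim=-1)

        # Local stream: take the best-matching patch per label,
        # which helps when several small objects appear in one image.
        local_logits = torch.einsum("bnd,cd->bnc", local, labels).max(dim=1).values
        # Global stream: whole-image similarity to each label embedding.
        global_logits = glob @ labels.t()
        return (local_logits + global_logits) / 2  # (B, C) label scores


def embedding_consistency_loss(student_img_embed, teacher_img_embed):
    """Cosine-distance distillation term that keeps the student's image
    embedding consistent with the frozen VLP teacher's embedding."""
    s = F.normalize(student_img_embed, dim=-1)
    t = F.normalize(teacher_img_embed.detach(), dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()
```

In this sketch the label embeddings would come from the VLP text encoder (optionally refined by prompt tuning, as the abstract describes), and the distillation term would be added to the multi-label classification loss during training.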