Paper Title

Delving into the Openness of CLIP

Paper Authors

Shuhuai Ren, Lei Li, Xuancheng Ren, Guangxiang Zhao, Xu Sun

Paper Abstract

Contrastive Language-Image Pre-training (CLIP) formulates image classification as an image-to-text matching task, i.e., matching images to the corresponding natural language descriptions instead of discrete category IDs. This allows for open-vocabulary visual recognition, where the model can recognize images from an open class set (also known as an open vocabulary) in a zero-shot manner. However, evaluating the openness of CLIP-like models is challenging, as the models are open to arbitrary vocabulary in theory, but their accuracy varies in practice. To address this, we resort to an incremental perspective to assess the openness through vocabulary expansions, and define extensibility to measure a model's ability to handle novel classes. Our evaluation shows that CLIP-like models are not truly open, and their performance deteriorates as the vocabulary expands. We further dissect the feature space of CLIP from the perspectives of representation alignment and uniformity. Our investigation reveals that the overestimation of openness is due to confusion among competing text features, rather than a failure to capture the similarity between image features and text features of novel classes. We hope that our investigation and analysis will facilitate future research on the CLIP openness issue.
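To make the image-to-text matching formulation concrete, below is a minimal sketch of CLIP zero-shot classification using the Hugging Face `transformers` library with the public "openai/clip-vit-base-patch32" checkpoint. The image path and class names are hypothetical placeholders, and the vocabulary-expansion loop only illustrates the evaluation idea described in the abstract, not the paper's exact extensibility metric.

```python
# Minimal sketch: CLIP zero-shot classification as image-to-text matching,
# evaluated under a small vocabulary and then an expanded one.
# Assumes the Hugging Face `transformers` checkpoint "openai/clip-vit-base-patch32";
# the image path and class names are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("example.jpg")  # hypothetical input image

# A small "closed" class set, then an expanded vocabulary: the abstract's claim
# is that performance can deteriorate as more competing class names are added.
base_vocab = ["cat", "dog"]
expanded_vocab = base_vocab + ["lynx", "wolf", "fox", "tiger"]

for vocab in (base_vocab, expanded_vocab):
    prompts = [f"a photo of a {name}" for name in vocab]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds scaled image-text similarities; softmax turns them
    # into a distribution over the candidate class names.
    probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
    pred = vocab[probs.argmax().item()]
    print(f"vocab size {len(vocab)}: predicted '{pred}' (p={probs.max().item():.2f})")
```

Note that the softmax is taken over all candidate text features, so adding visually or semantically similar class names (e.g., "lynx" alongside "cat") introduces exactly the kind of competition among text features that the abstract identifies as the source of degraded accuracy under vocabulary expansion.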
