Paper Title
Learning to Decompose Visual Features with Latent Textual Prompts
Paper Authors
Paper Abstract
Recent advances in pre-training vision-language models like CLIP have shown great potential in learning transferable visual representations. Nonetheless, for downstream inference, CLIP-like models suffer from either 1) degraded accuracy and robustness in the case of inaccurate text descriptions during retrieval-based inference (the challenge for the zero-shot protocol); or 2) breaking the well-established vision-language alignment (the challenge for linear probing). To address these issues, we propose Decomposed Feature Prompting (DeFo). DeFo leverages a flexible number of learnable embeddings as textual input while maintaining the vision-language dual-model architecture, which enables the model to learn decomposed visual features with the help of feature-level textual prompts. We further use an additional linear layer to perform classification, allowing a scalable size of language inputs. Our empirical study shows DeFo's significance in improving vision-language models. For example, DeFo obtains 73.2% test accuracy on ImageNet with a ResNet-50 backbone without tuning any pretrained weights of either the vision or the language encoder, outperforming zero-shot CLIP by a large margin of 15.0% and outperforming the state-of-the-art vision-language prompt tuning method by 7.6%.
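To make the described architecture concrete, below is a minimal PyTorch sketch of the DeFo setup as stated in the abstract: a flexible set of learnable textual prompt embeddings is fed through a frozen text encoder, the image-to-prompt similarity scores act as the decomposed visual features, and a trainable linear layer maps those scores to class logits. This is an illustrative reconstruction under stated assumptions, not the authors' released implementation; all names (DeFoHead, ToyImageEncoder, ToyTextEncoder, n_queries, prompt_len) are hypothetical, and the toy encoders stand in for real frozen CLIP encoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyImageEncoder(nn.Module):
    """Stand-in for a frozen CLIP image encoder (e.g., ResNet-50)."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, out_dim))

    def forward(self, x):                      # (B, 3, 32, 32) -> (B, out_dim)
        return self.net(x)


class ToyTextEncoder(nn.Module):
    """Stand-in for a frozen CLIP text encoder that consumes raw prompt
    embeddings directly (bypassing the token-embedding lookup)."""
    def __init__(self, embed_dim=512, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(embed_dim, out_dim)

    def forward(self, prompts):                # (n, L, embed_dim) -> (n, out_dim)
        return self.proj(prompts.mean(dim=1))


class DeFoHead(nn.Module):
    """DeFo-style dual-model head: frozen encoders, learnable prompts,
    and a linear classifier over image-to-prompt similarities."""
    def __init__(self, image_encoder, text_encoder, embed_dim=512,
                 prompt_len=16, n_queries=64, n_classes=1000):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Learnable latent textual prompts: n_queries sequences of embeddings.
        self.prompts = nn.Parameter(0.02 * torch.randn(n_queries, prompt_len, embed_dim))
        # Extra linear layer: maps the n_queries similarity scores to class
        # logits, so the number of textual queries scales independently of
        # the number of classes.
        self.classifier = nn.Linear(n_queries, n_classes)
        # Both pretrained encoders stay frozen; only prompts and the
        # classifier receive gradients.
        for enc in (self.image_encoder, self.text_encoder):
            for p in enc.parameters():
                p.requires_grad_(False)

    def forward(self, images):
        img = F.normalize(self.image_encoder(images), dim=-1)       # (B, d)
        txt = F.normalize(self.text_encoder(self.prompts), dim=-1)  # (n_queries, d)
        sims = img @ txt.t()                                        # (B, n_queries)
        return self.classifier(sims)                                # (B, n_classes)


if __name__ == "__main__":
    model = DeFoHead(ToyImageEncoder(), ToyTextEncoder(), n_classes=10)
    logits = model(torch.randn(4, 3, 32, 32))
    print(logits.shape)  # torch.Size([4, 10])
```

In an actual CLIP setting, the text encoder would be the pretrained transformer accepting the learnable embeddings in place of token embeddings; the sketch's mean-pooling text encoder merely keeps the example self-contained and runnable.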