Paper Title

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

Paper Authors

Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, Vijay Kumar B.G., Anastasis Stathopoulos, Manmohan Chandraker, Dimitris Metaxas

Paper Abstract

Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images, effectively generating pseudo labels for object detection. Starting with a generic and class-agnostic region proposal mechanism, we use vision and language models to categorize each region of an image into any object category that is required for downstream tasks. We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection, where a model needs to generalize to unseen object categories, and semi-supervised object detection, where additional unlabeled images can be used to improve the model. Our empirical evaluation shows the effectiveness of the pseudo labels in both tasks, where we outperform competitive baselines and achieve a novel state-of-the-art for open-vocabulary object detection. Our code is available at https://github.com/xiaofeng94/VL-PLM.
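The core recipe in the abstract is to take class-agnostic region proposals and classify each one with a vision and language model against an arbitrary target label space, keeping confident predictions as pseudo labels. The snippet below is a minimal sketch of that idea using CLIP as the vision and language model; it is not the authors' VL-PLM implementation, and the proposal source, the `pseudo_label` helper, the category list, and the score threshold are illustrative assumptions.

```python
# Minimal sketch: turning class-agnostic region proposals into pseudo labels
# with CLIP. Illustrates the general recipe from the abstract, not the exact
# VL-PLM pipeline; proposals are assumed to come from any class-agnostic
# generator (e.g., an RPN) and are passed in as (x1, y1, x2, y2) boxes.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Target label space required by the downstream task (open-vocabulary or
# semi-supervised detection); can be any list of category names.
categories = ["cat", "dog", "umbrella", "skateboard"]
text_tokens = clip.tokenize([f"a photo of a {c}" for c in categories]).to(device)

def pseudo_label(image_path, proposals, score_thresh=0.8):
    """Classify each proposal crop with CLIP and keep confident boxes."""
    image = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        text_feat = model.encode_text(text_tokens)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)

        labels = []
        for (x1, y1, x2, y2) in proposals:
            crop = preprocess(image.crop((x1, y1, x2, y2))).unsqueeze(0).to(device)
            img_feat = model.encode_image(crop)
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1).squeeze(0)
            score, idx = probs.max(dim=0)
            if score.item() >= score_thresh:  # keep only confident pseudo labels
                labels.append(((x1, y1, x2, y2), categories[idx], score.item()))
    return labels
```

The confident (box, category, score) triples can then be treated as ground truth when training a detector for open-vocabulary or semi-supervised object detection, as described in the abstract.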
