Paper Title

X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks

Authors

Zhaowei Cai, Gukyeong Kwon, Avinash Ravichandran, Erhan Bas, Zhuowen Tu, Rahul Bhotika, Stefano Soatto

Abstract

In this paper, we study the challenging instance-wise vision-language tasks, where free-form language is required to align with objects instead of the whole image. To address these tasks, we propose X-DETR, whose architecture has three major components: an object detector, a language encoder, and vision-language alignment. The vision and language streams are independent until the end, and they are aligned using an efficient dot-product operation. The whole network is trained end-to-end, so that the detector is optimized for the vision-language tasks instead of being an off-the-shelf component. To overcome the limited size of paired object-language annotations, we leverage other weak types of supervision to expand the knowledge coverage. This simple yet effective architecture of X-DETR shows good accuracy and fast speed on multiple instance-wise vision-language tasks, e.g., 16.4 AP on LVIS detection of 1.2K categories at ~20 frames per second, without using any LVIS annotation during training.
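
The abstract describes keeping the vision and language streams independent until the end and aligning them with an efficient dot product. Below is a minimal, hypothetical PyTorch sketch of what such an alignment head could look like; the module name, feature dimensions, projection layers, and temperature are assumptions for illustration only, not the authors' implementation.

```python
# Illustrative sketch (not the X-DETR source code): object embeddings from a
# detector stream and text embeddings from a language encoder are projected
# into a shared space and scored with a dot product.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DotProductAlignment(nn.Module):
    """Scores each detected object against each language query."""

    def __init__(self, vis_dim=256, txt_dim=768, embed_dim=256, temperature=0.07):
        super().__init__()
        # Dimensions and temperature are assumed values for this sketch.
        self.vis_proj = nn.Linear(vis_dim, embed_dim)   # projects detector features
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # projects language features
        self.temperature = temperature

    def forward(self, object_feats, text_feats):
        # object_feats: (num_objects, vis_dim) from the detection stream
        # text_feats:   (num_queries, txt_dim) from the language encoder
        v = F.normalize(self.vis_proj(object_feats), dim=-1)
        t = F.normalize(self.txt_proj(text_feats), dim=-1)
        # (num_objects, num_queries) similarity matrix; higher = better match
        return v @ t.t() / self.temperature


if __name__ == "__main__":
    align = DotProductAlignment()
    obj = torch.randn(100, 256)   # e.g., 100 object queries from a DETR-style detector
    txt = torch.randn(5, 768)     # e.g., 5 encoded category names or phrases
    scores = align(obj, txt)
    print(scores.shape)           # torch.Size([100, 5])
```

Because the two streams only interact through this final inner product, text embeddings for a fixed label set can be precomputed, which is consistent with the fast inference speed reported in the abstract.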
