Paper Title

X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks

Authors

Zhaowei Cai, Gukyeong Kwon, Avinash Ravichandran, Erhan Bas, Zhuowen Tu, Rahul Bhotika, Stefano Soatto

Abstract

In this paper, we study the challenging instance-wise vision-language tasks, where free-form language is required to align with objects instead of the whole image. To address these tasks, we propose X-DETR, whose architecture has three major components: an object detector, a language encoder, and vision-language alignment. The vision and language streams are independent until the end, and they are aligned using an efficient dot-product operation. The whole network is trained end-to-end, so that the detector is optimized for the vision-language tasks instead of being an off-the-shelf component. To overcome the limited size of paired object-language annotations, we leverage other weak types of supervision to expand the knowledge coverage. This simple yet effective architecture of X-DETR shows good accuracy and fast speed on multiple instance-wise vision-language tasks, e.g., 16.4 AP on LVIS detection of 1.2K categories at ~20 frames per second, without using any LVIS annotation during training.
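
The abstract describes keeping the vision and language streams independent until the end and aligning them with an efficient dot product. Below is a minimal, hypothetical PyTorch sketch of what such an alignment head could look like; the module name, feature dimensions, projection layers, and temperature are assumptions for illustration only, not the authors' implementation.

```python
# Illustrative sketch (not the X-DETR source code): object embeddings from a
# detector stream and text embeddings from a language encoder are projected
# into a shared space and scored with a dot product.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DotProductAlignment(nn.Module):
    """Scores each detected object against each language query."""

    def __init__(self, vis_dim=256, txt_dim=768, embed_dim=256, temperature=0.07):
        super().__init__()
        # Dimensions and temperature are assumed values for this sketch.
        self.vis_proj = nn.Linear(vis_dim, embed_dim)   # projects detector features
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # projects language features
        self.temperature = temperature

    def forward(self, object_feats, text_feats):
        # object_feats: (num_objects, vis_dim) from the detection stream
        # text_feats:   (num_queries, txt_dim) from the language encoder
        v = F.normalize(self.vis_proj(object_feats), dim=-1)
        t = F.normalize(self.txt_proj(text_feats), dim=-1)
        # (num_objects, num_queries) similarity matrix; higher = better match
        return v @ t.t() / self.temperature


if __name__ == "__main__":
    align = DotProductAlignment()
    obj = torch.randn(100, 256)   # e.g., 100 object queries from a DETR-style detector
    txt = torch.randn(5, 768)     # e.g., 5 encoded category names or phrases
    scores = align(obj, txt)
    print(scores.shape)           # torch.Size([100, 5])
```

Because the two streams only interact through this final inner product, text embeddings for a fixed label set can be precomputed, which is consistent with the fast inference speed reported in the abstract.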
