Generalized Decoding for Pixel, Image, and Language

Paper Title

Generalized Decoding for Pixel, Image, and Language

Authors

Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, Jianfeng Gao

Abstract

We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) finetuned performance that is better than or competitive with other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
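
The two-query interface described in the abstract can be pictured with a small toy sketch. This is not the authors' implementation: the real model uses learned transformer layers, whereas the code below uses a single untrained attention step on random vectors. All names (`toy_decode`, `cross_attend`, `generic_queries`, `text_queries`) are hypothetical, and the use of a dot product between decoded queries and pixel features to obtain mask logits is an assumption made for illustration. The only point it tries to show is that both query types pass through one decoder and yield a pixel-level output (masks) and a token-level output (embeddings) in the same vector space.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Replace each query by an attention-weighted average of the features."""
    dim = len(features[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, f) / math.sqrt(dim) for f in features])
        attended.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return attended


def toy_decode(pixel_features, generic_queries, text_queries):
    """Hypothetical stand-in for a decoder with two query types.

    Returns (mask_logits, semantic_embeddings):
      mask_logits[i][j]       score of query i at pixel j (pixel-level output)
      semantic_embeddings[i]  vector for query i (token-level output)
    """
    # Both query types are processed together by the same decoder step.
    queries = generic_queries + text_queries
    decoded = cross_attend(queries, pixel_features)
    # Assumption: masks come from comparing decoded queries with pixel features.
    mask_logits = [[dot(q, p) for p in pixel_features] for q in decoded]
    return mask_logits, decoded


# Random stand-in data: 6 "pixels", 3 generic queries, 2 text-derived queries.
rng = random.Random(0)
DIM = 4
pixels = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(6)]
generic = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
text = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(2)]

masks, embeddings = toy_decode(pixels, generic, text)
print(len(masks), "queries x", len(masks[0]), "pixels of mask logits")
print(len(embeddings), "embeddings of size", len(embeddings[0]))
```

Because every query, generic or text-derived, ends up as a vector in the same space as the pixel features, the same decoded vectors can in principle be compared with pixels (for segmentation) or with text embeddings (for language outputs), which is the sharing the abstract emphasizes.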
