Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation

Paper Title

Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation

Authors

Jiangyong Huang, William Yicheng Zhu, Baoxiong Jia, Zan Wang, Xiaojian Ma, Qing Li, Siyuan Huang

Abstract

Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts to create a general vision model are limited in the scope of assessed tasks and offer no overarching framework to perform them holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four functional domains: Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation. Along with the benchmark, we provide a general encoder-decoder framework to allow for the evaluation of arbitrary visual representations on all 11 tasks. We evaluate various pre-trained visual representations with our framework and observe that (1) Transformer-based visual backbones generally outperform CNN-based backbones on G-VUE, and (2) visual representations from vision-language pre-training are superior to those from vision-only pre-training across visual tasks. With G-VUE, we provide a holistic evaluation standard to motivate research toward building general-purpose visual systems via obtaining more general-purpose visual representations.
