Paper Title
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Paper Authors
Paper Abstract
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research. Models such as ViLBERT, LXMERT and UNITER have significantly lifted the state of the art across a wide range of V+L benchmarks with joint image-text pre-training. However, little is known about the inner mechanisms that underlie their impressive success. To reveal the secrets behind the scene of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings). Through extensive analysis of each archetypal model architecture via these probing tasks, our key observations are: (i) Pre-trained models exhibit a propensity for attending to text rather than images during inference. (ii) There exists a subset of attention heads that are tailored for capturing cross-modal interactions. (iii) The learned attention matrices in pre-trained models demonstrate patterns coherent with the latent alignment between image regions and textual words. (iv) Plotted attention patterns reveal visually interpretable relations among image regions. (v) Pure linguistic knowledge is also effectively encoded in the attention heads. These valuable insights can guide future work towards designing better model architectures and objectives for multimodal pre-training.
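The kind of attention-head probing behind observation (i) can be illustrated with a minimal sketch, not the authors' VALUE code: given one head's attention matrix from a single-stream model such as UNITER, measure how much attention mass queries place on text tokens versus image regions. The function name modality_attention_share, the assumed token layout (word tokens first, region features after), and the random weights standing in for real model outputs are illustrative assumptions.

```python
import numpy as np

def modality_attention_share(attn, num_text_tokens):
    """Split one head's attention mass into text-directed vs. image-directed parts.

    attn: (seq_len, seq_len) row-stochastic attention matrix for a single head,
          where the first `num_text_tokens` positions are word tokens and the
          remaining positions are image-region features (single-stream layout
          assumed here for illustration).
    Returns (text_share, image_share), each averaged over query positions.
    """
    text_share = attn[:, :num_text_tokens].sum(axis=1).mean()
    image_share = attn[:, num_text_tokens:].sum(axis=1).mean()
    return float(text_share), float(image_share)

# Toy example: random attention weights stand in for a real model's output.
rng = np.random.default_rng(0)
seq_len, num_text = 48, 12          # e.g., 12 word tokens + 36 region features
logits = rng.normal(size=(seq_len, seq_len))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row-wise softmax

text_share, image_share = modality_attention_share(attn, num_text)
print(f"text share: {text_share:.2f}  image share: {image_share:.2f}")
```

Averaging such shares over heads, layers, and evaluation examples is one simple way to quantify the reported tendency of pre-trained models to attend more to the text modality than to image regions.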