Paper Title

Beyond Language: Learning Commonsense from Images for Reasoning

Authors

Cui, Wanqing, Lan, Yanyan, Pang, Liang, Guo, Jiafeng, Cheng, Xueqi

Abstract

This paper proposes a novel approach to learn commonsense from images, instead of limited raw texts or costly constructed knowledge bases, for the commonsense reasoning problem in NLP. Our motivation comes from the fact that an image is worth a thousand words, where richer scene information could be leveraged to help distill the commonsense knowledge, which is often hidden in languages. Our approach, namely Loire, consists of two stages. In the first stage, a bi-modal sequence-to-sequence approach is utilized to conduct the scene layout generation task, based on a text representation model ViBERT. In this way, the required visual scene knowledge, such as spatial relations, will be encoded in ViBERT by the supervised learning process with some bi-modal data like COCO. Then ViBERT is concatenated with a pre-trained language model to perform the downstream commonsense reasoning tasks. Experimental results on two commonsense reasoning problems, i.e. commonsense question answering and pronoun resolution, demonstrate that Loire outperforms traditional language-based methods. We also give some case studies to show what knowledge is learned from images and explain how the generated scene layout helps the commonsense reasoning process.
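The abstract describes a two-stage architecture: ViBERT first acquires visual scene knowledge through a scene-layout-generation pretraining task, and its output is then concatenated with a pre-trained language model's representation to score answer choices in downstream reasoning tasks. As a minimal sketch of that fusion step only, the following pure-Python snippet concatenates a language-model vector with a ViBERT scene vector per answer choice and applies a linear scorer. All dimensions, the scorer, and the function names here are illustrative assumptions, not details from the paper.

```python
import random

random.seed(0)

# Hypothetical dimensions -- the abstract does not specify them.
LM_DIM, VIBERT_DIM, NUM_CHOICES = 8, 4, 5

def fuse_and_score(lm_vec, scene_vec, weights, bias):
    """Concatenate the LM representation with the ViBERT scene
    representation, then score the fused vector with a linear layer
    (a stand-in for the paper's downstream reasoning head)."""
    fused = lm_vec + scene_vec  # list concatenation = vector concat
    return sum(w * x for w, x in zip(weights, fused)) + bias

# Toy stand-ins for the two encoders' outputs, one pair per answer choice.
lm_reprs = [[random.gauss(0, 1) for _ in range(LM_DIM)]
            for _ in range(NUM_CHOICES)]
scene_reprs = [[random.gauss(0, 1) for _ in range(VIBERT_DIM)]
               for _ in range(NUM_CHOICES)]
weights = [random.gauss(0, 0.1) for _ in range(LM_DIM + VIBERT_DIM)]
bias = 0.0

scores = [fuse_and_score(lm, sc, weights, bias)
          for lm, sc in zip(lm_reprs, scene_reprs)]
prediction = max(range(NUM_CHOICES), key=lambda i: scores[i])
```

In a real implementation both vectors would come from trained encoders (e.g. a BERT-style `[CLS]` output), with the language model frozen or fine-tuned as the paper's experiments dictate.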
