Paper Title
Learning to Represent Image and Text with Denotation Graph
Paper Authors
Paper Abstract
Learning to fuse vision and language information and to represent them is an important research problem with many applications. Recent progress has leveraged the ideas of pre-training (from language modeling) and attention layers in Transformers to learn representations from datasets containing images aligned with linguistic expressions that describe them. In this paper, we propose learning representations from a set of implied, visually grounded expressions between image and text, automatically mined from those datasets. In particular, we use denotation graphs to represent how specific concepts (such as sentences describing images) can be linked to abstract and generic concepts (such as short phrases) that are also visually grounded. This type of generic-to-specific relation can be discovered with linguistic analysis tools. We propose methods to incorporate such relations into learning representations. We show that state-of-the-art multimodal learning models can be further improved by leveraging these automatically harvested structural relations. The resulting representations lead to stronger empirical results on the downstream tasks of cross-modal image retrieval, referring expressions, and compositional attribute-object recognition. Both our code and the extracted denotation graphs for the Flickr30K and COCO datasets are publicly available at https://sha-lab.github.io/DG.
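To make the denotation-graph idea in the abstract concrete, the following is a minimal sketch, not the authors' released implementation, of how generic-to-specific links and their image groundings could be assembled from image-caption pairs. All function names are hypothetical, and the bigram-based phrase extraction is only a stand-in for the linguistic analysis tools the paper refers to.

```python
# Toy denotation graph construction (illustrative sketch, assumptions noted above).
from collections import defaultdict


def extract_phrases(caption: str) -> list[str]:
    """Placeholder for linguistic abstraction: return shorter, more generic
    expressions implied by a caption (a real system would use a parser or
    noun-phrase chunker)."""
    words = caption.lower().rstrip(".").split()
    # Crude heuristic: treat every bigram as a candidate "generic phrase".
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]


def build_denotation_graph(pairs):
    """pairs: iterable of (image_id, caption).
    Returns:
      edges       - set of (generic_phrase, specific_caption) links
      denotations - expression -> set of image ids it is visually grounded in
    """
    edges = set()
    denotations = defaultdict(set)
    for image_id, caption in pairs:
        denotations[caption].add(image_id)
        for phrase in extract_phrases(caption):
            edges.add((phrase, caption))       # generic -> specific link
            denotations[phrase].add(image_id)  # phrase inherits the grounding
    return edges, denotations


if __name__ == "__main__":
    toy_pairs = [
        ("img1", "A brown dog runs on the beach."),
        ("img2", "A dog runs after a ball."),
    ]
    edges, denotations = build_denotation_graph(toy_pairs)
    print(denotations["dog runs"])  # the shared phrase is grounded in both images
```

In this toy example, the abstract phrase "dog runs" ends up linked to two different captions and inherits the union of their image groundings, which is the kind of structural relation the paper proposes to exploit as an additional training signal.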