Paper Title

LAMP: Label Augmented Multimodal Pretraining

Authors

Jia Guo, Chen Zhu, Yilun Zhao, Heda Wang, Yao Hu, Xiaofei He, Deng Cai

Abstract

Multimodal representation learning by pretraining has attracted increasing interest due to its ease of use and its potential benefits for various Visual-and-Language (V-L) tasks. However, its requirement for a large volume of high-quality vision-language pairs greatly limits its value in practice. In this paper, we propose a novel label-augmented V-L pretraining model, named LAMP, to address this problem. Specifically, we leverage auto-generated labels of visual objects to enrich vision-language pairs with fine-grained alignment, and we design a novel pretraining task accordingly. Moreover, we find that such label augmentation in second-stage pretraining further benefits a wide range of downstream tasks. To evaluate LAMP, we compare it with several state-of-the-art models on four downstream tasks. The quantitative results and analysis demonstrate the value of labels in V-L pretraining and the effectiveness of LAMP.
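
The abstract describes the core idea only at a high level. Below is a minimal sketch, assuming a generic off-the-shelf object detector, of how auto-generated object labels might be attached to an image-caption pair and partially masked to form a label-prediction pretraining target. The function names (detect_object_labels, build_pretraining_example) and the masking scheme are illustrative assumptions, not LAMP's published implementation.

```python
# A minimal sketch of label-augmented input construction, assuming a
# generic object detector. Function names and the masking scheme are
# illustrative assumptions, not LAMP's published implementation.
import random

MASK_TOKEN = "[MASK]"

def detect_object_labels(image_path: str) -> list[str]:
    """Stand-in for an off-the-shelf detector/tagger that auto-generates
    object labels for an image; returns fixed labels here for the demo."""
    return ["dog", "frisbee", "grass"]  # would be detector output in practice

def build_pretraining_example(image_path: str, caption: str,
                              mask_prob: float = 0.15) -> dict:
    """Attach detected object labels to an image-caption pair, then mask
    a fraction of the labels so the model must predict them, yielding a
    label-prediction pretraining signal with fine-grained alignment."""
    labels = detect_object_labels(image_path)
    masked_labels, targets = [], []
    for label in labels:
        if random.random() < mask_prob:
            masked_labels.append(MASK_TOKEN)
            targets.append(label)   # supervision: recover the masked label
        else:
            masked_labels.append(label)
            targets.append(None)    # no loss on unmasked positions
    return {
        "image": image_path,
        "text_tokens": caption.split(),
        "label_tokens": masked_labels,
        "label_targets": targets,
    }

if __name__ == "__main__":
    print(build_pretraining_example("dog.jpg", "a dog catches a frisbee"))
```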
