Paper Title


NLIP: Noise-robust Language-Image Pre-training

Paper Authors

Runhui Huang, Yanxin Long, Jianhua Han, Hang Xu, Xiwen Liang, Chunjing Xu, Xiaodan Liang

Paper Abstract


Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e.g., zero-shot classification, retrieval, and image captioning. However, their success relies heavily on the scale and quality of web-crawled data, which naturally contain incomplete and noisy information (e.g., wrong or irrelevant content). Existing works either design manual rules to clean data or generate pseudo-targets as auxiliary signals to reduce noise impact, which do not explicitly tackle both the incorrect and incomplete challenges simultaneously. In this paper, to automatically mitigate the impact of noise by solely mining over existing data, we propose a principled Noise-robust Language-Image Pre-training framework (NLIP) that stabilizes pre-training via two schemes: noise-harmonization and noise-completion. First, in the noise-harmonization scheme, NLIP estimates the noise probability of each pair according to the memorization effect of cross-modal transformers, then adopts noise-adaptive regularization to harmonize the cross-modal alignments to varying degrees. Second, in the noise-completion scheme, to enrich the missing object information of text, NLIP injects a concept-conditioned cross-modal decoder to obtain semantically consistent synthetic captions to complete noisy ones, using the visual concepts (i.e., objects' names) retrieved for the corresponding image to guide caption generation. By collaboratively optimizing the noise-harmonization and noise-completion schemes, our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way. Extensive experiments show significant performance improvements of our NLIP, using only 26M data, over existing pre-trained models (e.g., CLIP, FILIP and BLIP) on 12 zero-shot classification datasets, MSCOCO image captioning, and zero-shot image-text retrieval tasks.
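The core idea of the noise-harmonization scheme — estimating a per-pair noise probability from the memorization effect (clean pairs are fit earlier, so pairs that still incur high contrastive loss are more likely noisy) and softening their alignment accordingly — can be sketched roughly as follows. This is a toy illustration under assumed details, not the paper's actual formulation: the min-max normalization of per-pair losses as a noise-probability estimate, the exponential downweighting, and the function name `noise_adaptive_weights` are all hypothetical choices made here for clarity.

```python
import numpy as np

def noise_adaptive_weights(pair_losses, temperature=1.0):
    """Toy sketch of noise-adaptive regularization (not the paper's exact method).

    Per the memorization effect, image-text pairs with persistently high
    contrastive loss are treated as likely noisy. We map each pair's loss to a
    crude noise probability in [0, 1] and return a weight that downweights
    likely-noisy pairs, so their cross-modal alignment is enforced less strictly.
    """
    losses = np.asarray(pair_losses, dtype=float)
    # Crude noise-probability estimate: min-max normalize the per-pair losses.
    lo, hi = losses.min(), losses.max()
    noise_prob = (losses - lo) / (hi - lo + 1e-8)
    # Soften alignment for noisy pairs: weight decays as noise probability grows.
    return np.exp(-noise_prob / temperature)
```

In a training loop, these weights would multiply each pair's contrastive-loss term, so the cleanest pair keeps full weight while suspected-noisy pairs contribute less to the gradient.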
