Paper Title
CCMB: A Large-scale Chinese Cross-modal Benchmark
Paper Authors
Paper Abstract
Vision-language pre-training (VLP) on large-scale datasets has shown strong performance on various downstream tasks. In contrast to the many available benchmarks with English corpora, large-scale pre-training datasets and downstream datasets with Chinese corpora remain largely unexplored. In this work, we build a large-scale, high-quality Chinese Cross-Modal Benchmark named CCMB for the research community, which contains the currently largest public pre-training dataset, Zero, and five human-annotated fine-tuning datasets for downstream tasks. Zero contains 250 million images paired with 750 million text descriptions, and two of the five fine-tuning datasets are also currently the largest of their kind for Chinese cross-modal downstream tasks. Along with CCMB, we also develop a VLP framework named R2D2, applying a pre-Ranking + Ranking strategy to learn powerful vision-language representations and a two-way distillation method (i.e., target-guided Distillation and feature-guided Distillation) to further enhance the learning capability. With Zero and the R2D2 VLP framework, we achieve state-of-the-art performance on twelve downstream datasets across five broad categories of tasks: image-text retrieval, image-text matching, image captioning, text-to-image generation, and zero-shot image classification. The datasets, models, and code are available at https://github.com/yuxie11/R2D2
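The abstract names a two-way distillation method but gives no formulas. As a rough illustration only (the paper's exact losses, weights, and temperatures are not stated here, so everything below is an assumption), target-guided distillation is commonly realized as a KL divergence between temperature-softened teacher and student logits, and feature-guided distillation as a mean squared error between intermediate feature vectors:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of raw logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def target_guided_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions -- a common
    formulation of target-guided distillation (assumed, not from the paper)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def feature_guided_loss(student_feat, teacher_feat):
    """Mean squared error between feature vectors -- a common formulation
    of feature-guided distillation (assumed, not from the paper)."""
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)

# Illustrative combined loss; the 0.5 weight is arbitrary.
loss = target_guided_loss([2.0, 0.5, -1.0], [1.8, 0.7, -0.9]) \
     + 0.5 * feature_guided_loss([0.2, -0.1, 0.4], [0.25, -0.05, 0.35])
```

Both terms vanish when student and teacher agree exactly, so minimizing the sum pushes the student's predictions and internal features toward the teacher's.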