Paper Title
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
Paper Authors
Paper Abstract
Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of pre-trained cross-modal datasets. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, which contains 100 million Chinese image-text pairs collected from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and also apply advanced pre-training techniques to VLP, such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction. Extensive experiments and benchmarks on different downstream tasks, including the largest human-verified image-text test dataset to date, are also provided. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods. For the zero-shot image classification task on 10 datasets, $Wukong_{ViT-L}$ achieves an average accuracy of 73.03%. For the image-text retrieval task, it achieves a mean recall of 71.6% on AIC-ICC, which is 12.9% higher than WenLan 2.0. Our Wukong models are also benchmarked against other variants on multiple downstream datasets, e.g., Flickr8K-CN, Flickr-30K-CN, COCO-CN, etc. More information is available at: https://wukong-dataset.github.io/wukong-dataset/.
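The token-wise similarity mentioned in the abstract replaces the single global image/text dot product of standard contrastive learning with fine-grained patch-to-word matching. Below is a minimal sketch of that idea, not the authors' implementation: it assumes L2-normalized token embeddings, a text padding mask, and illustrative names such as `token_wise_similarity`; the text-to-image direction simply reuses the transposed matrix for brevity.

```python
# Minimal sketch of token-wise similarity for image-text contrastive learning.
# Assumptions (not from the paper): embeddings are already L2-normalized,
# txt_mask marks real (non-padding) text tokens, and both loss directions
# share one similarity matrix.
import torch
import torch.nn.functional as F


def token_wise_similarity(img_tokens, txt_tokens, txt_mask):
    """img_tokens: (B, P, D) patch embeddings, L2-normalized
       txt_tokens: (B, T, D) word embeddings, L2-normalized
       txt_mask:   (B, T) 1 for real tokens, 0 for padding
       returns:    (B, B) image-to-text similarity matrix"""
    # Pairwise token similarities for every (image i, text j) pair in the batch.
    sim = torch.einsum('ipd,jtd->ijpt', img_tokens, txt_tokens)      # (B, B, P, T)
    # Exclude padded text tokens before taking the per-patch maximum.
    sim = sim.masked_fill(txt_mask[None, :, None, :] == 0, float('-inf'))
    # Each image patch keeps its best-matching word; average over patches.
    return sim.max(dim=-1).values.mean(dim=-1)                       # (B, B)


def contrastive_loss(img_tokens, txt_tokens, txt_mask, temperature=0.07):
    logits = token_wise_similarity(img_tokens, txt_tokens, txt_mask) / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: matched pairs lie on the diagonal of the batch matrix.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Locked-image text tuning, also listed among the techniques, would fit the same training loop by freezing the image encoder's parameters and optimizing only the text side; that detail is omitted here to keep the sketch short.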