Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP

Paper Title

Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP

Authors

Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, Ludwig Schmidt

Abstract

Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources - YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock - to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design. Code is available at https://github.com/mlfoundations/clip_quality_not_quantity.

Paper Title