Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP

by Thao Nguyen, et al.

Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image Pre-training) and Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources (YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock) to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall, our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design.
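The CLIP-based data filtering mentioned above can be sketched as a cosine-similarity threshold over precomputed image and text embeddings. This is a minimal illustration, not the authors' code; the embeddings are assumed to be precomputed CLIP features, and LAION reportedly discarded pairs with a similarity below roughly 0.3:

```python
import numpy as np

def clip_score_filter(image_embs, text_embs, threshold=0.3):
    """Keep image-text pairs whose embedding cosine similarity exceeds threshold.

    image_embs, text_embs: (N, D) arrays of precomputed CLIP features
    (hypothetical inputs for this sketch). Returns a boolean keep-mask
    and the per-pair similarity scores.
    """
    # Normalize rows to unit length so dot products are cosine similarities.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = np.sum(img * txt, axis=1)  # per-pair cosine similarity
    return scores > threshold, scores
```

A pair whose caption embedding points in a different direction from its image embedding scores low and is dropped, which is one candidate mechanism for why filtered web data behaves more like the curated sources studied in the paper.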




