Improved baselines for vision-language pre-training

by Enrico Fini, et al.

Contrastive learning has emerged as an efficient framework for learning multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using a contrastive loss. Recent work claims improvements over CLIP using additional non-contrastive losses inspired by self-supervised learning. However, it is sometimes hard to disentangle the contribution of these additional losses from other implementation details, e.g., data augmentation or regularization techniques, used to train the model. To shed light on this matter, in this paper, we first propose, implement and evaluate several baselines obtained by combining contrastive learning with recent advances in self-supervised learning. In particular, we use the loss functions that were proven successful for visual self-supervised learning to align image and text modalities. We find that these baselines outperform a basic implementation of CLIP. However, when a stronger training recipe is employed, the advantage disappears. Indeed, we find that a simple CLIP baseline can also be improved substantially, up to a 25% relative improvement on downstream zero-shot tasks, by using well-known training techniques that are popular in other subfields. Moreover, we discover that it is enough to apply image and text augmentations to make up for most of the improvement attained by prior works. With our improved training recipe for CLIP, we obtain state-of-the-art performance on four standard datasets, and consistently outperform prior work (up to +4%), while being substantially simpler.
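The contrastive objective the abstract refers to can be sketched as follows. This is a minimal NumPy illustration of a CLIP-style symmetric contrastive (InfoNCE) loss, not the authors' actual implementation; the function name, temperature value, and batch setup are assumptions for the example.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss for a batch of paired embeddings.

    image_emb, text_emb: (N, D) arrays where row i of each is a matched pair.
    Illustrative sketch only; temperature=0.07 is an assumed default.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix: entry (i, j) scores image i against text j.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_diagonal(l):
        # Cross-entropy with the matched pair (the diagonal) as the target.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits.T))
```

In practice the embeddings come from an image encoder and a text encoder trained jointly; the loss pulls matched image-text pairs together and pushes mismatched pairs within the batch apart.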
