Boosting Visual-Language Models by Exploiting Hard Samples

by Haonan Wang, et al.

Large vision and language models, such as Contrastive Language-Image Pre-training (CLIP), are rapidly becoming the industry norm for matching images and texts. To improve their zero-shot recognition performance, current research either adds web-crawled image-text pairs or designs new training losses. However, the costs of data collection and of training from scratch substantially hinder the deployment of these approaches. In this paper, we present HELIP, a low-cost strategy for boosting the performance of well-trained CLIP models by fine-tuning them with hard samples drawn from the original training data. Hard examples are mixed into each batch, and the well-trained CLIP model is then fine-tuned using the conventional contrastive alignment objective together with a margin loss that distinguishes between normal and hard negative data. HELIP can be applied to existing models in a plug-and-play fashion. On a comprehensive zero-shot and retrieval benchmark, without training the model from scratch or using additional data, HELIP consistently boosts existing models to achieve leading performance. In particular, HELIP boosts the ImageNet zero-shot accuracy of SLIP by 3.05 and 4.47 when pretrained on CC3M and CC12M, respectively. In addition, a systematic evaluation of zero-shot and linear probing experiments across fine-grained classification datasets demonstrates a consistent performance improvement and validates the efficacy of HELIP. When pretraining on CC3M, HELIP boosts the zero-shot performance of CLIP and SLIP by 8.4% and 18.6% on average, respectively, and linear probe performance by 9.5% and 3.0% on average, respectively.
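To make the training objective described in the abstract more concrete, the following is a minimal sketch in plain Python of a symmetric contrastive loss combined with a hinge-style margin term. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `hard_mask` input, the temperature of 0.07, the margin of 0.1, the `weight` parameter, and the exact form of the margin term (each hard negative is asked to score at least a margin above each normal negative for the same anchor) are all hypothetical choices, since the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_matrix(image_embs, text_embs):
    """sim[i][j] = cosine similarity of image i and text j."""
    return [[cosine(i, t) for t in text_embs] for i in image_embs]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def contrastive_loss(sim, temperature=0.07):
    """Symmetric (image-to-text and text-to-image) contrastive loss."""
    logits = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))


def hard_negative_margin_loss(sim, hard_mask, margin=0.1):
    """Assumed hinge term: for anchor i, every hard negative h should score
    at least `margin` above every normal negative n, i.e. the penalty is
    max(0, margin + sim[i][n] - sim[i][h]), averaged over all such triples."""
    total, count = 0.0, 0
    n = len(sim)
    for i in range(n):
        hard = [j for j in range(n) if j != i and hard_mask[i][j]]
        normal = [j for j in range(n) if j != i and not hard_mask[i][j]]
        for h in hard:
            for j in normal:
                total += max(0.0, margin + sim[i][j] - sim[i][h])
                count += 1
    return total / count if count else 0.0


def total_loss(image_embs, text_embs, hard_mask,
               temperature=0.07, margin=0.1, weight=1.0):
    """Contrastive loss plus a weighted margin term (weight is an assumption)."""
    sim = similarity_matrix(image_embs, text_embs)
    return (contrastive_loss(sim, temperature)
            + weight * hard_negative_margin_loss(sim, hard_mask, margin))
```

Here `hard_mask[i][j]` is `True` when pair `j` has been marked as a hard negative for anchor `i`; how such pairs are mined from the original training data is outside the scope of this sketch.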




Understanding Zero-Shot Adversarial Robustness for Large-Scale Models

Pretrained large-scale vision-language models like CLIP have exhibited s...

Generative Negative Text Replay for Continual Vision-Language Pretraining

Vision-language pre-training (VLP) has attracted increasing attention re...

CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention

Contrastive Language-Image Pre-training (CLIP) has been shown to learn v...

Three Towers: Flexible Contrastive Learning with Pretrained Image Models

We introduce Three Towers (3T), a flexible method to improve the contras...

Curriculum Learning for Data-Efficient Vision-Language Alignment

Aligning image and text encoders from scratch using contrastive learning...

Exploring Vision-Language Models for Imbalanced Learning

Vision-Language models (VLMs) that use contrastive language-image pre-tr...

CLIP-ReIdent: Contrastive Training for Player Re-Identification

Sports analytics benefits from recent advances in machine learning provi...
