GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training

by   Xinchi Deng, et al.

Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks, benefiting from massive image-text pairs collected from the Internet. In practice, online data are growing constantly, highlighting the importance of the ability of pre-trained model to learn from data that is continuously growing. Existing works on cross-modal pre-training mainly focus on training a network with fixed architecture. However, it is impractical to limit the model capacity when considering the continuously growing nature of pre-training data in real-world applications. On the other hand, it is important to utilize the knowledge in the current model to obtain efficient training and better performance. To address the above issues, in this paper, we propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input. Specially, we adopt a dynamic growth space and seek out the optimal architecture at each growth step to adapt to online learning scenarios. And the shared encoder is proposed in our growth space to enhance the degree of cross-modal fusion. Besides, we explore the effect of growth in different dimensions, which could provide future references for the design of cross-modal model architecture. Finally, we employ parameter inheriting with momentum (PIM) to maintain the previous knowledge and address the issue of the local minimum dilemma. Compared with the existing methods, GrowCLIP improves 2.3 top-1 accuracy on zero-shot image classification of 9 downstream tasks. As for zero-shot image retrieval, GrowCLIP can improve 1.2 recall on Flickr30K dataset.


page 1

page 2

page 3

page 4


Cross-Modal Retrieval Meets Inference:Improving Zero-Shot Classification with Cross-Modal Retrieval

Contrastive language-image pre-training (CLIP) has demonstrated remarkab...

Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework

Vision-language pre-training (VLP) relying on large-scale pre-training d...

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Large-scale vision-language pre-training has achieved promising results ...

CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention

Contrastive Language-Image Pre-training (CLIP) has been shown to learn v...

NLIP: Noise-robust Language-Image Pre-training

Large-scale cross-modal pre-training paradigms have recently shown ubiqu...

Retrieval-Enhanced Contrastive Vision-Text Models

Contrastive image-text models such as CLIP form the building blocks of m...

Contrastive Language, Action, and State Pre-training for Robot Learning

In this paper, we introduce a method for unifying language, action, and ...

Please sign up or login with your details

Forgot password? Click here to reset