iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition

by Yixuan Wei et al.

Image classification, which classifies images into pre-defined categories, has been the dominant approach to visual representation learning over the last decade. Visual learning through image-text alignment, however, has emerged with promising performance, especially for zero-shot recognition. We believe these two learning tasks are complementary and suggest combining them for better visual learning. We propose a deep fusion method with three adaptations that effectively bridge the two learning tasks, rather than shallow fusion through naive multi-task learning. First, we replace the linear classifier, the previous common practice in image classification, with a cosine classifier, which shows comparable performance. Second, we convert the image classification problem from learning parametric category classifier weights to learning a text encoder as a meta network that generates the category classifier weights. The learnt text encoder is shared between image classification and image-text alignment. Third, we enrich each class name with a description to avoid confusion between classes and to bring the classification task closer to image-text alignment. We show that this deep fusion approach performs better than individual learning or shallow fusion approaches on a variety of visual recognition tasks and setups, from zero-shot/few-shot image classification, such as the Kornblith 12-dataset benchmark, to downstream tasks of action recognition, semantic segmentation, and object detection in fine-tuning and open-vocabulary settings. The code will be available at https://github.com/weiyx16/iCAR.
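The first two adaptations can be illustrated together: a cosine classifier computes logits as scaled cosine similarities between L2-normalized image features and per-class weight vectors, and those weight vectors are produced by a text encoder rather than learned as free parameters. The sketch below is a minimal, hypothetical numpy illustration of that scoring rule; the random "text weights" stand in for real text-encoder outputs, and the temperature value `tau` is an assumption, not the paper's setting.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit L2 norm along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cosine_classifier_logits(image_features, class_weights, tau=0.07):
    # Cosine classifier: logits are cosine similarities between
    # normalized image features and per-class weight vectors,
    # scaled by a temperature tau (value assumed for illustration).
    img = l2_normalize(image_features)
    w = l2_normalize(class_weights)
    return img @ w.T / tau

# Hypothetical stand-ins: in the paper's setup the class weights would come
# from a text encoder applied to enriched class descriptions, shared with
# the image-text alignment branch.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 512))    # batch of 4 image embeddings
text_weights = rng.normal(size=(10, 512))  # 10 classes, text-encoded

logits = cosine_classifier_logits(image_feats, text_weights)
preds = logits.argmax(axis=1)  # predicted class index per image
```

Because both image features and class weights are normalized, the unscaled similarities are bounded in [-1, 1], which keeps logit magnitudes comparable across classes and makes the same text encoder usable for both the classification and alignment objectives.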




