LPN: Language-guided Prototypical Network for few-shot classification

by Kaihui Cheng, et al.

Few-shot classification aims to adapt to new tasks with only a limited number of labeled examples. To make full use of the accessible data, recent methods explore suitable similarity measures between query and support images and learn better high-dimensional features through meta-training and pre-training strategies. However, the potential of multi-modal information has barely been explored, although it may bring promising improvements to few-shot classification. In this paper, we propose a Language-guided Prototypical Network (LPN) for few-shot classification, which leverages the complementarity of the vision and language modalities via two parallel branches. Concretely, to introduce the language modality into a visual task with limited samples, we use a pre-trained text encoder to extract class-level text features directly from class names, while processing images with a conventional image encoder. A language-guided decoder then obtains text features corresponding to each image by aligning the class-level features with the visual features. In addition, to take advantage of both class-level features and prototypes, we build a refined prototypical head that generates robust prototypes in the text branch for subsequent measurement. Finally, we aggregate the visual and text logits to calibrate the deviation of either single modality. Extensive experiments demonstrate the competitiveness of LPN against state-of-the-art methods on benchmark datasets.
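The two-branch idea described above can be sketched in a few lines: a visual branch scores queries by distance to class prototypes (mean support features per class), a text branch scores them by similarity to class-level text features, and the two sets of logits are aggregated. This is a minimal illustrative sketch, not the authors' implementation: the feature extractors, the language-guided decoder, and the refined prototypical head are abstracted away as precomputed feature arrays, and the mixing weight `alpha` is a hypothetical parameter introduced here for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def class_prototypes(support_feats, support_labels, n_way):
    """Mean support feature per class, as in a prototypical network."""
    return np.stack(
        [support_feats[support_labels == c].mean(axis=0) for c in range(n_way)]
    )

def lpn_logits(query_feats, support_feats, support_labels,
               class_text_feats, alpha=0.5):
    """Aggregate visual and text logits (sketch of the dual-branch fusion).

    query_feats:      (n_query, d) visual features of query images
    support_feats:    (n_support, d) visual features of support images
    support_labels:   (n_support,) integer class labels in [0, n_way)
    class_text_feats: (n_way, d) class-level text features from class names
    alpha:            hypothetical mixing weight between the two branches
    """
    n_way = class_text_feats.shape[0]
    # Visual branch: negative squared Euclidean distance to visual prototypes.
    protos = class_prototypes(support_feats, support_labels, n_way)
    vis_logits = -((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    # Text branch: cosine similarity to class-level text features.
    txt_logits = l2_normalize(query_feats) @ l2_normalize(class_text_feats).T
    # Aggregate the two modalities to calibrate single-modality deviation.
    return alpha * vis_logits + (1 - alpha) * txt_logits
```

In practice the visual features would come from the image encoder and the class-level text features from the pre-trained text encoder, with the language-guided decoder aligning the two spaces before this fusion step.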

