Efficient Feature Distillation for Zero-shot Detection

by Zhuoming Liu, et al.

Large-scale vision-language models (e.g., CLIP) have been leveraged by various methods to detect unseen objects. However, most of these works require additional captions or images for training, which is not feasible in the zero-shot detection setting. Distillation-based methods, in contrast, require no extra data, but they have their own limitations. Specifically, existing work creates distillation regions that are biased toward the base categories, which limits the distillation of novel-category information and harms distillation efficiency. Furthermore, directly using raw CLIP features for distillation neglects the domain gap between CLIP's training data and detection datasets, making it difficult to learn the mapping from an image region to the vision-language feature space, an essential component for detecting unseen objects. As a result, existing distillation-based methods require an excessively long training schedule. To solve these problems, we propose Efficient feature distillation for Zero-Shot Detection (EZSD). First, EZSD adapts CLIP's feature space to the target detection domain by re-normalizing CLIP to bridge the domain gap. Second, EZSD uses CLIP to generate distillation proposals that potentially contain novel instances, so that distillation is not overly biased toward the base categories. Finally, EZSD exploits semantic meaning in the regression branch to further improve model performance. As a result, EZSD achieves state-of-the-art performance on the COCO zero-shot benchmark with a much shorter training schedule and outperforms previous work by 4% in the LVIS setting with 1/10 of the training time.
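The core idea shared by this family of methods is to distill CLIP's image embeddings into the detector's region embeddings over a set of proposals. A minimal sketch of such a distillation loss is shown below; this is an illustrative assumption, not the authors' implementation, and the function name, shapes, and choice of an L1 objective over L2-normalized embeddings are hypothetical:

```python
import numpy as np

def l1_distillation_loss(region_embeddings: np.ndarray,
                         clip_embeddings: np.ndarray) -> float:
    """Mean L1 distance between L2-normalized detector region embeddings
    (the student) and CLIP image embeddings (the teacher) computed for the
    same set of proposals. Both arrays: (num_proposals, embed_dim).
    """
    def l2_normalize(x: np.ndarray) -> np.ndarray:
        # Project each embedding onto the unit sphere, as CLIP-style
        # contrastive models compare normalized features.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    student = l2_normalize(region_embeddings)
    teacher = l2_normalize(clip_embeddings)
    return float(np.abs(student - teacher).mean())

# Toy example: 4 proposals with 8-dimensional embeddings.
rng = np.random.default_rng(0)
student_feats = rng.normal(size=(4, 8))
teacher_feats = rng.normal(size=(4, 8))
loss = l1_distillation_loss(student_feats, teacher_feats)
```

Minimizing such a loss pulls the detector's region features into CLIP's vision-language space, which is what later allows classification of novel categories by comparing region embeddings against text embeddings of category names.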




