Zero-Shot Detection via Vision and Language Knowledge Distillation

by Xiuye Gu, et al.

Zero-shot image classification has made promising progress by training aligned image and text encoders. The goal of this work is to advance zero-shot object detection, which aims to detect novel objects without bounding box or mask annotations. We propose ViLD, a training method via Vision and Language knowledge Distillation. We distill the knowledge from a pre-trained zero-shot image classification model (e.g., CLIP) into a two-stage detector (e.g., Mask R-CNN). Our method aligns the region embeddings in the detector to the text and image embeddings inferred by the pre-trained model. We use the text embeddings, obtained by feeding category names into the pre-trained text encoder, as the detection classifier. We then minimize the distance between the region embeddings and the image embeddings, which are obtained by feeding region proposals into the pre-trained image encoder. During inference, we add the text embeddings of novel categories to the detection classifier for zero-shot detection. We benchmark the performance on the LVIS dataset by holding out all rare categories as novel categories. ViLD obtains 16.1 mask AP_r with a Mask R-CNN (ResNet-50 FPN) for zero-shot detection, outperforming the supervised counterpart by 3.8. The model can directly transfer to other datasets, achieving 72.2 AP_50, 36.6 AP and 11.8 AP on PASCAL VOC, COCO and Objects365, respectively.
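The two ingredients described in the abstract, text embeddings used as the classifier and a distance loss pulling region embeddings toward image embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 3-dimensional embeddings, the function names, the temperature value, and the choice of an L1 distance on normalized vectors are all assumptions made for illustration. In practice the embeddings would come from the pre-trained text and image encoders and from the detector's region head.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(region_emb, text_embs, temperature=0.01):
    """Cosine similarity between one region embedding and each text embedding,
    divided by a temperature (the value 0.01 is an assumption)."""
    r = l2_normalize(region_emb)
    return [
        sum(a * b for a, b in zip(r, l2_normalize(t))) / temperature
        for t in text_embs
    ]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(region_emb, image_emb):
    """Mean absolute (L1) distance between normalized embeddings.
    The abstract only says 'distance'; L1 is an assumption here."""
    r = l2_normalize(region_emb)
    i = l2_normalize(image_emb)
    return sum(abs(a - b) for a, b in zip(r, i)) / len(r)

def classify(region_emb, text_embs, names):
    """Pick the category whose text embedding best matches the region."""
    probs = softmax(cosine_logits(region_emb, text_embs))
    best = max(range(len(names)), key=probs.__getitem__)
    return names[best], probs[best]

# Hypothetical toy embeddings standing in for text-encoder outputs.
base_names = ["cat", "dog"]
base_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
novel_names = ["zebra"]
novel_text = [[0.0, 0.0, 1.0]]

# Hypothetical region embedding produced by the detector.
region = [0.1, 0.2, 0.9]

# At inference, novel-category text embeddings are appended to the classifier.
label, prob = classify(region, base_text + novel_text, base_names + novel_names)
print(label)
```

The point of the sketch is that extending the detector to a new category only requires appending one more text embedding to the classifier list; no detector weights change at inference time.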




ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language Knowledge Distillation

Real-world object sampling produces long-tailed distributions requiring ...

Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation

Open-vocabulary object detection aims to detect novel object categories ...

Efficient Feature Distillation for Zero-shot Detection

The large-scale vision-language models (e.g., CLIP) are leveraged by dif...

Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels

In this paper, we investigate the task of zero-shot human-object interac...

TIER: Text-Image Entropy Regularization for CLIP-style models

In this paper, we study the effect of a novel regularization scheme on c...

DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature

The fluency and factual knowledge of large language models (LLMs) height...

GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning

A vision-language foundation model pretrained on very large-scale image-...
