Contrastive Feature Masking Open-Vocabulary Vision Transformer

09/02/2023
by Dahun Kim, et al.

We present the Contrastive Feature Masking Vision Transformer (CFM-ViT), an image-text pretraining methodology that jointly learns image- and region-level representations for open-vocabulary object detection (OVD). Our approach incorporates the masked autoencoder (MAE) objective into the contrastive learning objective to improve the representation for localization tasks. Unlike standard MAE, which reconstructs in pixel space, we perform reconstruction in the joint image-text embedding space, which encourages the model to learn region-level semantics. Moreover, we introduce Positional Embedding Dropout (PED), which addresses the scale variation between image-text pretraining and detection finetuning by randomly dropping positional embeddings during pretraining. PED improves detection performance and enables the use of a frozen ViT backbone as a region classifier, preventing the forgetting of open-vocabulary knowledge during detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 APr (AP on rare categories), surpassing the best existing approach by 7.6 points, and achieves better zero-shot detection transfer. Finally, CFM-ViT acquires a strong image-level representation, outperforming the state of the art on 8 of 12 metrics on zero-shot image-text retrieval benchmarks.
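To make the reconstruction-in-embedding-space idea concrete, here is a minimal PyTorch-style sketch. The abstract only states that masked tokens are reconstructed in the joint image-text embedding space rather than in pixel space, so the function name, tensor shapes, choice of targets from an unmasked forward pass, and the cosine-distance loss form below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def masked_embedding_reconstruction_loss(pred_embed: torch.Tensor,
                                         target_embed: torch.Tensor,
                                         mask: torch.Tensor) -> torch.Tensor:
    """Masked-reconstruction loss in the joint image-text embedding space.

    pred_embed:   (B, N, D) decoder predictions for all patch tokens,
                  projected into the image-text embedding space.
    target_embed: (B, N, D) embeddings of the same tokens from a forward
                  pass on the unmasked image, used as targets instead of
                  raw pixels (an assumption; the paper may define targets
                  differently).
    mask:         (B, N) with 1 where a token was masked, 0 elsewhere.
    """
    pred = F.normalize(pred_embed, dim=-1)
    target = F.normalize(target_embed, dim=-1).detach()
    # Cosine-distance reconstruction error, averaged over masked tokens only.
    per_token = 1.0 - (pred * target).sum(dim=-1)   # (B, N)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```

In a CFM-ViT-style setup, a loss like this would be added to the usual image-text contrastive loss during pretraining.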
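PED itself is simple to express in code. Below is a minimal sketch under stated assumptions: the abstract only says positional embeddings are randomly dropped during pretraining, so the per-batch dropping granularity, the `drop_prob` default, and the `PositionalEmbeddingDropout` module name are all illustrative choices.

```python
import torch
import torch.nn as nn

class PositionalEmbeddingDropout(nn.Module):
    """Hypothetical sketch of Positional Embedding Dropout (PED)."""

    def __init__(self, num_tokens: int, dim: int, drop_prob: float = 0.5):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.drop_prob = drop_prob

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # During pretraining, skip the positional embeddings for the whole
        # batch with probability `drop_prob` (per-batch granularity is an
        # assumption); always add them at eval / finetuning time.
        if self.training and torch.rand(()).item() < self.drop_prob:
            return tokens
        return tokens + self.pos_embed
```

Training without positional embeddings part of the time plausibly makes the backbone less sensitive to the resolution change between pretraining and detection finetuning, which is the scale-variation issue the abstract describes.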

Related research

05/11/2023
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a...

08/04/2023
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
Open-vocabulary segmentation is a challenging task requiring segmenting ...

03/23/2023
Three ways to improve feature alignment for open vocabulary detection
The core problem in zero-shot open vocabulary detection is how to align ...

04/12/2023
RECLIP: Resource-efficient CLIP by Training with Small Images
We present RECLIP (Resource-efficient CLIP), a simple method that minimi...

11/23/2022
Open-vocabulary Attribute Detection
Vision-language modeling has enabled open-vocabulary tasks where predict...

12/21/2021
Supervised Graph Contrastive Pretraining for Text Classification
Contrastive pretraining techniques for text classification have been larg...

08/25/2022
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
This paper presents a simple yet effective framework MaskCLIP, which inc...
