Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding

by Jiantao Wu et al.

Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In computer vision, pretrained vision transformers (ViTs) have played a pivotal role in advancing transfer learning. Nonetheless, the escalating cost of finetuning these ever-larger models has become a challenge. This study evaluates the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks without any finetuning, with the intention of emulating human-like generalisation and recognition of unseen objects. To this end, we propose an evaluation protocol for zero-shot segmentation based on a prompting patch. Given a point on the target object as a prompt, the algorithm computes the similarity map between the selected patch and all other patches, after which simple thresholding is applied to segment the target. A second evaluation measures intra-object and inter-object similarity to gauge the discriminatory ability of SSP ViTs. Insights from prompting-based zero-shot segmentation and the discriminatory abilities of SSP led to the design of a simple SSP approach, termed MMC. This approach combines Masked image modelling to encourage similarity of local features, Momentum-based self-distillation to transfer semantics from global to local features, and global Contrast to promote semantics of global features, thereby enhancing the discriminative representations of SSP ViTs. Consequently, our proposed method significantly reduces the overlap between intra-object and inter-object similarities, facilitating effective object segmentation within an image. Our experiments show that MMC delivers top-tier results in zero-shot semantic segmentation across various datasets.
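The prompting-based protocol described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `prompt_segment`, the use of cosine similarity, and the threshold value of 0.5 are assumptions made for the example.

```python
import numpy as np

def prompt_segment(patch_feats, prompt_idx, threshold=0.5):
    """Zero-shot segmentation from a single prompting patch.

    patch_feats: (N, D) array of ViT patch embeddings.
    prompt_idx: index of the patch containing the user-clicked point.
    Returns a boolean mask over the N patches.
    """
    # Normalise so dot products become cosine similarities
    # (cosine similarity is an assumption of this sketch).
    feats = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    # Similarity map between the prompted patch and every patch.
    sim = feats @ feats[prompt_idx]
    # Simple thresholding segments the target object.
    return sim >= threshold

# Toy example: two well-separated feature clusters standing in for
# "object" and "background" patches.
obj = np.tile([1.0, 0.0], (4, 1))   # 4 object patches
bg = np.tile([0.0, 1.0], (4, 1))    # 4 background patches
feats = np.vstack([obj, bg])
mask = prompt_segment(feats, prompt_idx=0)
print(mask)  # first four patches True, last four False
```

In a real setting, `patch_feats` would come from the final layer of an SSP ViT, and the boolean patch mask would be upsampled back to pixel resolution to produce the segmentation.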




Related papers:

- Visual Representation Learning with Self-Supervised Attention for Low-Label High-data Regime
- Learning Dense Object Descriptors from Multiple Views for Low-shot Category Generalization
- Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction
- What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation
- Recursive Training for Zero-Shot Semantic Segmentation
- One-Shot Transfer of Affordance Regions? AffCorrs!
- Exploring the Versatility of Zero-Shot CLIP for Interstitial Lung Disease Classification
