Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs

by Junbum Cha, et al.

We tackle open-world semantic segmentation, which aims to learn to segment arbitrary visual concepts in images using only image-text pairs, without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy: they consider only image-text alignment during training, whereas segmentation requires region-text alignment at test time. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that addresses this discrepancy by directly aligning a text with the region it describes. Our method generates a segmentation mask associated with a given text, extracts a grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment instead of image-text alignment, the framework encourages the model to directly improve the quality of the generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol with eight widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance with large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.



