TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

04/15/2023
by Jingyao Li, et al.

The recent success of Contrastive Language-Image Pre-training (CLIP) has shown great promise for pixel-level open-vocabulary learning tasks. A common paradigm uses CLIP's text and patch embeddings to generate semantic masks. However, existing models easily misidentify input pixels from unseen classes, confusing novel classes with semantically similar ones. In our work, we disentangle this ill-posed optimization problem into two parallel processes: one performs semantic matching for each pixel individually, and the other judges the reliability of that match to improve discrimination ability. Motivated by the special tokens in language modeling that represent sentence-level embeddings, we design a trusty token that decouples the prediction tendencies for known and novel categories. With almost no extra overhead, we effectively upgrade the pixel-level generalization capacity of existing models. Our TagCLIP (CLIP adapting with Trusty-guidance) boosts the IoU of unseen classes by 7.4%.
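To make the decoupling concrete, below is a minimal PyTorch sketch of the two parallel processes the abstract describes: standard cosine-similarity matching between patch embeddings and class text embeddings, plus a per-pixel reliability score from an assumed trusty-token head that gates how probability mass is split between known and novel classes. All names here (`tagclip_logits`, `trusty_logit`, the sigmoid gating) are illustrative assumptions reflecting one plausible reading of the abstract, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def tagclip_logits(patch_emb, trusty_logit, text_emb_seen, text_emb_unseen, tau=0.07):
    """Hypothetical sketch of the two parallel processes.

    patch_emb:       (N, D) CLIP patch embeddings for N pixels/patches.
    trusty_logit:    (N, 1) output of an assumed trusty-token head,
                     one reliability logit per patch (illustrative).
    text_emb_seen:   (Cs, D) text embeddings for seen (known) classes.
    text_emb_unseen: (Cu, D) text embeddings for unseen (novel) classes.
    Returns:         (N, Cs + Cu) class logits per patch.
    """
    patch_emb = F.normalize(patch_emb, dim=-1)
    seen = F.normalize(text_emb_seen, dim=-1)
    unseen = F.normalize(text_emb_unseen, dim=-1)

    # Process 1: semantic matching via temperature-scaled cosine
    # similarity, as in standard CLIP-based mask classification.
    logits_seen = patch_emb @ seen.t() / tau      # (N, Cs)
    logits_unseen = patch_emb @ unseen.t() / tau  # (N, Cu)

    # Process 2: the reliability score decides how much to trust the
    # seen-class match for each patch; low reliability shifts mass
    # toward the novel classes.
    reliability = torch.sigmoid(trusty_logit)     # (N, 1) in [0, 1]

    return torch.cat(
        [reliability * logits_seen, (1.0 - reliability) * logits_unseen],
        dim=-1,
    )
```

The key design point, under these assumptions, is that semantic matching and reliability judgment are computed by separate heads and only combined at the end, so the seen-class classifier is never forced to also solve the seen-vs-novel decision.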
