Augmenters at SemEval-2023 Task 1: Enhancing CLIP in Handling Compositionality and Ambiguity for Zero-Shot Visual WSD through Prompt Augmentation and Text-To-Image Diffusion

07/09/2023
by Jie S. Li, et al.

This paper describes our zero-shot approaches for the Visual Word Sense Disambiguation (VWSD) Task in English. Our preliminary study shows that the simple approach of matching candidate images with the phrase using CLIP suffers from the many-to-many nature of image-text pairs. We find that the CLIP text encoder may have limited ability to capture compositionality in natural language. Moreover, the descriptive focus of the phrase varies from instance to instance. We address these issues in our two systems, Augment-CLIP and Stable Diffusion Sampling (SD Sampling). Augment-CLIP augments the text prompt by generating sentences that contain the context phrase with the help of large language models (LLMs). We further explore CLIP models in other languages, since an ambiguous word may translate into an unambiguous one in another language. SD Sampling uses text-to-image Stable Diffusion to generate multiple images from the given phrase, increasing the likelihood that a subset of the generated images matches the image paired with the text.
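
The ranking step shared by both systems can be illustrated with a minimal sketch. It assumes embeddings have already been computed elsewhere (for example by a CLIP text or image encoder) and are passed in as plain lists of floats. The names `cosine` and `rank_candidates` and the `reduce` parameter are hypothetical, and pooling similarities by mean or max is an assumption made for illustration, not necessarily the aggregation the authors use. For an Augment-CLIP-style setup the query embeddings would come from the LLM-generated sentences; for an SD-Sampling-style setup they would come from the generated images.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(
    query_embeddings: Sequence[Vector],
    candidate_embeddings: Sequence[Vector],
    reduce: str = "mean",
) -> List[int]:
    """Return candidate indices sorted from best to worst match.

    Each candidate is scored against every query embedding, and the
    similarities are pooled with the mean (default) or the max.
    The pooling choice is an illustrative assumption.
    """
    if not query_embeddings:
        raise ValueError("at least one query embedding is required")
    scores = []
    for candidate in candidate_embeddings:
        sims = [cosine(query, candidate) for query in query_embeddings]
        if reduce == "max":
            scores.append(max(sims))
        else:
            scores.append(sum(sims) / len(sims))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 2-D embeddings standing in for real encoder outputs.
queries = [[1.0, 0.0], [0.9, 0.1]]
candidates = [[0.0, 1.0], [1.0, 0.05], [-1.0, 0.0]]
ranking = rank_candidates(queries, candidates)
print(ranking[0])  # index of the best-matching candidate image
```

Using several query embeddings instead of one is what lets a single ambiguous phrase be matched against an image that only some of its paraphrases or generated samples resemble.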

