Learning to Compose Soft Prompts for Compositional Zero-Shot Learning

04/07/2022
by Nihal V. Nayak, et al.

We introduce compositional soft prompting (CSP), a parameter-efficient learning technique that improves the zero-shot compositionality of large-scale pretrained vision-language models (VLMs) without the overhead of fine-tuning the entire model. VLMs can represent arbitrary classes as natural language prompts in their flexible text encoders, but they underperform state-of-the-art methods on compositional zero-shot benchmark tasks. To improve VLMs, we propose a novel form of soft prompting. We treat the attributes and objects that are composed to define classes as learnable tokens of vocabulary and tune them on multiple prompt compositions. During inference, we recompose the learned attribute-object vocabulary in new combinations and show that CSP outperforms the original VLM on benchmark datasets by an average of 14.7 percentage points of accuracy. CSP also achieves new state-of-the-art accuracies on two out of three benchmark datasets, while fine-tuning only a small number of parameters. Further, we show that CSP improves generalization to higher-order attribute-attribute-object compositions and to combinations of pretrained attributes and fine-tuned objects.
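The abstract describes the mechanism concretely enough to sketch: each attribute and each object gets a learnable embedding that stands in for its token in a prompt such as "a photo of [attribute] [object]", and only those embeddings are tuned while the VLM stays frozen. Below is a minimal PyTorch sketch of that composition step, assuming a CLIP-style text encoder. The class name and interface are illustrative assumptions, not the authors' released implementation.

    # Minimal sketch of the CSP idea, assuming a CLIP-style text encoder.
    # The class name and prompt-splicing interface are hypothetical, not
    # the authors' released code.
    import torch
    import torch.nn as nn

    class CompositionalSoftPrompt(nn.Module):
        """Learnable attribute/object token embeddings composed into
        prompts such as "a photo of [attribute] [object]"."""

        def __init__(self, num_attrs: int, num_objs: int, embed_dim: int):
            super().__init__()
            # One learnable vector per attribute and per object. Per the
            # abstract, only these are tuned; the VLM itself stays frozen.
            self.attr_emb = nn.Embedding(num_attrs, embed_dim)
            self.obj_emb = nn.Embedding(num_objs, embed_dim)

        def compose(self, attr_idx: torch.Tensor, obj_idx: torch.Tensor,
                    prefix_emb: torch.Tensor) -> torch.Tensor:
            """Append the learned [attribute][object] vectors to an embedded
            prompt prefix (e.g. the token embeddings for "a photo of").

            attr_idx, obj_idx: (batch,) LongTensors of class indices.
            prefix_emb: (batch, prefix_len, embed_dim).
            Returns: (batch, prefix_len + 2, embed_dim) for the text encoder.
            """
            a = self.attr_emb(attr_idx).unsqueeze(1)  # (batch, 1, dim)
            o = self.obj_emb(obj_idx).unsqueeze(1)    # (batch, 1, dim)
            return torch.cat([prefix_emb, a, o], dim=1)

In this sketch, training would backpropagate only into attr_emb and obj_emb; at inference, unseen attribute-object classes are scored by composing new index pairs through the same frozen text encoder, mirroring the recomposition described in the abstract.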

Related research

11/09/2022
Prompting Large Pre-trained Vision-Language Models For Compositional Concept Learning
This work explores the zero-shot compositional learning ability of large...

05/02/2023
DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning
Compositional Zero-shot Learning (CZSL) aims to recognize novel concepts...

07/11/2023
Synthetic Dataset for Evaluating Complex Compositional Knowledge for Natural Language Inference
We introduce a synthetic dataset called Sentences Involving Complex Comp...

11/23/2022
Open-vocabulary Attribute Detection
Vision-language modeling has enabled open-vocabulary tasks where predict...

09/16/2021
Efficient Attribute Injection for Pretrained Language Models
Metadata attributes (e.g., user and product IDs from reviews) can be inc...

06/25/2020
A causal view of compositional zero-shot recognition
People easily recognize new visual categories that are new combinations ...

08/10/2022
Patching open-vocabulary models by interpolating weights
Open-vocabulary models like CLIP achieve high accuracy across many image...
