Improving Generalization of Image Captioning with Unsupervised Prompt Learning

by Hongchen Wei, et al.

Pretrained visual-language models have demonstrated impressive zero-shot abilities in image captioning when accompanied by hand-crafted prompts, which use human prior knowledge to guide the model. However, because domains differ widely, hand-crafted prompts that provide invariant prior knowledge may result in mode collapse for some domains. Some studies have attempted to incorporate expert knowledge and instruction datasets, but these approaches were costly and led to hallucinations. In this paper, we propose an unsupervised prompt learning method to improve Generalization of Image Captioning (GeneIC), which learns a domain-specific prompt vector for the target domain without requiring annotated data. GeneIC aligns the visual and language modalities with a pre-trained Contrastive Language-Image Pre-Training (CLIP) model and optimizes the domain-specific prompt vector from two aspects: attribute consistency and semantic consistency. Specifically, GeneIC first generates attribute-transferred images whose attributes differ from those of the original images while remaining semantically similar to them. Then, GeneIC uses CLIP to measure the similarity between the images and the generated sentences. By exploring the variable and invariant features in the original and attribute-transferred images, attribute consistency constrains the direction of attribute change in both images and sentences to learn domain-specific knowledge. Semantic consistency directly measures the similarity between the generated sentences and the images to ensure the accuracy and comprehensiveness of the generated sentences. Consequently, GeneIC optimizes only the prompt vectors, which effectively retains the knowledge in the large model while introducing domain-specific knowledge.
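
The two consistency terms described above can be illustrated with a minimal sketch. The abstract does not give the exact loss formulas, so the forms below are assumptions: semantic consistency is written as one minus the cosine similarity between an image embedding and a sentence embedding, and attribute consistency as one minus the cosine similarity between the image-side and sentence-side change directions. The function names and the toy three-dimensional vectors are hypothetical stand-ins for CLIP embeddings, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_consistency_loss(image_emb, text_emb):
    """Assumed form: low when the sentence embedding matches the image embedding."""
    return 1.0 - cosine(image_emb, text_emb)


def attribute_consistency_loss(img_orig, img_transferred, txt_orig, txt_transferred):
    """Assumed form: low when images and sentences change in the same direction."""
    d_img = [b - a for a, b in zip(img_orig, img_transferred)]
    d_txt = [b - a for a, b in zip(txt_orig, txt_transferred)]
    return 1.0 - cosine(d_img, d_txt)


# Toy stand-ins for CLIP embeddings (hypothetical values, not real model outputs).
img_orig = [1.0, 0.0, 0.0]         # original image
img_transferred = [1.0, 1.0, 0.0]  # attribute-transferred image
txt_orig = [0.9, 0.1, 0.0]         # sentence generated for the original image
txt_transferred = [0.9, 1.1, 0.0]  # sentence generated for the transferred image

semantic = semantic_consistency_loss(img_orig, txt_orig)
attribute = attribute_consistency_loss(
    img_orig, img_transferred, txt_orig, txt_transferred
)
total = semantic + attribute  # equal weighting is an assumption
print(f"semantic={semantic:.4f} attribute={attribute:.4f} total={total:.4f}")
```

In a real setup, the embeddings would come from a frozen CLIP model and the gradient of the combined loss would update only the prompt vector, which is consistent with the abstract's statement that only the prompt vectors are optimized.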


Prompt-based Learning for Unpaired Image Captioning

Unpaired Image Captioning (UIC) has been developed to learn image descri...

Cross-Modal Similarity-Based Curriculum Learning for Image Captioning

Image captioning models require the high-level generalization ability to...

ZstGAN: An Adversarial Approach for Unsupervised Zero-Shot Image-to-Image Translation

Image-to-image translation models have shown remarkable ability on trans...

Give me a hint! Navigating Image Databases using Human-in-the-loop Feedback

In this paper, we introduce an attribute-based interactive image search ...

A Neural Conversational Model

Conversational modeling is an important task in natural language underst...

Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings

Learned joint representations of images and text form the backbone of se...

Evaluation of Correctness in Unsupervised Many-to-Many Image Translation

Given an input image from a source domain and a "guidance" image from a ...
