ILLUME: Rationalizing Vision-Language Models by Interacting with their Jabber

08/17/2022
by Manuel Brack, et al.

Bootstrapping from pre-trained language models has proven to be an efficient approach for building foundation vision-language models (VLMs) for tasks such as image captioning or visual question answering. However, it is difficult, if not impossible, to use this approach to make the model conform to users' rationales for specific answers. To elicit and reinforce commonsense reasoning, we propose an iterative sampling and tuning paradigm, called ILLUME, that executes the following loop: given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides minimal feedback via preference selection, which is then used for fine-tuning. This loop increases the training data and gradually carves out the VLM's rationalization capabilities. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly less training data and requiring only minimal feedback.
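To make the loop concrete, below is a minimal sketch of the sampling-and-tuning cycle as the abstract describes it. Every name in it (illume_loop, vlm.generate, critic_select, fine_tune) is an illustrative assumption, not the authors' actual implementation or API.

# Minimal, hypothetical sketch of the ILLUME sampling-and-tuning loop.
# All identifiers here are placeholders assumed for illustration.

def illume_loop(vlm, dataset, critic_select, fine_tune,
                num_iterations=3, num_samples=8):
    """Grow a pool of human-approved rationales and repeatedly
    fine-tune the VLM on it."""
    training_pool = []
    for _ in range(num_iterations):
        for image, question, answer in dataset:
            # The VLM samples multiple candidate rationales for the
            # image-question-answer prompt.
            candidates = [vlm.generate(image, question, answer)
                          for _ in range(num_samples)]
            # A human critic provides minimal feedback by selecting
            # the candidates (if any) that match their own rationale.
            training_pool.extend(
                critic_select(image, question, answer, candidates))
        # Fine-tune on everything selected so far; the training data
        # grows with each iteration.
        vlm = fine_tune(vlm, training_pool)
    return vlm

Note that the feedback signal is only a preference selection over self-generated samples, which is why the human effort stays minimal compared to writing gold rationales for standard supervised fine-tuning.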

Related research

12/13/2021 - VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
Recently, fine-tuning language models pre-trained on large text corpora ...

09/15/2021 - Transformer-based Language Models for Factoid Question Answering at BioASQ9b
In this work, we describe our experiments and participating systems in t...

05/01/2021 - When to Fold'em: How to answer Unanswerable questions
We present 3 different question-answering models trained on the SQuAD2.0...

09/22/2021 - Caption Enriched Samples for Improving Hateful Memes Detection
The recently introduced hateful meme challenge demonstrates the difficul...

06/01/2023 - Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering
The pre-training-fine-tuning paradigm based on layout-aware multimodal p...

04/25/2022 - Super-Prompting: Utilizing Model-Independent Contextual Data to Reduce Data Annotation Required in Visual Commonsense Tasks
Pre-trained language models have shown excellent results in few-shot lea...

05/20/2023 - What Makes for Good Visual Tokenizers for Large Language Models?
We empirically investigate proper pre-training methods to build good vis...
