Multimodal Prompt Retrieval for Generative Visual Question Answering

06/30/2023
by Timothy Ossowski, et al.

Recent years have witnessed impressive results of pre-trained vision-language models on knowledge-intensive tasks such as visual question answering (VQA). Despite the recent advances in VQA, existing methods mainly adopt a discriminative formulation that predicts answers within a pre-defined label set, leading to easy overfitting on low-resource domains with limited labeled data (e.g., medicine) and poor generalization under domain shift to another dataset. To tackle this limitation, we propose a novel generative model enhanced by multimodal prompt retrieval (MPR) that integrates retrieved prompts and multimodal features to generate answers in free text. Our generative model enables rapid zero-shot dataset adaptation to unseen data distributions and open-set answer labels across datasets. Our experiments on medical VQA tasks show that MPR outperforms its non-retrieval counterpart by up to 30 points in a few-shot domain adaptation setting.
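The retrieval step described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the encoder, the similarity measure, or the prompt template, so the three-dimensional vectors (stand-ins for fused image and question features), the cosine similarity, the `retrieve` and `build_prompt` helpers, and the prompt wording are all assumptions made for illustration. The idea shown is only that question-answer pairs similar to the query are looked up and prepended as text, so a generative model could condition on them when producing a free-text answer.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, memory, k=2):
    """Return the k memory entries most similar to the query vector."""
    ranked = sorted(memory, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return ranked[:k]


def build_prompt(question, neighbors):
    """Prepend retrieved question-answer pairs to the new question (hypothetical template)."""
    hints = " ".join(
        f"similar question: {n['question']} answer: {n['answer']}." for n in neighbors
    )
    return f"{hints} question: {question} answer:"


# Toy retrieval memory; the vectors are made-up placeholders for multimodal features.
MEMORY = [
    {"vec": [0.9, 0.1, 0.0], "question": "Is there a fracture?", "answer": "yes"},
    {"vec": [0.1, 0.9, 0.0], "question": "What organ is shown?", "answer": "lung"},
    {"vec": [0.0, 0.2, 0.9], "question": "Which plane is this?", "answer": "axial"},
]

query = [0.8, 0.2, 0.1]  # placeholder features for the new image-question pair
neighbors = retrieve(query, MEMORY, k=2)
prompt = build_prompt("Is the bone broken?", neighbors)
print(prompt)
```

In a full system the resulting prompt would be passed, together with the image features, to a text generator; because the retrieval memory is separate from the model weights, it could in principle be swapped for entries from a new dataset without retraining, which is one way to read the dataset-adaptation claim in the abstract.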

research
07/27/2023

Med-Flamingo: a Multimodal Medical Few-shot Learner

Medicine, by its nature, is a multifaceted domain that requires the synt...
research
07/12/2021

Zero-shot Visual Question Answering using Knowledge Graph

Incorporating external knowledge to Visual Question Answering (VQA) has ...
research
03/10/2023

Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models

Medical Visual Question Answering (VQA) is an important challenge, as it...
research
11/11/2019

Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation

We study the problem of visual question answering (VQA) in images by exp...
research
03/29/2021

Domain-robust VQA with diverse datasets and methods but no target labels

The observation that computer vision methods overfit to dataset specific...
research
05/23/2023

i-Code Studio: A Configurable and Composable Framework for Integrative AI

Artificial General Intelligence (AGI) requires comprehensive understandi...
research
05/24/2022

Rethinking Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization

Vision-and-language (V&L) models pretrained on large-scale multimodal ...
