Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models

03/10/2023
by Tom van Sonsbeek, et al.

Medical Visual Question Answering (VQA) is an important challenge: accurate systems could support faster and more reliable diagnoses and treatment decisions. Most existing methods approach it as a multi-class classification problem, which restricts the outcome to a predefined, closed set of curated answers. We focus on open-ended VQA and, motivated by recent advances in language models, treat it as a generative task. Leveraging pre-trained language models, we introduce a novel method particularly suited to small, domain-specific medical datasets. To communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. These learnable tokens, alongside the question, then directly prompt the language model. We explore recent parameter-efficient fine-tuning strategies for language models, which allow for resource- and data-efficient fine-tuning. We evaluate our approach on the prime medical VQA benchmarks, namely Slake, OVQA and PathVQA. The results demonstrate that our approach outperforms existing methods across various training settings while also being computationally efficient.
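The abstract describes mapping extracted visual features to a set of learnable tokens that, together with the question, prompt a frozen language model. Below is a minimal sketch of that idea, assuming a pre-extracted visual feature vector (e.g., from a CLIP-style encoder) and GPT-2 as the frozen language model; names such as VisualPrefixMapper, prefix_length, and the 512-dimensional feature size are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of visual prefix tuning for open-ended VQA.
# Assumptions (not from the paper): GPT-2 as the frozen LM, CLIP-style
# 512-d visual features, an MLP mapper, and a prefix length of 8.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class VisualPrefixMapper(nn.Module):
    """Maps one visual feature vector to a sequence of prefix token embeddings."""
    def __init__(self, visual_dim: int, embed_dim: int, prefix_length: int):
        super().__init__()
        self.prefix_length = prefix_length
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim, embed_dim * prefix_length),
            nn.Tanh(),
            nn.Linear(embed_dim * prefix_length, embed_dim * prefix_length),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # (batch, visual_dim) -> (batch, prefix_length, embed_dim)
        batch = visual_features.size(0)
        return self.mlp(visual_features).view(batch, self.prefix_length, -1)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
for p in lm.parameters():       # keep the language model frozen;
    p.requires_grad = False     # only the mapper is trained

embed_dim = lm.config.n_embd
mapper = VisualPrefixMapper(visual_dim=512, embed_dim=embed_dim, prefix_length=8)

# Toy forward pass: visual prefix tokens are prepended to the question tokens.
visual_features = torch.randn(1, 512)           # stand-in for encoder output
prefix_embeds = mapper(visual_features)         # (1, 8, embed_dim)
question = tokenizer("What abnormality is visible?", return_tensors="pt")
question_embeds = lm.transformer.wte(question.input_ids)
inputs_embeds = torch.cat([prefix_embeds, question_embeds], dim=1)
outputs = lm(inputs_embeds=inputs_embeds)       # logits for open-ended generation
```

For the parameter-efficient fine-tuning the abstract mentions, one common option is LoRA via the peft library; this pairing is an assumption, since the abstract does not name a specific strategy:

```python
# Hypothetical pairing with LoRA (peft); hyperparameters are illustrative.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8, lora_alpha=16, target_modules=["c_attn"],
    lora_dropout=0.05, task_type="CAUSAL_LM",
)
lm = get_peft_model(lm, lora_config)    # only the LoRA adapters are trainable
lm.print_trainable_parameters()
```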

Related research

05/17/2023 · PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
In this paper, we focus on the problem of Medical Visual Question Answer...

06/30/2023 · Multimodal Prompt Retrieval for Generative Visual Question Answering
Recent years have witnessed impressive results of pre-trained vision-lan...

11/11/2022 · MF2-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering
There is a key problem in the medical visual question answering task tha...

06/25/2021 · A Picture May Be Worth a Hundred Words for Visual Question Answering
How far can we go with textual representations for understanding picture...

08/16/2023 · Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection
Hateful meme detection is a challenging multimodal task that requires co...

08/18/2023 · PUMGPT: A Large Vision-Language Model for Product Understanding
Recent developments of multi-modal large language models have demonstrat...

03/02/2023 · MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering
Recently, finetuning pretrained vision-language models (VLMs) has become...
