Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark

06/10/2023
by   Li Xu, et al.

With the availability of large-scale, comprehensive, general-purpose vision-language (VL) datasets such as MSCOCO, vision-language pre-training (VLP) has become an active area of research and has proven effective for various VL tasks such as visual question answering. However, studies of VLP in the medical domain have so far been scarce. To provide a comprehensive perspective on VLP for medical VL tasks, we conduct a thorough experimental analysis of the key factors that may affect the performance of VLP with a unified vision-language Transformer. To enable sound and quick pre-training decisions, we propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs collected from the open-access online database MedPix. RGC can be used as a pre-training dataset or as a new benchmark for medical report generation and medical image-text retrieval. By utilizing RGC and other available datasets for pre-training, we derive several key insights that can guide future medical VLP research and establish strong new baselines for various medical VL tasks.
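Medical image-text retrieval benchmarks of this kind are typically scored with Recall@K: for each image, captions are ranked by a model-produced similarity score, and the metric is the fraction of images whose correct caption appears in the top K. As a minimal sketch (the similarity scores below are hypothetical, not from the paper's model):

```python
def recall_at_k(sim, k):
    """Image-to-text Recall@K.

    sim[i][j] is the similarity between image i and caption j;
    the matching caption for image i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank caption indices by similarity, highest first.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

# Toy 3x3 similarity matrix over three image-caption pairs.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.2, 0.7, 0.6],  # image 2's correct caption is only ranked 2nd
]
print(recall_at_k(sim, 1))  # 2 of 3 images correct at rank 1
print(recall_at_k(sim, 2))  # all 3 correct within the top 2
```

Text-to-image retrieval is scored the same way with the roles of rows and columns swapped (i.e., ranking images for each caption).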

