SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

08/24/2021
by Zirui Wang, et al.

With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations, including clean image captions and regional labels, limits the scalability of existing approaches and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74) and SNLI-VE (+1.37). Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
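The single prefix language modeling objective mentioned above works by letting a "prefix" of the sequence (in SimVLM, image patches plus an initial span of text) attend bidirectionally, while the remaining tokens are predicted autoregressively. As a rough illustration only (not the paper's implementation), the attention mask such an objective implies can be sketched in NumPy; `prefix_lm_mask` is a hypothetical helper name:

```python
import numpy as np

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Attention mask for a prefix language modeling objective.

    Prefix positions attend bidirectionally among themselves; the
    remaining (suffix) positions attend to the full prefix and,
    causally, to earlier suffix positions. mask[i, j] == 1 means
    position i may attend to position j.
    """
    # Start from a standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=np.int64))
    # Let every position inside the prefix see the whole prefix.
    mask[:prefix_len, :prefix_len] = 1
    return mask

# Example: 5 tokens, the first 3 form the bidirectional prefix.
m = prefix_lm_mask(5, 3)
```

The training loss would then be an ordinary cross-entropy over the suffix tokens only, so one decoder-style objective covers both image-conditioned understanding and generation.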
