Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer

03/14/2022
by Woojeong Jin, et al.

Pre-trained language models are still far from human performance on tasks that require understanding the properties (e.g., appearance, measurable quantity) and affordances of everyday objects in the real world, since text lacks such information due to reporting bias. In this work, we study whether integrating visual knowledge into a language model can fill this gap. We investigate two types of knowledge transfer: (1) text knowledge transfer, using image captions that may contain enriched visual knowledge, and (2) cross-modal knowledge transfer, using both images and captions with vision-language training objectives. On five downstream tasks that may require visual knowledge, we perform extensive empirical comparisons of the presented objectives. Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.

research
05/12/2023

Towards Versatile and Efficient Visual Knowledge Injection into Pre-trained Language Models with Cross-Modal Adapters

Humans learn language via multi-modal knowledge. However, due to the tex...
research
05/23/2023

CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

Pre-trained vision-language models are the de-facto foundation models fo...
research
08/19/2023

An Empirical Study of CLIP for Text-based Person Search

Text-based Person Search (TBPS) aims to retrieve the person images using...
research
09/05/2021

Data Efficient Masked Language Modeling for Vision and Language

Masked language modeling (MLM) is one of the key sub-tasks in vision-lan...
research
06/30/2020

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph

We propose a knowledge-enhanced approach, ERNIE-ViL, to learn joint repr...
research
04/18/2022

Imagination-Augmented Natural Language Understanding

Human brains integrate linguistic and perceptual information simultaneou...
research
05/17/2023

Probing the Role of Positional Information in Vision-Language Models

In most Vision-Language models (VL), the understanding of the image stru...
