Visually-augmented pretrained language models for NLP tasks without images

by   Hangyu Guo, et al.
Renmin University of China
Harbin Institute of Technology
NetEase, Inc

Although pre-trained language models (PLMs) have shown impressive performance by text-only self-supervised training, they are found lack of visual semantics or commonsense, e.g., sizes, shapes, and colors of commonplace objects. Existing solutions often rely on explicit images for visual knowledge augmentation (requiring time-consuming retrieval or generation), and they also conduct the augmentation for the whole input text, without considering whether it is actually needed in specific inputs or tasks. To address these issues, we propose a novel visually-augmented fine-tuning approach that can be generally applied to various PLMs or NLP tasks, without using any retrieved or generated images, namely VAWI. Specifically, we first identify the visually-hungry words (VH-words) from input text via a token selector, where three different methods have been proposed, including syntax-, attention- and learning-based strategies. Then, we adopt a fixed CLIP text encoder to generate the visually-augmented representations of these VH-words. As it has been pre-trained by vision-language alignment task on the large-scale corpus, it is capable of injecting visual semantics into the aligned text representations. Finally, the visually-augmented features will be fused and transformed into the pre-designed visual prompts based on VH-words, which can be inserted into PLMs to enrich the visual semantics in word representations. We conduct extensive experiments on ten NLP tasks, i.e., GLUE benchmark, CommonsenseQA, CommonGen, and SNLI-VE. Experimental results show that our approach can consistently improve the performance of BERT, RoBERTa, BART, and T5 at different scales, and outperform several competitive baselines significantly. Our codes and data are publicly available at <>.


Visually-Augmented Language Modeling

Human language is grounded on multimodal knowledge including visual know...

Learning to Imagine: Visually-Augmented Natural Language Generation

People often imagine relevant scenes to aid in the writing process. In t...

Understanding Mobile GUI: from Pixel-Words to Screen-Sentences

The ubiquity of mobile phones makes mobile GUI understanding an importan...

Order-sensitive Shapley Values for Evaluating Conceptual Soundness of NLP Models

Previous works show that deep NLP models are not always conceptually sou...

Randomized Smoothing with Masked Inference for Adversarially Robust Text Classifications

Large-scale pre-trained language models have shown outstanding performan...

Experiments with adversarial attacks on text genres

Neural models based on pre-trained transformers, such as BERT or XLM-RoB...

LMdiff: A Visual Diff Tool to Compare Language Models

While different language models are ubiquitous in NLP, it is hard to con...

Please sign up or login with your details

Forgot password? Click here to reset