UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

01/31/2023
by Dachuan Shi, et al.

Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavy models, e.g., Transformers, have drawn researchers' attention to model compression. However, how to compress multimodal models, especially vision-language Transformers, is still under-explored. This paper proposes Unified and Progressive Pruning (UPop) as a universal vision-language Transformer compression framework, which incorporates 1) unifiedly searching multimodal subnets in a continuous optimization space from the original model, which enables automatic assignment of pruning ratios among compressible modalities and structures; and 2) progressively searching and retraining the subnet, which maintains convergence between the search and retrain phases to attain higher compression ratios. Experiments on multiple generative and discriminative vision-language tasks, including Visual Reasoning, Image Captioning, Visual Question Answering, Image-Text Retrieval, Text-Image Retrieval, and Image Classification, demonstrate the effectiveness and versatility of the proposed UPop framework.
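To make the abstract's two ideas concrete, here is a minimal PyTorch sketch, not the authors' implementation: the MaskedLinear module, the progressive_keep_ratio schedule, the toy losses, and all hyperparameters are illustrative assumptions. It shows a continuous, learnable mask jointly gating units in a "vision" and a "text" branch, so a single optimizer allocates the pruning budget across modalities, while the sparsity target is annealed progressively rather than imposed in one shot.

```python
# Hypothetical sketch of UPop's two ideas:
# (1) unified search: one continuous mask per prunable structure, trained
#     jointly across modalities so pruning ratios are assigned automatically;
# (2) progressive pruning: the sparsity target is annealed over training.
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer whose output units are gated by a learnable,
    continuous mask in [0, 1] (relaxed, differentiable pruning)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.mask_logits = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) * torch.sigmoid(self.mask_logits)

def progressive_keep_ratio(step: int, total: int, final_prune: float) -> float:
    """Anneal the fraction of units to keep from 1.0 down to 1 - final_prune."""
    return 1.0 - final_prune * min(1.0, step / total)

# Toy two-modality model; both masks live in one optimization space,
# so the budget is split between "vision" and "text" automatically.
vision_ffn = MaskedLinear(16, 64)
text_ffn = MaskedLinear(16, 64)
opt = torch.optim.Adam(
    list(vision_ffn.parameters()) + list(text_ffn.parameters()), lr=1e-2
)

total_steps, final_prune = 100, 0.5  # prune 50% of units overall (made up)
for step in range(1, total_steps + 1):
    x = torch.randn(8, 16)
    # Stand-in task loss; a real setup would use a vision-language objective.
    task_loss = (vision_ffn(x) - text_ffn(x)).pow(2).mean()
    masks = torch.cat([torch.sigmoid(vision_ffn.mask_logits),
                       torch.sigmoid(text_ffn.mask_logits)])
    keep = progressive_keep_ratio(step, total_steps, final_prune)
    # Drive the mean mask toward the (progressively shrinking) keep ratio.
    sparsity_loss = (masks.mean() - keep).abs()
    (task_loss + sparsity_loss).backward()
    opt.step()
    opt.zero_grad()

# After the search, units with small mask values would be pruned away and
# the compact subnet retrained (UPop interleaves search and retraining).
```

The real search space in the paper spans structures such as attention heads and FFN dimensions across both modalities; this sketch compresses that down to one gated layer per branch purely for illustration.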
