MiniVLM: A Smaller and Faster Vision-Language Model

by Jianfeng Wang, et al.

Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has focused on achieving high accuracy with large pre-trained models, building a lightweight model is of great practical value but remains less explored. In this paper, we propose a smaller and faster VL model, MiniVLM, which can be fine-tuned with good performance on various downstream tasks like its larger counterpart. MiniVLM consists of two modules: a vision feature extractor and a transformer-based vision-language fusion module. We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, that reduces the time cost of visual feature extraction by 95% compared to a baseline model. After comparing different compact BERT models, we adopt the MiniLM structure to reduce the computation cost of the transformer module. In addition, we improve MiniVLM pre-training by adding 7M Open Images data, pseudo-labeled by a state-of-the-art captioning model. We also pre-train with high-quality image tags obtained from a strong tagging model to enhance cross-modality alignment. These large models are used only offline and add no overhead at fine-tuning or inference time. With the above design choices, MiniVLM reduces the model size by 73% and the inference time cost by 94% while retaining 94-97% of the accuracy on multiple VL tasks. We hope that MiniVLM helps ease the use of state-of-the-art VL research for on-the-edge applications.
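The two-module pipeline described above can be sketched in a few lines. This is a minimal, hypothetical illustration only: the region-feature extractor stands in for TEE (a real implementation would run an EfficientDet-style detector), and the fusion step is reduced to a single cross-attention operation from text tokens to visual regions rather than a full MiniLM-style transformer. All function names and dimensions here are assumptions, not the paper's API.

```python
import numpy as np

def extract_region_features(image, num_regions=10, dim=64, seed=0):
    # Stand-in for the TEE visual feature extractor (hypothetical):
    # a real extractor would detect regions and embed each one; here we
    # emit random fixed-size region embeddings to show the interface.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_regions, dim))

def fuse(text_emb, region_feats):
    # Stand-in for the compact fusion module: one scaled-dot-product
    # cross-attention step from text tokens (queries) to regions.
    scores = text_emb @ region_feats.T / np.sqrt(text_emb.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    # Each token becomes an attention-weighted mix of region features.
    return attn @ region_feats  # shape: (num_tokens, dim)

image = None  # placeholder; the stub extractor ignores its input
text_emb = np.random.default_rng(1).standard_normal((5, 64))  # 5 tokens
fused = fuse(text_emb, extract_region_features(image))
print(fused.shape)  # (5, 64)
```

The design point the sketch mirrors is that the expensive vision stage and the fusion stage are cleanly separable, which is what lets each be shrunk (TEE, MiniLM) independently.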

