Investigating the Catastrophic Forgetting in Multimodal Large Language Models

by Yuexiang Zhai, et al.

Following the success of GPT-4, there has been a surge of interest in multimodal large language model (MLLM) research. This line of research focuses on developing general-purpose MLLMs by fine-tuning pre-trained LLMs and vision models. However, catastrophic forgetting, a notorious phenomenon in which a fine-tuned model fails to retain performance comparable to the pre-trained model, remains an inherent problem for MLLMs. In this paper, we introduce EMT (Evaluating MulTimodality), a framework for evaluating catastrophic forgetting in MLLMs by treating each MLLM as an image classifier. We first apply EMT to several open-source fine-tuned MLLMs and find that almost all of them fail to retain the performance of their underlying vision encoders on standard image classification tasks. We then continue fine-tuning LLaVA, an MLLM, and use EMT to track performance throughout fine-tuning. Interestingly, our results suggest that early-stage fine-tuning on one image dataset improves performance on other image datasets by strengthening the alignment of textual and visual features. As fine-tuning proceeds, however, the MLLM begins to hallucinate, resulting in a significant loss of generalizability, even when the image encoder remains frozen. Our results suggest that MLLMs have yet to match their vision encoders on standard image classification tasks, and that current MLLM fine-tuning procedures still have room for improvement.
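The core idea of treating an MLLM as an image classifier can be sketched as a simple evaluation loop: prompt the model with the label set, then score whether the ground-truth class appears in its free-form answer. The sketch below is a minimal illustration of this protocol, not the paper's implementation; `query_mllm` is a hypothetical stand-in for a real model call (e.g. to LLaVA), stubbed here so the example runs on its own.

```python
def query_mllm(image, labels):
    """Hypothetical MLLM call. A real implementation would send the image
    together with the classification prompt to the model and return its
    free-form text answer; here the model is stubbed for illustration."""
    prompt = (
        "What is the object in this image? "
        f"Answer with a single label from: {', '.join(labels)}."
    )
    # Stub: pretend the model always answers "cat".
    return "cat"

def emt_accuracy(dataset, labels):
    """Score the MLLM as a classifier: an answer counts as correct only if
    the ground-truth label occurs in the (normalized) model output."""
    correct = 0
    for image, true_label in dataset:
        answer = query_mllm(image, labels).strip().lower()
        if true_label.lower() in answer:
            correct += 1
    return correct / len(dataset)

if __name__ == "__main__":
    labels = ["cat", "dog", "bird"]
    toy_dataset = [("img0", "cat"), ("img1", "dog"), ("img2", "cat")]
    print(f"accuracy = {emt_accuracy(toy_dataset, labels):.2f}")
```

Substring matching is a deliberately lenient scoring choice: MLLMs often answer in full sentences ("This is a photo of a cat"), so exact string equality would undercount correct predictions.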




