Fine-Tuning Language Models with Just Forward Passes

by Sadhika Malladi et al.

Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion-parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
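To make the "two forward passes, in-place" idea concrete, below is a minimal, hypothetical PyTorch sketch of a two-forward-pass, in-place ZO-SGD step in the spirit the abstract describes. The toy linear model, MSE loss, and the eps, lr, and seed values are placeholders for illustration only, not the paper's actual models, tasks, or hyperparameters.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: a small linear model and a regression loss stand in
# for a real language model and its fine-tuning objective.
model = nn.Linear(16, 1)
x, y = torch.randn(8, 16), torch.randn(8, 1)
loss_fn = nn.MSELoss()


def forward_loss():
    # One forward pass; no gradients or activations are kept.
    with torch.no_grad():
        return loss_fn(model(x), y).item()


def perturb(scale, eps, seed):
    # Regenerate the same Gaussian perturbation z from the seed and add
    # scale * eps * z to every parameter in place, so z is never stored.
    torch.manual_seed(seed)
    for p in model.parameters():
        z = torch.randn_like(p)
        p.data.add_(z, alpha=scale * eps)


def zo_sgd_step(eps=1e-3, lr=1e-4, seed=0):
    # Two forward passes, at theta + eps*z and theta - eps*z.
    perturb(+1, eps, seed)
    loss_plus = forward_loss()
    perturb(-2, eps, seed)          # move to theta - eps*z
    loss_minus = forward_loss()
    perturb(+1, eps, seed)          # restore theta

    # Scalar finite-difference estimate of the directional derivative.
    grad_est = (loss_plus - loss_minus) / (2 * eps)

    # In-place SGD update, regenerating z from the same seed instead of
    # keeping a copy of it in memory.
    torch.manual_seed(seed)
    for p in model.parameters():
        z = torch.randn_like(p)
        p.data.add_(z, alpha=-lr * grad_est)


for step in range(100):
    zo_sgd_step(seed=step)
```

In this sketch, the perturbation is reconstructed from a stored random seed rather than held in memory, which is why the step needs no more memory than inference itself, only forward passes and in-place parameter updates.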

