Quantized Distributed Training of Large Models with Convergence Guarantees

02/05/2023
by Ilia Markov et al.

Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs). The recent emergence of large language models such as GPT has created the need for new approaches to exploit data parallelism. Among these, fully-sharded data parallel (FSDP) training is highly popular, yet it still encounters scalability bottlenecks. One reason is that applying compression techniques to FSDP is challenging: since the vast majority of the communication involves the model's weights, direct compression alters convergence and leads to accuracy loss. We present QSDP, a variant of FSDP that supports both gradient and weight quantization with theoretical guarantees, is simple to implement, and has essentially no overhead. To derive QSDP, we prove that a natural modification of SGD achieves convergence even when we only maintain quantized weights; the domain over which we train thus consists of quantized points and is therefore highly non-convex. We validate this approach by training GPT-family models with up to 1.3 billion parameters on a multi-node cluster. Experiments show that QSDP preserves model accuracy while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.
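
The core algorithmic change behind this idea is small: instead of the plain SGD update w <- w - lr * g, each worker applies an unbiased stochastic quantizer Q after every step, w <- Q(w - lr * g), so that only quantized weights are ever stored or communicated. The sketch below is our illustration of that idea, not the authors' code: the uniform grid, the per-tensor scaling, and all function names are assumptions made for the example.

import numpy as np

def stochastic_quantize(x, bits=8):
    # Unbiased stochastic rounding of x onto a uniform per-tensor grid
    # (our assumed quantizer, not necessarily the paper's exact scheme).
    scale = np.max(np.abs(x)) + 1e-12   # per-tensor scale
    levels = 2 ** (bits - 1) - 1        # number of positive grid levels
    y = x / scale * levels              # map values into grid units
    low = np.floor(y)
    prob_up = y - low                   # rounding up with this probability keeps E[Q(x)] = x
    rounded = low + (np.random.rand(*x.shape) < prob_up)
    return rounded / levels * scale

def quantized_sgd_step(w_q, grad, lr, bits=8):
    # The modified update: w <- Q(w - lr * g), so the stored weights
    # always remain on the quantized grid.
    return stochastic_quantize(w_q - lr * grad, bits=bits)

# Toy usage: minimize ||w - w*||^2 while only ever storing quantized weights.
rng = np.random.default_rng(0)
w_star = rng.normal(size=8)
w = stochastic_quantize(rng.normal(size=8))
for _ in range(200):
    grad = 2.0 * (w - w_star)
    w = quantized_sgd_step(w, grad, lr=0.05)
print("final squared error:", float(np.sum((w - w_star) ** 2)))

In the full system, the same style of quantizer would also compress the gradients and the weight shards exchanged between nodes, which is how the weight-communication bottleneck of FSDP is removed.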


Related research

04/28/2021 · NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization
  As the size and complexity of models and datasets grow, so does the need...

12/10/2020 · Recurrence of Optimum for Training Weight and Activation Quantized Networks
  Deep neural networks (DNNs) are quantized for efficient inference on res...

06/02/2022 · Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees
  Communication compression is a crucial technique for modern distributed ...

04/21/2021 · ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training
  Large-scale distributed training of Deep Neural Networks (DNNs) on state...

04/28/2020 · Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training
  In data-parallel synchronous training of deep neural networks, different...

05/29/2023 · Global-QSGD: Practical Floatless Quantization for Distributed Learning with Theoretical Guarantees
  Efficient distributed training is a principal driver of recent advances ...

12/07/2017 · AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training
  Highly distributed training of Deep Neural Networks (DNNs) on future com...
