Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification

10/29/2020
by Saurabh Agarwal, et al.

Distributed model training suffers from communication bottlenecks due to frequent model updates transmitted across compute nodes. To alleviate these bottlenecks, practitioners use gradient compression techniques like sparsification, quantization, or low-rank updates. These techniques usually require choosing a static compression ratio, forcing users to balance the trade-off between model accuracy and per-iteration speedup. In this work, we show that the performance degradation caused by choosing a high compression ratio is not fundamental: an adaptive compression strategy can reduce communication while maintaining final test accuracy. Inspired by recent findings on critical learning regimes, in which small gradient errors can have an irrecoverable impact on model performance, we propose Accordion, a simple yet effective adaptive compression algorithm. While Accordion maintains a high compression rate on average, it avoids over-compressing gradients during critical learning regimes, which it detects with a simple gradient-norm-based criterion. Our extensive experimental study over a number of machine learning tasks in distributed environments indicates that Accordion maintains model accuracy similar to uncompressed training, yet achieves up to 5.5x better compression and up to a 4.1x end-to-end speedup over static approaches. We also show that Accordion works for adjusting the batch size, another popular strategy for alleviating communication bottlenecks.
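The core mechanism lends itself to a short sketch. The Python snippet below is a rough illustration, not the paper's exact rule: it assumes the gradient-norm criterion compares consecutive norm measurements and switches between a low and a high compression ratio accordingly; the threshold eta and the two ratios are hypothetical parameters, not values from the paper.

# A minimal sketch (illustrative assumptions, not the authors' exact rule):
# switch compression based on the relative change in gradient norm
# between consecutive measurements.

def in_critical_regime(prev_norm, curr_norm, eta=0.5):
    # Treat training as "critical" while the gradient norm is still
    # changing rapidly; be conservative before any history exists.
    if prev_norm is None:
        return True
    return abs(prev_norm - curr_norm) / max(prev_norm, 1e-12) >= eta

def choose_compression_ratio(prev_norm, curr_norm, low_ratio=0.1, high_ratio=0.99):
    # Compress gently during critical regimes, aggressively otherwise.
    if in_critical_regime(prev_norm, curr_norm):
        return low_ratio
    return high_ratio

# Example: the norm dropped sharply between checks, so compress gently.
print(choose_compression_ratio(prev_norm=4.0, curr_norm=1.5))  # -> 0.1

In an actual data-parallel training loop, this ratio would be fed to whatever compressor is in use (e.g., the rank of a low-rank update or the sparsity level), and the norm check would run every few epochs rather than every iteration.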



Related research

03/05/2021
Pufferfish: Communication-efficient Models At No Extra Cost
To mitigate communication overheads in distributed model training, sever...

10/31/2022
L-GreCo: An Efficient and General Framework for Layerwise-Adaptive Gradient Compression
Data-parallel distributed training of deep neural networks (DNN) has gai...

05/20/2023
GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training
Distributed data-parallel (DDP) training improves overall application th...

05/31/2019
PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization
We study gradient compression methods to alleviate the communication bot...

02/06/2023
U-Clip: On-Average Unbiased Stochastic Gradient Clipping
U-Clip is a simple amendment to gradient clipping that can be applied to...

08/11/2022
Quantized Adaptive Subgradient Algorithms and Their Applications
Data explosion and an increase in model size drive the remarkable advanc...

02/28/2021
On the Utility of Gradient Compression in Distributed Training Systems
Rapid growth in data sets and the scale of neural network architectures ...
