Communication-Efficient Distributed Deep Learning: A Comprehensive Survey

by Zhenheng Tang, et al.

Distributed deep learning has become common for reducing overall training time by exploiting multiple computing devices (e.g., GPUs/TPUs) as deep models and data sets grow in size. However, data communication between computing devices can become a bottleneck that limits system scalability. How to address the communication problem in distributed deep learning has recently become an active research topic. In this paper, we provide a comprehensive survey of communication-efficient distributed training algorithms, covering both system-level and algorithmic-level optimizations. At the system level, we demystify the system design and implementation choices that reduce communication cost. At the algorithmic level, we compare different algorithms in terms of their theoretical convergence bounds and communication complexity. Specifically, we first propose a taxonomy of data-parallel distributed training algorithms with four main dimensions: communication synchronization, system architectures, compression techniques, and parallelism of communication and computing. We then discuss the studies that address the problems in each of the four dimensions and compare their communication cost. We further compare the convergence rates of different algorithms, which indicate how fast the algorithms converge to a solution in terms of iterations. Based on the system-level communication cost analysis and the theoretical convergence speed comparison, we help readers understand which algorithms are more efficient under specific distributed environments and extrapolate potential directions for further optimization.




Communication Optimization Strategies for Distributed Deep Learning: A Survey

Recent trends in high-performance computing and deep learning lead to a ...

Communication-Efficient Distributed Deep Learning: Survey, Evaluation, and Challenges

In recent years, distributed deep learning techniques are widely deploye...

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

Deep Neural Networks (DNNs) are becoming an important tool in modern com...

Accelerating Data Loading in Deep Neural Network Training

Data loading can dominate deep neural network training time on large-sca...

RedSync : Reducing Synchronization Traffic for Distributed Deep Learning

Data parallelism has already become a dominant method to scale Deep Neur...

OpTree: An Efficient Algorithm for All-gather Operation in Optical Interconnect Systems

All-gather collective communication is one of the most important communi...

Block-distributed Gradient Boosted Trees

The Gradient Boosted Tree (GBT) algorithm is one of the most popular mac...
