Decentralization Meets Quantization
Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: quantization for low bandwidth networks, and decentralization for high latency networks. In this paper, we explore a natural question: can the combination of both decentralization and quantization lead to a system that is robust to both bandwidth and latency? Although the system implication of such combination is trivial, the underlying theoretical principle and algorithm design is challenging: simply quantizing data sent in a decentralized training algorithm would accumulate the error. In this paper, we develop a framework of quantized, decentralized training and propose two different strategies, which we call extrapolation compression and difference compression. We analyze both algorithms and prove both converge at the rate of O(1/√(nT)) where n is the number of workers and T is the number of iterations, matching the convergence rate for full precision, centralized training. We evaluate our algorithms on training deep learning models, and find that our proposed algorithm outperforms the best of merely decentralized and merely quantized algorithm significantly for networks with both high latency and low bandwidth.
READ FULL TEXT