Convergence Analysis of Distributed Stochastic Gradient Descent with Shuffling

by Qi Meng, et al.

When stochastic gradient descent (SGD) is used to solve large-scale machine learning problems, a common data-processing practice is to shuffle the training data, partition it across multiple machines if needed, and then run several epochs of training on the re-shuffled (either locally or globally) data. Under this procedure, the instances used to compute the gradients are no longer independently sampled from the training data set. Does distributed SGD still have desirable convergence properties in this practical setting? In this paper, we answer this question. First, we give a mathematical formulation of the practical data-processing procedure in distributed machine learning, which we call data partition with global/local shuffling. We observe that global shuffling is equivalent to without-replacement sampling if the shuffling operations are independent. We prove that SGD with global shuffling has convergence guarantees in both convex and non-convex cases. An interesting finding is that non-convex tasks such as deep learning are better suited to shuffling than convex tasks. Second, we conduct the convergence analysis for SGD with local shuffling. The convergence rate for local shuffling is slower than that for global shuffling, because information is lost when there is no communication between the partitioned data. Finally, we consider the situation in which the permutation after shuffling is not uniformly distributed (insufficient shuffling), and discuss the condition under which this insufficiency does not affect the convergence rate. Our theoretical results provide important insights for large-scale machine learning, especially for selecting data-processing methods that achieve faster convergence and good speedup. Our theoretical findings are verified by extensive experiments on logistic regression and deep neural networks.
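The difference between the two shuffling schemes named in the abstract can be illustrated with a small sketch. The abstract does not give the paper's exact formulation, so the code below rests on assumptions: global shuffling is taken to mean re-permuting the whole data set before every epoch and then re-partitioning it, while local shuffling is taken to mean partitioning once and letting each worker re-permute only its own shard. The function names, the strided partition, and the use of a single seeded generator are all hypothetical choices made for illustration.

```python
import random


def global_shuffle_epochs(n, num_workers, epochs, seed=0):
    """Yield per-epoch shards of example indices under (assumed) global shuffling.

    Before every epoch the full index list is re-permuted and re-partitioned,
    so an example can land on a different worker each epoch.
    """
    rng = random.Random(seed)
    indices = list(range(n))
    for _ in range(epochs):
        rng.shuffle(indices)  # permute the whole data set
        # Strided split: worker k gets positions k, k + num_workers, ...
        yield [indices[k::num_workers] for k in range(num_workers)]


def local_shuffle_epochs(n, num_workers, epochs, seed=0):
    """Yield per-epoch shards of example indices under (assumed) local shuffling.

    The data is shuffled and partitioned once; afterwards each worker only
    re-permutes its own shard, so shard membership never changes.
    """
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)  # one initial shuffle before partitioning
    shards = [indices[k::num_workers] for k in range(num_workers)]
    for _ in range(epochs):
        for shard in shards:
            rng.shuffle(shard)  # reorder within the worker only
        yield [list(shard) for shard in shards]  # copies, safe to keep around


if __name__ == "__main__":
    for name, scheme in (("global", global_shuffle_epochs),
                         ("local", local_shuffle_epochs)):
        print(name)
        for epoch, shards in enumerate(scheme(n=8, num_workers=2, epochs=2)):
            print("  epoch", epoch, shards)
```

In both schemes every example is visited exactly once per epoch, which is why the instances within an epoch are not independently sampled. Under the local scheme the set of examples held by each worker is fixed across epochs, which matches the abstract's remark that there is no communication between the partitioned data.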




Intermittent Pulling with Local Compensation for Communication-Efficient Federated Learning

Federated Learning is a powerful machine learning paradigm to cooperativ...

STL-SGD: Speeding Up Local SGD with Stagewise Communication Period

Distributed parallel stochastic gradient descent algorithms are workhors...

The Convergence of Sparsified Gradient Methods

Distributed training of massive machine learning models, in particular d...

Toward Understanding the Impact of Staleness in Distributed Machine Learning

Many distributed machine learning (ML) systems adopt the non-synchronous...

The Strength of Nesterov's Extrapolation in the Individual Convergence of Nonsmooth Optimization

The extrapolation strategy raised by Nesterov, which can accelerate the ...

PopSGD: Decentralized Stochastic Gradient Descent in the Population Model

The population model is a standard way to represent large-scale decentra...

Proximal SCOPE for Distributed Sparse Learning: Better Data Partition Implies Faster Convergence Rate

Distributed sparse learning with a cluster of multiple machines has attr...
