Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization

06/10/2019
by Navid Azizan, et al.

Most modern learning problems are highly overparameterized, meaning that there are many more parameters than training data points, and as a result the training loss may have infinitely many global minima (parameter vectors that perfectly interpolate the training data). It is therefore important to understand which interpolating solutions we converge to, how they depend on the initialization point and the learning algorithm, and whether they lead to different generalization performance. In this paper, we study these questions for the family of stochastic mirror descent (SMD) algorithms, of which the popular stochastic gradient descent (SGD) is a special case. Our contributions are both theoretical and experimental. On the theory side, we show that in the overparameterized nonlinear setting, if the initialization is close enough to the manifold of global minima (something that comes for free in the highly overparameterized case), SMD with a sufficiently small step size converges to a global minimum that is approximately the closest one in Bregman divergence. On the experimental side, our extensive experiments on standard datasets and models, using various initializations, various mirror descents, and various Bregman divergences, consistently confirm that this phenomenon occurs in deep learning. Our experiments further indicate that there is a clear difference in the generalization performance of the solutions obtained by different SMD algorithms. Experiments on a standard image dataset and network architecture with SMD under different kinds of implicit regularization (ℓ_1 to encourage sparsity, ℓ_2 yielding SGD, and ℓ_10 to discourage large components in the parameter vector) consistently and definitively show that ℓ_10-SMD has better generalization performance than SGD, which in turn has better generalization performance than ℓ_1-SMD.
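To make the algorithm family concrete, the following is a minimal NumPy sketch (not from the paper; the function name and the q-norm parameterization are illustrative assumptions) of a single SMD update with the potential ψ(w) = (1/q)‖w‖_q^q for q > 1. The mirror update is ∇ψ(w_{t+1}) = ∇ψ(w_t) - η ∇L_i(w_t), and q = 2 recovers ordinary SGD.

```python
import numpy as np

def smd_qnorm_step(w, grad, lr, q=2.0):
    """One stochastic mirror descent (SMD) step with the q-norm potential
    psi(w) = (1/q) * ||w||_q^q, for q > 1.

    The mirror update sets grad_psi(w_new) = grad_psi(w) - lr * grad,
    where grad_psi(w) = sign(w) * |w|^(q-1); the inverse mirror map is
    applied in closed form. q = 2 reduces to a plain SGD step.
    (Illustrative sketch only, not the authors' experimental code.)
    """
    z = np.sign(w) * np.abs(w) ** (q - 1) - lr * grad   # update in mirror (dual) space
    return np.sign(z) * np.abs(z) ** (1.0 / (q - 1))    # map back to parameter space

# Example: a single step with the q = 10 potential, which discourages large components.
w = np.array([0.5, -1.2, 0.3])
grad = np.array([0.1, -0.2, 0.05])   # stochastic gradient of the loss at w
w_next = smd_qnorm_step(w, grad, lr=1e-3, q=10.0)
```

Per the paper's main result, iterating such steps from an initialization near the manifold of interpolating solutions converges (for a sufficiently small step size) to the interpolating solution that is approximately closest in the Bregman divergence of ψ, which is what produces the different implicit regularizations compared in the abstract.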

