Bad Global Minima Exist and SGD Can Reach Them

06/06/2019
by Shengchao Liu, et al.

Several recent works have aimed to explain why severely overparameterized models generalize well when trained by Stochastic Gradient Descent (SGD). The emergent consensus explanation has two parts: the first is that there are "no bad local minima", while the second is that SGD performs implicit regularization by having a bias towards low-complexity models. We revisit both of these ideas in the context of image classification with common deep neural network architectures. Our first finding is that there exist bad global minima, i.e., models that fit the training set perfectly, yet have poor generalization. Our second finding is that given only unlabeled training data, we can easily construct initializations that will cause SGD to quickly converge to such bad global minima. For example, on CIFAR, CINIC10, and (Restricted) ImageNet, this can be achieved by starting SGD at a model derived by fitting random labels on the training data: while subsequent SGD training (with the correct labels) will reach zero training error, the resulting model will exhibit a test accuracy degradation of up to 40%. Finally, we show that regularization seems to provide SGD with an escape route: once heuristics such as data augmentation are used, starting from a complex model (adversarial initialization) has no effect on the test accuracy.
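The procedure in the abstract amounts to two SGD phases: first fit random labels to obtain an "adversarial" initialization, then continue SGD training from that point with the true labels. The sketch below illustrates this under stated assumptions (not the authors' released code): a ResNet-18 on CIFAR-10, plain SGD with momentum, no data augmentation, and illustrative hyperparameters such as the learning rate, batch size, epoch counts, and the RelabeledDataset helper.

# A minimal sketch, assuming a PyTorch/torchvision setup; model and hyperparameter
# choices are illustrative, not the authors' exact configuration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets, transforms, models

device = "cuda" if torch.cuda.is_available() else "cpu"


class RelabeledDataset(Dataset):
    """Wrap a dataset and replace its labels with a supplied tensor of targets."""

    def __init__(self, base, targets):
        self.base, self.targets = base, targets

    def __len__(self):
        return len(self.base)

    def __getitem__(self, i):
        x, _ = self.base[i]
        return x, int(self.targets[i])


def train(model, loader, epochs, lr=0.01):
    """Plain SGD training (no augmentation, no explicit regularization)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model


# CIFAR-10 with no data augmentation: per the abstract, augmentation removes
# the effect of the adversarial initialization.
transform = transforms.ToTensor()
train_set = datasets.CIFAR10("./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10("./data", train=False, download=True, transform=transform)

model = models.resnet18(num_classes=10).to(device)

# Phase 1 (needs only unlabeled images): fit uniformly random labels to build
# the adversarial initialization.
random_targets = torch.randint(0, 10, (len(train_set),))
random_loader = DataLoader(RelabeledDataset(train_set, random_targets),
                           batch_size=128, shuffle=True)
train(model, random_loader, epochs=100)

# Phase 2: continue SGD from that initialization using the correct labels.
clean_loader = DataLoader(train_set, batch_size=128, shuffle=True)
train(model, clean_loader, epochs=100)

# Evaluation: training error can reach zero while test accuracy lands well below
# that of a model trained from a standard random initialization.
model.eval()
correct = 0
with torch.no_grad():
    for x, y in DataLoader(test_set, batch_size=256):
        correct += (model(x.to(device)).argmax(1).cpu() == y).sum().item()
print(f"test accuracy: {correct / len(test_set):.3f}")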

Related research

05/25/2023
Implicit bias of SGD in L_2-regularized linear DNNs: One-way jumps from high to low rank
The L_2-regularized loss of Deep Linear Networks (DLNs) with more than o...

06/25/2021
Assessing Generalization of SGD via Disagreement
We empirically show that the test error of deep networks can be estimate...

11/07/2022
Highly over-parameterized classifiers generalize since bad solutions are rare
We study the generalization of over-parameterized classifiers where Empi...

06/10/2019
Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization
Most modern learning problems are highly overparameterized, meaning that...

12/25/2018
Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?
Many modern learning tasks involve fitting nonlinear models to data whic...

11/19/2016
Local minima in training of neural networks
There has been a lot of recent interest in trying to characterize the er...

08/12/2017
Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing the Early Softmax Saturation
Over the past few years, softmax and SGD have become a commonly used com...
