On Gradient Descent Convergence beyond the Edge of Stability

by Lei Chen et al.

Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz-continuous gradients, where it can be seen as a 'bona-fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called "Edge of Stability", where the step-size crosses the classical admissibility threshold, inversely proportional to the Lipschitz constant of the gradient. Perhaps surprisingly, GD has been empirically observed to still converge despite this local instability. In this work, we study a local condition for such unstable convergence around a local minimum in a low-dimensional setting. We then leverage these insights to establish global convergence of a two-layer single-neuron ReLU student network aligning with the teacher neuron, trained under the population loss with a large learning rate beyond the Edge of Stability. Moreover, while gradient flow preserves the difference of the norms of the two layers, we show that GD above the Edge of Stability induces a balancing effect, leading to the same norms across the layers.
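The two phenomena in the abstract can be illustrated numerically on toy objectives. The sketch below is a hedged illustration, not the paper's ReLU student-teacher analysis: a quadratic shows the classical stability threshold (GD on a loss with gradient-Lipschitz constant `lam` is stable only for step-sizes below `2/lam`), and a scalar two-layer model `f(a, w) = (a*w - 1)^2 / 2` shows the balancing effect, since gradient flow on this model preserves `a^2 - w^2` while large-step GD shrinks the imbalance between `|a|` and `|w|`. The hyperparameters (`eta`, initial values) are illustrative choices, not from the paper.

```python
def gd_quadratic(eta, lam=1.0, x0=1.0, steps=200):
    """GD on f(x) = (lam/2) x^2, whose gradient is lam*x.

    The iterate map is x -> (1 - eta*lam) x, so GD converges
    iff eta < 2/lam -- the classical admissibility threshold.
    """
    x = x0
    for _ in range(steps):
        x -= eta * lam * x
    return abs(x)


def gd_two_layer(eta=0.3, a=3.0, w=0.2, steps=500):
    """GD on the toy two-layer model f(a, w) = 0.5 * (a*w - 1)^2.

    Gradient flow conserves a^2 - w^2, so an unbalanced start stays
    unbalanced; with a large step-size, GD cannot settle at sharp
    (unbalanced) minima, whose sharpness a^2 + w^2 exceeds 2/eta,
    and drifts toward a more balanced one.
    """
    for _ in range(steps):
        r = a * w - 1.0  # residual
        # simultaneous update of both layers
        a, w = a - eta * r * w, w - eta * r * a
    return a, w
```

For instance, with `lam = 1`, `gd_quadratic(1.9)` shrinks to zero while `gd_quadratic(2.1)` blows up, and `gd_two_layer()` started at `(a, w) = (3.0, 0.2)` (imbalance `|a| - |w| = 2.8`) reaches a global minimum `a*w ≈ 1` with a much smaller gap between the layer norms.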


