Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron

by   Weihang Xu, et al.

We revisit the problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We particularly focus on the over-parameterization setting where the student network has n≥ 2 neurons. We prove the global convergence of randomly initialized gradient descent with a O(T^-3) rate. This is the first global convergence result for this problem beyond the exact-parameterization setting (n=1) in which the gradient descent enjoys an exp(-Ω(T)) rate. Perhaps surprisingly, we further present an Ω(T^-3) lower bound for randomly initialized gradient flow in the over-parameterization setting. These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that over-parameterization can exponentially slow down the convergence rate. To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We use a three-phase structure to analyze GD's dynamics. Along the way, we prove gradient descent automatically balances student neurons, and use this property to deal with the non-smoothness of the objective function. To prove the convergence rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show this potential function converges slowly, which implies the slow convergence rate of the loss function.


page 1

page 2

page 3

page 4


Gradient Descent Provably Optimizes Over-parameterized Neural Networks

One of the mystery in the success of neural networks is randomly initial...

Demystifying the Global Convergence Puzzle of Learning Over-parameterized ReLU Nets in Very High Dimensions

This theoretical paper is devoted to developing a rigorous theory for de...

Learning a Single Neuron with Bias Using Gradient Descent

We theoretically study the fundamental problem of learning a single neur...

Magnitude and Angle Dynamics in Training Single ReLU Neurons

To understand learning the dynamics of deep ReLU networks, we investigat...

On Gradient Descent Convergence beyond the Edge of Stability

Gradient Descent (GD) is a powerful workhorse of modern machine learning...

On the Convergence of Shallow Neural Network Training with Randomly Masked Neurons

Given a dense shallow neural network, we focus on iteratively creating, ...

Plateau Phenomenon in Gradient Descent Training of ReLU networks: Explanation, Quantification and Avoidance

The ability of neural networks to provide `best in class' approximation ...

Please sign up or login with your details

Forgot password? Click here to reset