Gradient Descent and the Power Method: Exploiting their connection to find the leftmost eigen-pair and escape saddle points

by Rachael Tappenden, et al.

This work shows that applying Gradient Descent (GD) with a fixed step size to minimize a (possibly nonconvex) quadratic function is equivalent to running the Power Method (PM) on the gradients. The connection between GD with a fixed step size and the PM, both with and without fixed momentum, is thus established. Consequently, valuable eigen-information is available via GD. Recent examples show that GD with a fixed step size, applied to locally quadratic nonconvex functions, can take exponential time to escape saddle points (Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Aarti Singh, and Barnabas Poczos: "Gradient descent can take exponential time to escape saddle points"; S. Paternain, A. Mokhtari, and A. Ribeiro: "A newton-based method for nonconvex optimization with fast evasion of saddle points"). Here, those examples are revisited and it is shown that eigenvalue information was missing, so that the examples may not provide a complete picture of the potential practical behaviour of GD. Thus, ongoing investigation of the behaviour of GD on nonconvex functions, possibly with an adaptive or variable step size, is warranted. It is shown that, in the special case of a quadratic in R^2, if an eigenvalue is known, then GD with a fixed step size will converge in two iterations, and a complete eigen-decomposition is available. By considering the dynamics of the gradients and iterates, new step size strategies are proposed to improve the practical performance of GD. Several numerical examples are presented, which demonstrate the advantages of exploiting the GD–PM connection.
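The core observation can be illustrated numerically. For f(x) = ½ xᵀAx, the gradient is g = Ax, and one GD step x⁺ = x − αg gives g⁺ = (I − αA)g, i.e. the power method on I − αA applied to the gradient sequence. The Rayleigh-quotient-style ratio of successive gradients then converges to 1 − αλ, where λ is the eigenvalue of A maximizing |1 − αλ| (for small α on an indefinite A, the leftmost one). The sketch below is a hypothetical illustration of this mechanism, not the paper's exact algorithm; the matrix, step size, and iteration count are chosen for demonstration only.

```python
import numpy as np

# Build a symmetric indefinite test matrix with known spectrum.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
eigs = np.array([-2.0, 0.5, 1.0, 3.0, 4.0])   # leftmost eigenvalue is -2
A = Q @ np.diag(eigs) @ Q.T

alpha = 0.1                                    # fixed GD step size
x = rng.standard_normal(5)
g = A @ x                                      # gradient of 0.5 * x^T A x
for _ in range(200):
    x = x - alpha * g                          # plain gradient descent
    g_new = A @ x                              # equals (I - alpha*A) @ g
    rho = (g_new @ g) / (g @ g)                # power-method ratio -> 1 - alpha*lambda
    g = g_new

# Recover the eigenvalue driving the gradient dynamics.
lambda_left = (1.0 - rho) / alpha
print(lambda_left)
```

Here |1 − 0.1λ| is largest at λ = −2, so the gradients align with the leftmost eigenvector and `lambda_left` recovers −2: GD's own iterates expose exactly the eigen-information needed to diagnose (and escape) the saddle.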


