Convergence to minima for the continuous version of Backtracking Gradient Descent
The main result of this paper is:

Theorem. Let f: R^k → R be a C^1 function such that ∇f is locally Lipschitz continuous. Assume moreover that f is C^2 near its generalised saddle points. Fix real numbers δ_0 > 0 and 0 < α < 1. Then there is a smooth function h: R^k → (0, δ_0] so that the map H: R^k → R^k defined by H(x) = x − h(x)∇f(x) has the following properties:

(i) For all x ∈ R^k, we have f(H(x)) − f(x) ≤ −α h(x)||∇f(x)||^2.

(ii) For every x_0 ∈ R^k, the sequence x_{n+1} = H(x_n) either satisfies lim_{n→∞} ||x_{n+1} − x_n|| = 0 or lim_{n→∞} ||x_n|| = ∞. Every cluster point of {x_n} is a critical point of f. If moreover f has at most countably many critical points, then {x_n} either converges to a critical point of f or lim_{n→∞} ||x_n|| = ∞.

(iii) There is a set E_1 ⊂ R^k of Lebesgue measure 0 so that for all x_0 ∈ R^k\E_1, the sequence x_{n+1} = H(x_n), if it converges, cannot converge to a generalised saddle point.

(iv) There is a set E_2 ⊂ R^k of Lebesgue measure 0 so that for all x_0 ∈ R^k\E_2, no cluster point of the sequence x_{n+1} = H(x_n) is a saddle point; more generally, no cluster point is an isolated generalised saddle point.

Some other results are also proven.
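To make the descent condition (i) concrete, here is a minimal numerical sketch of backtracking gradient descent in Python. The test objective, parameter values, and function names are illustrative assumptions, not taken from the paper; in particular, the theorem constructs a single smooth step-size function h, whereas this sketch simply re-runs the backtracking search at each iterate.

```python
import numpy as np

def backtracking_gd(f, grad_f, x0, delta0=1.0, alpha=0.5, beta=0.5,
                    tol=1e-8, max_iter=10_000):
    """Gradient descent with an Armijo-type backtracking step size.

    At each iterate x, the step size delta is shrunk by the factor beta,
    starting from delta0, until the descent condition
        f(x - delta * grad_f(x)) - f(x) <= -alpha * delta * ||grad_f(x)||^2
    holds, mirroring property (i) of the theorem.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        gnorm2 = g @ g
        if gnorm2 < tol**2:  # gradient nearly zero: close to a critical point
            break
        delta = delta0
        # Backtrack: for a C^1 function and alpha < 1 this loop terminates,
        # since the condition holds for all sufficiently small delta.
        while f(x - delta * g) - f(x) > -alpha * delta * gnorm2:
            delta *= beta
        x = x - delta * g
    return x

# Illustrative objective (an assumption for this sketch): a saddle point at
# the origin and minima at (±1, 0). Starting away from the saddle's stable
# manifold, the iterates converge to a minimum, consistent with (iii)-(iv).
f = lambda x: (x[0]**2 - 1.0)**2 + x[1]**2
grad_f = lambda x: np.array([4.0 * x[0] * (x[0]**2 - 1.0), 2.0 * x[1]])
print(backtracking_gd(f, grad_f, x0=[0.3, 0.4]))  # ≈ [1., 0.]
```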