On the Expected Dynamics of Nonlinear TD Learning

05/29/2019
by David Brandfonbrener, et al.

While there are convergence guarantees for temporal difference (TD) learning with linear function approximators, the situation for nonlinear models is far less understood, and divergent examples are known. Here we take a first step towards extending theoretical convergence guarantees to TD learning with nonlinear function approximation. More precisely, we consider the expected dynamics of the TD(0) algorithm. We prove that the trajectories of this ODE are attracted to a compact set for smooth homogeneous functions, including some ReLU networks. For over-parametrized and well-conditioned functions in sufficiently reversible environments, we prove convergence to the global optimum. This result improves when using k-step or λ-returns. Finally, we generalize a divergent counterexample to a family of divergent problems, motivating the assumptions needed to prove convergence.
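As a rough illustration (not the authors' code), the expected TD(0) dynamics referenced above are the ODE θ̇ = E[(r + γ V_θ(s') − V_θ(s)) ∇_θ V_θ(s)], whose stochastic counterpart is the semi-gradient TD(0) update. The sketch below runs that update with a small ReLU network on a hypothetical two-state MDP; the network, MDP, and all parameter values are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: semi-gradient TD(0) with a small ReLU value network.
# The expectation of the update direction, E[delta * grad V(s)], defines
# the ODE whose attractor/convergence properties the paper studies.

rng = np.random.default_rng(0)

# Hypothetical toy MDP: 2 states, uniform random transitions, fixed rewards.
n_states, gamma, alpha = 2, 0.9, 0.01
P = np.array([[0.5, 0.5], [0.5, 0.5]])   # transition matrix
R = np.array([1.0, 0.0])                 # reward per state
X = np.eye(n_states)                     # one-hot state features

# Two-layer ReLU value network: V(s) = w2 . relu(W1 x_s)
hidden = 8
W1 = rng.normal(scale=0.5, size=(hidden, n_states))
w2 = rng.normal(scale=0.5, size=hidden)

def value_and_grads(x):
    h = np.maximum(W1 @ x, 0.0)          # ReLU activations
    v = w2 @ h
    mask = (h > 0.0).astype(float)       # ReLU derivative
    gW1 = np.outer(w2 * mask, x)         # dV/dW1
    gw2 = h                              # dV/dw2
    return v, gW1, gw2

s = 0
for step in range(20000):
    s_next = rng.choice(n_states, p=P[s])
    v, gW1, gw2 = value_and_grads(X[s])
    v_next, _, _ = value_and_grads(X[s_next])
    # TD error; semi-gradient: no gradient flows through v_next.
    delta = R[s] + gamma * v_next - v
    W1 += alpha * delta * gW1
    w2 += alpha * delta * gw2
    s = s_next

# Compare the learned values to the true ones, v* = (I - gamma P)^{-1} R.
v_star = np.linalg.solve(np.eye(n_states) - gamma * P, R)
print([value_and_grads(X[i])[0] for i in range(n_states)], v_star)
```

In this over-parametrized toy setting (8 hidden units for 2 states) the learned values typically land near v* = (5.5, 4.5); the paper's results concern when such convergence is guaranteed and when it can fail.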
