An Elementary Proof that Q-learning Converges Almost Surely

08/05/2021
by Matthew T. Regehr, et al.

Watkins and Dayan's Q-learning is a model-free reinforcement learning algorithm that iteratively refines an estimate of the optimal action-value function of an MDP by stochastically "visiting" many state-action pairs [Watkins and Dayan, 1992]. Variants of the algorithm lie at the heart of numerous recent state-of-the-art achievements in reinforcement learning, including the superhuman Atari-playing deep Q-network [Mnih et al., 2015]. The goal of this paper is to reproduce a precise and (nearly) self-contained proof that Q-learning converges. Much of the available literature leverages powerful theory to obtain highly generalizable results in this vein. However, that approach requires the reader to be familiar with several different research areas and to make many deep connections among them. A student seeking to deepen their understanding of Q-learning risks becoming caught in a vicious cycle of "RL-learning Hell". For this reason, we give a complete proof from start to finish using only one external result from the field of stochastic approximation, even though this minimal dependence on other results comes at the expense of some "shininess".
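
As a rough illustration of what "iteratively refining an estimate" by visiting state-action pairs can look like, here is a minimal sketch of the standard tabular Q-learning update. It is not taken from the paper: the function name `q_learning_update`, the list-of-lists table `Q`, and the values of the step size `alpha` and discount factor `gamma` are all assumptions made for this example.

```python
from typing import List


def q_learning_update(
    Q: List[List[float]],  # hypothetical table: Q[state][action]
    s: int,                # visited state
    a: int,                # action taken in s
    r: float,              # observed reward
    s_next: int,           # observed next state
    alpha: float,          # step size in (0, 1], assumed for illustration
    gamma: float,          # discount factor in [0, 1), assumed for illustration
) -> float:
    """Apply one tabular Q-learning update in place and return the new entry."""
    # Bootstrapped target: reward plus discounted best estimate at the next state.
    target = r + gamma * max(Q[s_next])
    # Move the current estimate a fraction alpha of the way toward the target.
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]


# Two states, two actions, all estimates initialised to zero.
Q = [[0.0, 0.0], [0.0, 0.0]]
first = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
second = q_learning_update(Q, s=1, a=0, r=0.0, s_next=0, alpha=0.5, gamma=0.9)
```

Only the entry for the visited pair changes on each call. Convergence results of the kind the abstract describes concern what happens when such updates are repeated indefinitely under suitable conditions on the step sizes and on how often each pair is visited.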


