On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly-Communicating MDPs

09/30/2022
by Yi Wan, et al.

We show that two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik, & Sutton 2021a) and RVI Q-learning (Abounadi, Bertsekas, & Borkar 2001), converge in weakly-communicating MDPs. Weakly-communicating MDPs are the most general class of MDPs for which a learning algorithm with a single stream of experience can guarantee obtaining a policy that achieves the optimal reward rate. The original convergence proofs of the two algorithms require that all optimal policies induce unichains, which is not necessarily true in weakly-communicating MDPs. To the best of our knowledge, our results are the first to show that average-reward off-policy control algorithms converge in weakly-communicating MDPs. As a direct extension, we show that the average-reward options algorithms introduced by Wan, Naik, & Sutton (2021b) converge if the semi-MDP induced by the options is weakly-communicating. A minimal sketch of the first algorithm's update rule is given below.
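As a rough illustration of the first of these algorithms, here is a minimal tabular sketch of the Differential Q-learning update of Wan, Naik, & Sutton (2021a). The function name, array layout, and the step-size values alpha and eta are illustrative assumptions, not the authors' code.

```python
import numpy as np

def differential_q_update(Q, r_bar, s, a, r, s_next, alpha=0.1, eta=1.0):
    """One off-policy tabular update from a transition (s, a, r, s_next).

    Q     : np.ndarray of shape (num_states, num_actions), action-value estimates
    r_bar : float, running estimate of the optimal reward rate
    """
    # Average-reward TD error: the reward-rate estimate replaces discounting.
    delta = r - r_bar + np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * delta
    # The reward-rate estimate is adjusted by the same TD error, scaled by eta.
    r_bar += eta * alpha * delta
    return Q, r_bar
```

RVI Q-learning differs in that, instead of maintaining a separate reward-rate estimate, it subtracts a reference value f(Q) (for example, the estimated value of a fixed state-action pair) inside the TD error.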
