Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs

by Naman Agarwal, et al.

Q-learning is a popular Reinforcement Learning (RL) algorithm that is widely used in practice with function approximation (Mnih et al., 2015). In contrast, existing theoretical results are pessimistic about Q-learning. For example, Baird (1995) shows that Q-learning does not converge even with linear function approximation for linear MDPs. Furthermore, even for tabular MDPs with synchronous updates, Q-learning was shown to have sub-optimal sample complexity (Li et al., 2021; Azar et al., 2013). The goal of this work is to bridge the gap between the practical success of Q-learning and these relatively pessimistic theoretical results. The starting point of our work is the observation that in practice, Q-learning is used with two important modifications: (i) training with two networks simultaneously, called the online network and the target network (online target learning, or OTL), and (ii) experience replay (ER) (Mnih et al., 2015). While both have been observed to play a significant role in the practical success of Q-learning, a thorough theoretical understanding of how these two modifications improve the convergence behavior of Q-learning has been missing in the literature. By carefully combining Q-learning with OTL and reverse experience replay (RER), a form of experience replay, we present two novel methods: Q-Rex and Q-RexDaRe (Q-Rex + data reuse). We show that Q-Rex efficiently finds the optimal policy for linear MDPs (or, more generally, for MDPs with zero inherent Bellman error with linear approximation (ZIBEL)) and provide non-asymptotic bounds on its sample complexity, the first such result for a Q-learning method for this class of MDPs under standard assumptions. Furthermore, we demonstrate that Q-RexDaRe in fact achieves near-optimal sample complexity in the tabular setting, improving upon the existing results for vanilla Q-learning.
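
To make the two modifications named in the abstract concrete, the following is a minimal tabular sketch that combines a frozen target table (OTL) with reverse-order processing of a buffer (RER). It is an illustration only, not the authors' Q-Rex or Q-RexDaRe: the function name `q_learning_otl_rer`, the parameters, and the choice to sync the target after every buffer are all assumptions made for this example.

```python
def q_learning_otl_rer(transitions, n_states, n_actions,
                       gamma=0.9, lr=0.5, buffer_size=4, n_passes=1):
    """Toy tabular Q-learning with an online/target table pair and
    reverse experience replay. Hypothetical sketch, not the paper's method.

    transitions: list of (state, action, reward, next_state, done) tuples
                 in the order they were collected.
    """
    q_online = [[0.0] * n_actions for _ in range(n_states)]
    q_target = [row[:] for row in q_online]

    for _ in range(n_passes):
        for start in range(0, len(transitions), buffer_size):
            buffer = transitions[start:start + buffer_size]
            # Reverse experience replay: visit the most recent transition first.
            for s, a, r, s_next, done in reversed(buffer):
                # Bootstrap from the frozen target table, not the online one.
                bootstrap = 0.0 if done else gamma * max(q_target[s_next])
                td_target = r + bootstrap
                q_online[s][a] += lr * (td_target - q_online[s][a])
            # Assumption: copy online -> target once per buffer.
            q_target = [row[:] for row in q_online]
    return q_online


# Usage: a three-state chain with a single action and reward 1 at the end.
chain = [(0, 0, 0.0, 1, False), (1, 0, 0.0, 2, False), (2, 0, 1.0, 2, True)]
q = q_learning_otl_rer(chain, n_states=3, n_actions=1,
                       gamma=0.5, lr=1.0, buffer_size=3, n_passes=3)
```

Because the target table is held fixed while a buffer is processed, each pass over the chain propagates the terminal reward back by one state; after three passes the estimates in this toy example match the discounted returns 0.25, 0.5, and 1.0.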




Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL

While policy optimization algorithms have played an important role in re...

Stabilizing Q-learning with Linear Architectures for Provably Efficient Learning

The Q-learning algorithm is a simple and widely-used stochastic approxim...

Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice

Mirror descent value iteration (MDVI), an abstraction of Kullback-Leible...

Streaming Linear System Identification with Reverse Experience Replay

We consider the problem of estimating a stochastic linear time-invariant...

Convergence Results For Q-Learning With Experience Replay

A commonly used heuristic in RL is experience replay (e.g. <cit.>), in w...

Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design

While much progress has been made in understanding the minimax sample co...

Bilinear Classes: A Structural Framework for Provable Generalization in RL

This work introduces Bilinear Classes, a new structural framework, which...
