Approximate Q-learning and SARSA(0) under the ε-greedy Policy: a Differential Inclusion Analysis

05/26/2022
by Aditya Gopalan, et al.

Q-learning and SARSA(0) with linear function approximation, under ε-greedy exploration, are leading methods for estimating the optimal policy in Reinforcement Learning (RL). It has long been known empirically that the discontinuous nature of the greedy policies causes these algorithms to exhibit complex phenomena such as (i) instability, (ii) policy oscillation and chattering, (iii) multiple attractors, and (iv) convergence to the worst policy. However, the literature lacks a formal framework that explains these behaviors, and this has been a long-standing open problem (Sutton, 1999). Our work addresses this by building the necessary mathematical machinery using stochastic recursive inclusions and Differential Inclusions (DIs). From this novel viewpoint, our main result states that these approximate algorithms asymptotically converge to suitable invariant sets of DIs, rather than of differential equations as is common elsewhere in RL. Furthermore, the nature of these deterministic DIs completely governs the limiting behaviors of these algorithms.
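For concreteness, here is a minimal sketch of the two algorithms the abstract refers to: semi-gradient Q-learning and SARSA(0) with a linear Q-estimate and ε-greedy action selection. The feature map `phi`, the step size `alpha`, and the environment interface are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def epsilon_greedy(w, phi, state, actions, eps, rng):
    """Pick the greedy action w.r.t. the linear Q-estimate, exploring with prob. eps."""
    if rng.random() < eps:
        return rng.choice(actions)
    q_values = [w @ phi(state, a) for a in actions]
    return actions[int(np.argmax(q_values))]

def q_learning_step(w, phi, s, a, r, s_next, actions, alpha, gamma):
    """One semi-gradient Q-learning update: w += alpha * delta * phi(s, a)."""
    q_sa = w @ phi(s, a)
    q_next = max(w @ phi(s_next, b) for b in actions)  # greedy bootstrap target
    delta = r + gamma * q_next - q_sa                   # TD error
    return w + alpha * delta * phi(s, a)

def sarsa0_step(w, phi, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA(0) update: bootstrap on the action actually taken next."""
    delta = r + gamma * (w @ phi(s_next, a_next)) - w @ phi(s, a)
    return w + alpha * delta * phi(s, a)
```

Note that the argmax inside `epsilon_greedy` (and the max in the Q-learning target) changes discontinuously as `w` moves, so the expected update has a discontinuous drift. This is precisely why the paper replaces the standard ODE-based analysis with differential inclusions.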
