On the Convergence of Monte Carlo UCB for Random-Length Episodic MDPs

09/07/2022
by Zixuan Dong, et al.

In reinforcement learning, Monte Carlo algorithms update the Q function by averaging the episodic returns. In the Monte Carlo UCB (MC-UCB) algorithm, the action taken in each state is the one that maximizes the Q function plus a UCB exploration term, which biases the choice of actions toward those that have been chosen less frequently. Although there has been significant work on establishing regret bounds for MC-UCB, most of that work has focused on finite-horizon versions of the problem, in which each episode terminates after a constant number of steps. For such finite-horizon problems, the optimal policy depends on both the current state and the time within the episode. However, many natural episodic problems, such as games like Go and Chess and robotic tasks, have episodes of random length, and the optimal policy is stationary. For such environments, it is an open question whether the Q function in MC-UCB will converge to the optimal Q function; we conjecture that, unlike Q-learning, it does not converge for all MDPs. We nevertheless show that for a large class of MDPs, which includes stochastic MDPs such as blackjack and deterministic MDPs such as Go, the Q function in MC-UCB converges almost surely to the optimal Q function. An immediate corollary of this result is that it also converges almost surely for all finite-horizon MDPs. We also present numerical experiments that provide further insight into MC-UCB.
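To make the setup concrete, the following is a minimal tabular sketch of the kind of algorithm the abstract describes: UCB action selection on top of Monte Carlo averaging of episodic returns. The environment interface (env.reset, env.step, env.actions), the exploration constant c, and the every-visit averaging are illustrative assumptions, not the paper's exact formulation.

```python
import math

def mc_ucb_episode(env, Q, N, gamma=1.0, c=1.0):
    """One episode of a tabular MC-UCB sketch (illustrative, not the paper's exact algorithm).

    Q[s][a] -- current Monte Carlo estimate of the return from taking a in s
    N[s][a] -- number of times action a has been taken in state s
    gamma   -- discount factor
    c       -- UCB exploration constant (assumed form of the bonus)
    """
    trajectory = []                      # (state, action, reward) triples
    state, done = env.reset(), False
    while not done:
        total = sum(N[state].values()) + 1

        def ucb(a):
            n = N[state][a]
            if n == 0:
                return float("inf")      # try unvisited actions first
            # Q value plus an exploration bonus that shrinks with visit count
            return Q[state][a] + c * math.sqrt(math.log(total) / n)

        action = max(env.actions(state), key=ucb)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state

    # Monte Carlo update: fold the episodic return into a running average.
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + gamma * G
        N[s][a] += 1
        Q[s][a] += (G - Q[s][a]) / N[s][a]
    return Q, N
```

Because the episode length is random rather than fixed, the same Q table is reused across the whole episode (a stationary policy), which is precisely the regime whose convergence the paper analyzes.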


Related research

02/10/2020  On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning
A simple and natural algorithm for reinforcement learning is Monte Carlo...

09/26/2019  Action Selection for MDPs: Anytime AO* vs. UCT
In the presence of non-admissible heuristics, A* and other best-first al...

06/07/2023  Convergence of SARSA with linear function approximation: The random horizon case
The reinforcement learning algorithm SARSA combined with linear function...

03/13/2018  Active Reinforcement Learning with Monte-Carlo Tree Search
Active Reinforcement Learning (ARL) is a twist on RL where the agent obs...

06/03/2018  Exploration in Structured Reinforcement Learning
We address reinforcement learning problems with finite state and action ...

06/04/2018  TD or not TD: Analyzing the Role of Temporal Differencing in Deep Reinforcement Learning
Our understanding of reinforcement learning (RL) has been shaped by theo...

02/11/2022  Exploration of Differentiability in a Proton Computed Tomography Simulation Framework
Objective. Algorithmic differentiation (AD) can be a useful technique to...
