√(n)-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank

09/05/2019
by   Kefan Dong, et al.

In this paper, we consider the problem of online learning in Markov decision processes (MDPs) with very large state spaces. Under the assumptions of realizable function approximation and low Bellman rank, we develop an online learning algorithm that learns the optimal value function while also achieving low cumulative regret during the learning process. Our learning algorithm, Adaptive Value-function Elimination (AVE), is inspired by OLIVE, the policy elimination algorithm proposed by Jiang et al. (2017). One of our key technical contributions in AVE is to formulate the elimination steps in OLIVE as contextual bandit problems. This formulation enables us to apply the active elimination and expert weighting methods of Dudik et al. (2011), instead of the random action exploration scheme used in the original OLIVE algorithm, yielding more efficient exploration and better control of the regret incurred in each policy elimination step. To the best of our knowledge, this is the first √(n)-regret result for reinforcement learning in stochastic MDPs with general value function approximation.
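
For readers unfamiliar with the term, cumulative regret in episodic reinforcement learning is commonly defined as sketched below. The notation is an assumption made for illustration and is not taken from the paper: n is the number of episodes, s_1 the initial state, \pi_t the policy played in episode t, and V^{*} and V^{\pi_t} the optimal and achieved value functions.

```latex
% Standard episodic regret definition; notation assumed, not quoted from the paper.
\mathrm{Regret}(n) \;=\; \sum_{t=1}^{n} \Bigl( V^{*}(s_1) - V^{\pi_t}(s_1) \Bigr),
\qquad
\text{and a } \sqrt{n}\text{-regret bound means } \mathrm{Regret}(n) = \tilde{O}\bigl(\sqrt{n}\bigr).
```

Here \tilde{O} hides logarithmic factors and the dependence on problem parameters other than n. A bound of this form implies that the average per-episode regret vanishes at rate roughly 1/\sqrt{n}.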

research
10/21/2022

On the connection between Bregman divergence and value in regularized Markov decision processes

In this short note we derive a relationship between the Bregman divergen...
research
05/30/2023

Solving Robust MDPs through No-Regret Dynamics

Reinforcement Learning is a powerful framework for training agents to na...
research
06/03/2021

A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes

This paper presents the first model-free, simulator-free reinforcement l...
research
09/12/2021

Improved Algorithms for Misspecified Linear Markov Decision Processes

For the misspecified linear Markov decision process (MLMDP) model of Jin...
research
02/28/2019

Active Exploration in Markov Decision Processes

We introduce the active exploration problem in Markov decision processes...
research
01/21/2020

TopRank+: A Refinement of TopRank Algorithm

Online learning to rank is a core problem in machine learning. In Lattim...
research
01/27/2019

Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift

In this paper we revisit the method of off-policy corrections for reinfo...
