Minimax Optimal Reinforcement Learning for Discounted MDPs

10/01/2020
by Jiafan He, et al.

We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) in the tabular setting. We propose a model-based algorithm named UCBVI-γ, which is based on the optimism-in-the-face-of-uncertainty principle and a Bernstein-type bonus. It achieves an Õ(√(SAT)/(1-γ)^1.5) regret bound, where S is the number of states, A is the number of actions, γ is the discount factor, and T is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least Ω̃(√(SAT)/(1-γ)^1.5). Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-γ is near optimal for discounted MDPs.
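The paper's full pseudocode is not reproduced here; as a rough illustration of the ingredients named in the abstract (optimism plus a Bernstein-type exploration bonus for a tabular discounted MDP), the NumPy sketch below performs one round of optimistic value iteration from empirical counts. The function name, constants, and the confidence parameter delta are illustrative assumptions and not the paper's exact specification.

```python
import numpy as np

def optimistic_q_update(counts, reward_sum, trans_counts, gamma, t,
                        delta=0.05, n_iters=200):
    """One optimistic value-iteration round for a tabular discounted MDP.

    counts[s, a]        -- number of visits to state-action pair (s, a)
    reward_sum[s, a]    -- cumulative observed reward at (s, a)
    trans_counts[s, a]  -- empirical next-state counts, shape (S, A, S)
    Returns an optimistic Q-table whose greedy policy is executed next.
    """
    S, A = counts.shape
    n = np.maximum(counts, 1)                  # avoid division by zero
    r_hat = reward_sum / n                     # empirical mean reward
    p_hat = trans_counts / n[:, :, None]       # empirical transition model

    v_max = 1.0 / (1.0 - gamma)                # value range for rewards in [0, 1]
    log_term = np.log(S * A * max(t, 2) / delta)

    q = np.full((S, A), v_max)                 # optimistic initialization
    for _ in range(n_iters):
        v = q.max(axis=1)                      # greedy value estimate
        ev = p_hat @ v                         # E_{p_hat}[V], shape (S, A)
        # Bernstein-type bonus: scales with the empirical variance of V
        # plus a lower-order 1/n term (constants here are illustrative).
        var_v = p_hat @ (v ** 2) - ev ** 2
        bonus = (np.sqrt(2.0 * var_v * log_term / n)
                 + 7.0 * v_max * log_term / (3.0 * n))
        q = np.minimum(r_hat + bonus + gamma * ev, v_max)
    return q
```

In a full agent, this update would be interleaved with acting greedily with respect to the returned Q-table and refreshing the counts, so that the bonus shrinks as state-action pairs are visited more often.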
