Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space

02/21/2023
by Jonatha Anselmi, et al.

In this paper, we revisit the regret of undiscounted reinforcement learning in MDPs with a birth and death structure. Specifically, we consider a controlled queue with impatient jobs, where the main objective is to optimize a trade-off between energy consumption and user-perceived performance. Within this setting, the diameter D of the MDP is Ω(S^S), where S is the number of states. Therefore, the existing lower and upper bounds on the regret at time T, of order O(√(DSAT)) for MDPs with S states and A actions, may suggest that reinforcement learning is inefficient here. In our main result, however, we exploit the structure of our MDPs to show that the regret of a slightly tweaked version of the classical learning algorithm Ucrl2 is in fact upper bounded by 𝒪̃(√(E_2AT)), where E_2 is related to the weighted second moment of the stationary measure of a reference policy. Importantly, E_2 is bounded independently of S. Thus, our bound is asymptotically independent of the number of states and of the diameter. This result is based on a careful study of the number of visits performed by the learning algorithm to the states of the MDP, which is highly non-uniform.
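To give some intuition for why a second moment of the stationary measure can stay bounded as S grows, here is a minimal Python sketch. It is an illustration under assumed dynamics, not the paper's model or its definition of E_2: the function names, the fixed service rate (standing in for a reference policy), and the rule that each waiting job abandons at a constant rate are all hypothetical choices made for this example.

```python
def stationary_distribution(num_states, arrival_rate, service_rate, impatience_rate):
    """Stationary measure of a birth-death chain on states 0..num_states-1.

    Assumed (hypothetical) dynamics: births occur at rate `arrival_rate`,
    and deaths in state i occur at rate `service_rate + i * impatience_rate`.
    """
    # Detailed balance: pi(i) * death(i) = pi(i - 1) * birth(i - 1).
    weights = [1.0]
    for i in range(1, num_states):
        death_rate = service_rate + i * impatience_rate
        weights.append(weights[-1] * arrival_rate / death_rate)
    total = sum(weights)
    return [w / total for w in weights]


def second_moment(pi):
    """Unweighted second moment sum_i i^2 * pi(i) of a distribution on 0..len(pi)-1."""
    return sum(i * i * p for i, p in enumerate(pi))


# Illustrative parameters only; they are not taken from the paper.
for num_states in (10, 50, 200):
    pi = stationary_distribution(num_states, 1.0, 1.0, 0.5)
    print(num_states, second_moment(pi))
```

Because the death rate grows linearly with the state, the stationary weights decay faster than geometrically, so adding more states barely changes the second moment. This is only a plausible reading of why a quantity like E_2 can be independent of S; the paper's actual weighting is not specified in this abstract.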


research
03/05/2018

Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs

The problem of reinforcement learning in an unknown and discrete Markov ...
research
06/23/2020

Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

Modern tasks in reinforcement learning are always with large state and a...
research
02/12/2020

Regret Bounds for Discounted MDPs

Recently, it has been shown that carefully designed reinforcement learni...
research
02/12/2018

Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

We introduce SCAL, an algorithm designed to perform efficient exploratio...
research
07/12/2014

Extreme State Aggregation Beyond MDPs

We consider a Reinforcement Learning setup where an agent interacts with...
research
06/03/2013

Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Given a Markov Decision Process (MDP) with n states and a total number m ...
research
05/10/2019

Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management

We consider a stochastic inventory control problem under censored demand...
