Square-root regret bounds for continuous-time episodic Markov decision processes

10/03/2022
by   Xuefeng Gao, et al.

We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. We present a learning algorithm based on value iteration and upper confidence bounds. We derive an upper bound on the worst-case expected regret of the proposed algorithm and establish a matching worst-case lower bound, both of which scale as the square root of the number of episodes. Finally, we conduct simulation experiments to illustrate the performance of our algorithm.
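To illustrate the general idea of combining value iteration with upper confidence bounds, here is a minimal sketch of optimistic backward induction in a *discrete-time* episodic MDP (the paper's algorithm operates in continuous time; the function name, the bonus form `1/sqrt(counts)`, and the clipping at the remaining horizon are generic illustrative choices, not the authors' method):

```python
import numpy as np

def ucb_value_iteration(P_hat, R_hat, counts, H, bonus_scale=1.0):
    """UCB-style optimistic backward induction (illustrative sketch).

    P_hat:  (S, A, S) empirical transition probabilities
    R_hat:  (S, A)    empirical mean rewards in [0, 1]
    counts: (S, A)    visit counts driving the confidence bonus
    H:      horizon (steps per episode)
    """
    S, A = R_hat.shape
    V = np.zeros((H + 1, S))            # V[H] = 0 at the terminal step
    Q = np.zeros((H, S, A))
    # Generic Hoeffding-style bonus; the exact form in the paper differs.
    bonus = bonus_scale / np.sqrt(np.maximum(counts, 1))
    for h in range(H - 1, -1, -1):
        # Optimistic Bellman backup: reward + bonus + expected next value.
        Q[h] = R_hat + bonus + P_hat @ V[h + 1]
        Q[h] = np.minimum(Q[h], H - h)  # clip at max achievable value
        V[h] = Q[h].max(axis=1)
    policy = Q.argmax(axis=2)           # greedy w.r.t. optimistic Q
    return Q, V, policy
```

Acting greedily with respect to the optimistic `Q` implements the "optimism in the face of uncertainty" principle: states and actions with few visits receive a large bonus and are therefore explored.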

