Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms

09/20/2020
by   Mengfan Xu, et al.
EXP-based algorithms are often used for exploration in multi-armed bandit problems. We revisit the EXP3.P algorithm and establish both lower and upper regret bounds in the Gaussian multi-armed bandit setting, as well as for a more general class of reward distributions. Unlike classical regret analyses, ours does not require bounded rewards. We also extend EXP4 from multi-armed bandits to reinforcement learning to incentivize exploration by multiple agents. The resulting algorithm has been tested on hard-to-explore games, and it shows improved exploration compared to state-of-the-art methods.
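For context, a minimal sketch of the classic EXP3-style exponential-weights scheme that this family of algorithms builds on (not the paper's EXP3.P variant with its confidence terms, and with bounded Bernoulli rewards rather than the unbounded Gaussian rewards analyzed above; the `payoff` function and arm means below are illustrative assumptions):

```python
import math
import random

def exp3(num_arms, rewards, gamma=0.1, horizon=1000, seed=0):
    """Minimal EXP3-style sketch: exponential weights mixed with
    uniform exploration, updated via importance-weighted estimates.
    Returns the cumulative reward over `horizon` rounds."""
    rng = random.Random(seed)
    weights = [1.0] * num_arms
    total = 0.0
    for _ in range(horizon):
        s = sum(weights)
        # mix the weight-based distribution with uniform exploration
        probs = [(1 - gamma) * w / s + gamma / num_arms for w in weights]
        # sample an arm from the categorical distribution `probs`
        r, acc, arm = rng.random(), 0.0, num_arms - 1
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                arm = i
                break
        reward = rewards(arm, rng)   # stochastic reward in [0, 1]
        total += reward
        est = reward / probs[arm]    # importance-weighted estimate
        weights[arm] *= math.exp(gamma * est / num_arms)
    return total

# Illustrative two-armed Bernoulli instance with means 0.2 and 0.8
means = [0.2, 0.8]
payoff = lambda a, rng: 1.0 if rng.random() < means[a] else 0.0
```

Over 1000 rounds on this instance, the scheme concentrates play on the better arm, so its cumulative reward clearly exceeds the ~500 expected from uniformly random arm pulls.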
