Gap-Dependent Unsupervised Exploration for Reinforcement Learning

08/11/2021
by Jingfeng Wu, et al.

For the problem of task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals, and is then given a reward function and asked to compute a corresponding near-optimal policy. Existing approaches mainly address the worst-case scenario, in which no structural information about the reward or the transition dynamics is utilized. Consequently, the best achievable sample upper bound is proportional to 𝒪(1/ϵ^2), where ϵ>0 is the target accuracy of the obtained policy, and this bound can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that utilizes a gap parameter, ρ>0, to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only 𝒪(1/ϵ · (H^3 S A / ρ + H^4 S^2 A)) episodes of exploration and is able to obtain an ϵ-optimal policy for a post-revealed reward with sub-optimality gap at least ρ, where S is the number of states, A is the number of actions, and H is the length of the horizon; this yields a nearly quadratic saving in terms of ϵ. We show that, information-theoretically, this bound is nearly tight for ρ < Θ(1/(HS)) and H > 1. We further show that a sample bound proportional to 𝒪(1), i.e., independent of ϵ, is possible for H=1 (i.e., multi-armed bandit) or with a sampling simulator, establishing a stark separation between those settings and the RL setting.
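
As a rough illustration of the two-phase protocol the abstract describes (reward-free exploration first, then planning for a post-revealed reward), the sketch below runs a tabular finite-horizon example. It is not the paper's gap-dependent algorithm: the uniform-random exploration policy, the state/action/horizon sizes, and the number of exploration episodes are all placeholder assumptions chosen for the example.

    import numpy as np

    # Minimal sketch of task-agnostic RL on a small tabular finite-horizon MDP.
    # Phase 1 explores without observing rewards and stores transition counts;
    # Phase 2 receives a reward function and plans on the empirical model.
    # (Illustrative only; not the gap-dependent algorithm from the paper.)

    S, A, H = 5, 3, 4            # hypothetical sizes: states, actions, horizon
    rng = np.random.default_rng(0)

    # Hypothetical ground-truth dynamics: P[h, s, a] is a distribution over next states.
    P = rng.dirichlet(np.ones(S), size=(H, S, A))

    def explore(num_episodes):
        """Phase 1: reward-free exploration with a uniform-random policy
        (a placeholder for a more careful exploration strategy)."""
        counts = np.zeros((H, S, A, S))
        for _ in range(num_episodes):
            s = 0
            for h in range(H):
                a = rng.integers(A)
                s_next = rng.choice(S, p=P[h, s, a])
                counts[h, s, a, s_next] += 1
                s = s_next
        return counts

    def plan(counts, reward):
        """Phase 2: backward induction on the empirical model built from the
        exploration counts, for the reward revealed after exploration."""
        n = counts.sum(axis=-1, keepdims=True)
        P_hat = np.where(n > 0, counts / np.maximum(n, 1), 1.0 / S)
        V = np.zeros(S)
        policy = np.zeros((H, S), dtype=int)
        for h in reversed(range(H)):
            Q = reward[h] + P_hat[h] @ V        # shape (S, A)
            policy[h] = Q.argmax(axis=1)
            V = Q.max(axis=1)
        return policy, V

    counts = explore(num_episodes=2000)
    reward = rng.uniform(size=(H, S, A))        # reward revealed only after exploration
    policy, V = plan(counts, reward)
    print("estimated optimal value at the initial state:", V[0])

The point of the paper's result is that, when the revealed reward has a sub-optimality gap of at least ρ, far fewer exploration episodes suffice (roughly 1/ϵ instead of 1/ϵ^2) than a worst-case analysis of a scheme like the one above would suggest.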
