Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

06/12/2023
by Jiayi Huang, et al.

While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms exist for RL with large state-action spaces when the rewards are heavy-tailed, i.e., admit only finite $(1+\epsilon)$-th moments for some $\epsilon \in (0,1]$. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, Heavy-OFUL, for heavy-tailed linear bandits, achieving an instance-dependent $T$-round regret of $\tilde{O}\big(d\, T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^{T} \nu_t^2} + d\, T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the first of this kind. Here, $d$ is the feature dimension, and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at the $t$-th round. We further show that the above bound is minimax optimal when applied to worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed Heavy-LSVI-UCB, achieves the first computationally efficient, instance-dependent $K$-episode regret of $\tilde{O}\big(d \sqrt{H \mathcal{U}^*}\, K^{\frac{1}{1+\epsilon}} + d \sqrt{H \mathcal{V}^* K}\big)$. Here, $H$ is the length of the episode, and $\mathcal{U}^*$, $\mathcal{V}^*$ are instance-dependent quantities scaling with the central moments of the reward and value functions, respectively. We also provide a matching minimax lower bound $\Omega\big(d H K^{\frac{1}{1+\epsilon}} + d \sqrt{H^3 K}\big)$ to demonstrate the optimality of our algorithm in the worst case. Our results are achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
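For intuition on the minimax-optimality claim, here is a short worked specialization (an illustration under the assumption that the per-round central moments are uniformly bounded, say $\nu_t \le 1$ for all $t$; this assumption is ours and is not stated in the abstract). Under it, the instance-dependent Heavy-OFUL bound collapses to the familiar worst-case rate for heavy-tailed linear bandits:
\[
d\, T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^{T} \nu_t^2}
\;\le\; d\, T^{\frac{1-\epsilon}{2(1+\epsilon)}} \cdot T^{\frac{1}{2}}
\;=\; d\, T^{\frac{(1-\epsilon)+(1+\epsilon)}{2(1+\epsilon)}}
\;=\; d\, T^{\frac{1}{1+\epsilon}},
\]
and since the remaining term $d\, T^{\frac{1-\epsilon}{2(1+\epsilon)}}$ is dominated by $d\, T^{\frac{1}{1+\epsilon}}$, the total regret becomes $\tilde{O}\big(d\, T^{\frac{1}{1+\epsilon}}\big)$, the usual worst-case rate for linear bandits with finite $(1+\epsilon)$-th moment rewards (which further reduces to $\tilde{O}(d\sqrt{T})$ when $\epsilon = 1$).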
