Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

by Jiayi Huang, et al.

While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample- or time-efficient algorithms for RL with large state-action spaces exist when the rewards are heavy-tailed, i.e., have only finite (1+ϵ)-th moments for some ϵ∈(0,1]. In this work, we address the challenge of such rewards in RL with linear function approximation. We first design an algorithm, Heavy-OFUL, for heavy-tailed linear bandits, achieving an instance-dependent T-round regret of Õ(d T^{(1-ϵ)/(2(1+ϵ))} √(∑_{t=1}^{T} ν_t^2) + d T^{(1-ϵ)/(2(1+ϵ))}), the first of its kind. Here, d is the feature dimension, and ν_t^{1+ϵ} is the (1+ϵ)-th central moment of the reward at the t-th round. We further show that the above bound is minimax optimal when applied to the worst-case instances in stochastic and deterministic linear bandits. We then extend this algorithm to the RL setting with linear function approximation. Our algorithm, termed Heavy-LSVI-UCB, achieves the first computationally efficient instance-dependent K-episode regret of Õ(d √(H 𝒰^*) K^{1/(1+ϵ)} + d √(H 𝒱^* K)). Here, H is the length of the episode, and 𝒰^*, 𝒱^* are instance-dependent quantities scaling with the central moments of the reward and value functions, respectively. We also provide a matching minimax lower bound Ω(d H K^{1/(1+ϵ)} + d √(H^3 K)) to demonstrate the optimality of our algorithm in the worst case. Our result is achieved via a novel robust self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
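To make the exponents in these bounds concrete, the following minimal sketch evaluates them for a given ϵ. The function names are hypothetical and are not taken from the paper. The worst-case exponent assumes every ν_t is bounded by a constant, so that √(∑ ν_t^2) grows at most like √T; under that assumption the leading bandit term scales as T^{1/(1+ϵ)}.

```python
from fractions import Fraction


def _check_eps(eps):
    """Convert eps to an exact fraction and check that it lies in (0, 1]."""
    eps = Fraction(eps)
    if not 0 < eps <= 1:
        raise ValueError("eps must lie in (0, 1]")
    return eps


def bandit_exponent(eps):
    """Exponent of T multiplying the variance term: (1 - eps) / (2 (1 + eps))."""
    eps = _check_eps(eps)
    return (1 - eps) / (2 * (1 + eps))


def worst_case_bandit_exponent(eps):
    """Exponent of T when sqrt(sum nu_t^2) is of order sqrt(T) (assumption)."""
    return bandit_exponent(eps) + Fraction(1, 2)


def rl_heavy_tail_exponent(eps):
    """Exponent of K in the heavy-tailed term of the RL bound: 1 / (1 + eps)."""
    eps = _check_eps(eps)
    return 1 / (1 + eps)


if __name__ == "__main__":
    for eps in (Fraction(1, 4), Fraction(1, 2), Fraction(1)):
        print(
            f"eps={eps}: bandit={bandit_exponent(eps)}, "
            f"worst-case={worst_case_bandit_exponent(eps)}, "
            f"rl={rl_heavy_tail_exponent(eps)}"
        )
```

At ϵ = 1 (finite variance) the bandit exponent vanishes and the heavy-tailed RL term scales as √K, the same order in K as the second term of the bound.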



Minimax Policy for Heavy-tailed Multi-armed Bandits

We study the stochastic Multi-Armed Bandit (MAB) problem under worst cas...

No-Regret Reinforcement Learning with Heavy-Tailed Rewards

Reinforcement learning algorithms typically assume rewards to be sampled...

Differentially Private Episodic Reinforcement Learning with Heavy-tailed Rewards

In this paper, we study the problem of (finite horizon tabular) Markov d...

Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards

In a broad class of reinforcement learning applications, stochastic rewa...

First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach

Obtaining first-order regret bounds – regret bounds scaling not as the w...

Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design

While much progress has been made in understanding the minimax sample co...

Nearly Optimal Regret for Stochastic Linear Bandits with Heavy-Tailed Payoffs

In this paper, we study the problem of stochastic linear bandits with fi...
