Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

10/17/2021
by Ming Yin, et al.

We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using data collected by a behavior policy $\mu$. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works study this problem under different data-coverage assumptions, and their learning guarantees are expressed through covering coefficients that lack an explicit characterization in terms of system quantities. In this work, we analyze the Adaptive Pessimistic Value Iteration (APVI) algorithm and derive a suboptimality upper bound that nearly matches $O\left(\sum_{h=1}^{H}\sum_{s_h,a_h} d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}(V^\star_{h+1}+r_h)}{d^{\mu}_h(s_h,a_h)}}\cdot\sqrt{\frac{1}{n}}\right)$. Complementing this, we also prove a per-instance information-theoretical lower bound under the weak assumption that $d^{\mu}_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Unlike previous minimax lower bounds, the per-instance lower bound (via local minimaxity) is a much stronger criterion, as it applies to individual instances separately. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy, and $d^{\mu}_h$ is the marginal state-action probability. We call the above expression the intrinsic offline reinforcement learning bound, since it directly implies all the existing optimal results: the minimax rate under the uniform data-coverage assumption, the horizon-free setting, single policy concentrability, and the tight problem-dependent results. Later, we extend the result to the assumption-free regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.
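To make the quantity in the intrinsic bound concrete, the following is a minimal sketch that evaluates the expression inside the $O(\cdot)$ for a small tabular finite-horizon MDP whose model is fully known. It is not APVI and not the authors' code: a learner would not have access to the true transitions, and constants and logarithmic factors hidden by the $O(\cdot)$ are ignored. The function names, the nested-list layout `P[h][s][a][s']`, the deterministic rewards `r[h][s][a]` (so the variance is taken over $V^\star_{h+1}$ only), and the convention of returning infinity when the coverage assumption fails are all assumptions made for this illustration.

```python
import math

def optimal_values(P, r):
    """Backward induction: returns V[h][s] (with V[H] = 0) and a greedy policy pi[h][s]."""
    H, S, A = len(P), len(P[0]), len(P[0][0])
    V = [[0.0] * S for _ in range(H + 1)]
    pi = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            q = [r[h][s][a] + sum(P[h][s][a][t] * V[h + 1][t] for t in range(S))
                 for a in range(A)]
            pi[h][s] = max(range(A), key=q.__getitem__)
            V[h][s] = q[pi[h][s]]
    return V, pi

def occupancy(P, policy, d0):
    """Marginal state-action probabilities d[h][s][a] under a stochastic policy[h][s][a]."""
    H, S, A = len(P), len(P[0]), len(P[0][0])
    d, state = [], list(d0)
    for h in range(H):
        d_h = [[state[s] * policy[h][s][a] for a in range(A)] for s in range(S)]
        d.append(d_h)
        # Push the state distribution one step forward through the transitions.
        state = [sum(d_h[s][a] * P[h][s][a][t] for s in range(S) for a in range(A))
                 for t in range(S)]
    return d

def intrinsic_bound(P, r, mu, d0, n):
    """Evaluates sum_h sum_{s,a} d*_h(s,a) * sqrt(Var / d^mu_h(s,a)) * sqrt(1/n)."""
    H, S, A = len(P), len(P[0]), len(P[0][0])
    V, pi = optimal_values(P, r)
    pi_star = [[[1.0 if a == pi[h][s] else 0.0 for a in range(A)]
                for s in range(S)] for h in range(H)]
    d_star, d_mu = occupancy(P, pi_star, d0), occupancy(P, mu, d0)
    total = 0.0
    for h in range(H):
        for s in range(S):
            for a in range(A):
                if d_star[h][s][a] == 0.0:
                    continue  # pairs the optimal policy never visits contribute nothing
                if d_mu[h][s][a] == 0.0:
                    return math.inf  # sketch convention: coverage assumption violated
                mean = sum(P[h][s][a][t] * V[h + 1][t] for t in range(S))
                var = sum(P[h][s][a][t] * (V[h + 1][t] - mean) ** 2 for t in range(S))
                total += d_star[h][s][a] * math.sqrt(var / d_mu[h][s][a])
    return total / math.sqrt(n)

# Hypothetical 2-state, 2-action, horizon-2 instance; step[s][a][s'] is reused at every h.
step = [[[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
rew = [[0.0, 0.0], [1.0, 0.0]]
P, r, d0 = [step, step], [rew, rew], [1.0, 0.0]
uniform = [[[0.5, 0.5] for _ in range(2)] for _ in range(2)]
bound = intrinsic_bound(P, r, uniform, d0, n=100)
print(bound)
```

In this toy instance only the first step contributes, because the next-step optimal value is random there; at the last step $V^\star_{H+1}=0$, so the variance vanishes. A behavior policy that never takes the optimal action in the initial state violates the coverage condition, and the sketch then returns infinity.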


research
11/23/2022

On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation

Sample-efficient offline reinforcement learning (RL) with linear functio...
research
02/02/2021

Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

We consider the problem of offline reinforcement learning (RL) – a well-...
research
05/05/2022

Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning

Dynamic mechanism design has garnered significant attention from both co...
research
12/30/2020

Is Pessimism Provably Efficient for Offline RL?

We study offline reinforcement learning (RL), which aims to learn an opt...
research
05/13/2021

Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings

This work studies the statistical limits of uniform convergence for offl...
research
03/25/2021

Nearly Horizon-Free Offline Reinforcement Learning

We revisit offline reinforcement learning on episodic time-homogeneous t...
research
06/23/2023

Active Coverage for PAC Reinforcement Learning

Collecting and leveraging data with good coverage properties plays a cru...
