Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism

by Ming Yin et al.

Offline reinforcement learning, which seeks to utilize offline/historical data to optimize sequential decision-making strategies, has gained surging prominence in recent studies. Because appropriate function approximators can mitigate the sample complexity burden in modern reinforcement learning problems, existing endeavors usually enforce powerful function representations (e.g., neural networks) to learn the optimal policies. However, a precise understanding of the statistical limits of offline learning with function representations remains elusive, even when the representation is linear. Toward this goal, we study the statistical limits of offline reinforcement learning with linear model representations. To derive a tight offline learning bound, we design variance-aware pessimistic value iteration (VAPVI), which adopts the conditional variance information of the value function for time-inhomogeneous episodic linear Markov decision processes (MDPs). VAPVI leverages the estimated variances of the value functions to reweight the Bellman residuals in least-squares pessimistic value iteration, and it yields improved offline learning bounds over the best-known existing results (in which the Bellman residuals are equally weighted by design). More importantly, our learning bounds are expressed in terms of system quantities, which provide natural instance-dependent characterizations that previous results lack. We hope our results draw a clearer picture of what offline learning should look like when linear representations are provided.
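To make the mechanism in the abstract concrete, the sketch below shows one variance-reweighted pessimistic backup step: a weighted ridge regression on the Bellman targets followed by an elliptical pessimism penalty. This is a minimal illustration assembled from the abstract's description, not the authors' implementation; the function names (`vapvi_step`, `pessimistic_q`), the clipping of the variance estimates at 1, and the exact form of the bonus are illustrative assumptions.

```python
import numpy as np

def vapvi_step(phi, rewards, next_values, sigma2, lam=1.0):
    """One variance-weighted least-squares backup (illustrative sketch).

    phi         : (n, d) features phi(s_h^k, a_h^k) for n offline transitions
    rewards     : (n,) observed rewards r_h^k
    next_values : (n,) estimated next-step values V_{h+1}(s_{h+1}^k)
    sigma2      : (n,) estimated conditional variances of the value function
    lam         : ridge regularization parameter

    Returns (w, Lambda_inv): the regression weights and inverse covariance.
    """
    n, d = phi.shape
    targets = rewards + next_values            # Bellman regression targets
    weights = 1.0 / np.maximum(sigma2, 1.0)    # variance reweighting (assumed clip at 1)
    # Weighted ridge regression: Lambda = sum_k phi_k phi_k^T / sigma_k^2 + lam * I
    Lambda = (phi * weights[:, None]).T @ phi + lam * np.eye(d)
    Lambda_inv = np.linalg.inv(Lambda)
    w = Lambda_inv @ ((phi * weights[:, None]).T @ targets)
    return w, Lambda_inv

def pessimistic_q(phi_sa, w, Lambda_inv, beta):
    """Pessimistic Q-value: linear estimate minus an elliptical bonus
    beta * sqrt(phi^T Lambda^{-1} phi)."""
    bonus = np.sqrt(np.einsum('...i,ij,...j->...', phi_sa, Lambda_inv, phi_sa))
    return phi_sa @ w - beta * bonus
```

Setting `sigma2` to all ones recovers the equally weighted least-squares pessimistic value iteration that the abstract contrasts with; the variance reweighting is what distinguishes VAPVI from that baseline.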


