Off-Policy Evaluation in Partially Observed Markov Decision Processes

10/24/2021
by Yuchen Hu, et al.

We consider off-policy evaluation of dynamic treatment rules under the assumption that the underlying system can be modeled as a partially observed Markov decision process (POMDP). We propose an estimator, partial history importance weighting, and show that it consistently estimates the stationary mean reward of a target policy given sufficiently long trajectories drawn from the behavior policy. Furthermore, we establish an upper bound on its error that decays polynomially in the number of observations (i.e., the number of trajectories times their length), with an exponent that depends on the overlap between the target and behavior policies and on the mixing time of the underlying system. We also establish a polynomial minimax lower bound for off-policy evaluation under the POMDP assumption, and show that its exponent has the same qualitative dependence on overlap and mixing time as our upper bound. Together, the upper and lower bounds imply that off-policy evaluation in POMDPs is strictly harder than off-policy evaluation in (fully observed) Markov decision processes, but strictly easier than model-free off-policy evaluation.
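The core idea of partial history importance weighting is to reweight observed rewards by the likelihood ratio of the target and behavior policies over only the last k steps of history, interpolating between per-step weighting (as in fully observed MDPs) and full-trajectory weighting (as in model-free OPE). The following is a minimal sketch of that idea in Python; the function name, the memoryless observation-based policies, and the self-normalized form are illustrative assumptions for this sketch, not the paper's exact estimator.

```python
def partial_history_importance_weighting(trajectories, pi_target, pi_behavior, k):
    """Sketch of a partial-history importance weighting estimator.

    trajectories: list of trajectories, each a list of (obs, action, reward)
        tuples collected under the behavior policy.
    pi_target, pi_behavior: functions (obs, action) -> probability of taking
        `action` after observing `obs` (assumed memoryless in this sketch).
    k: length of the partial history used to form importance weights,
        chosen relative to the mixing time of the underlying system.
    """
    num, den = 0.0, 0.0
    for traj in trajectories:
        for t in range(k - 1, len(traj)):
            # Weight by the likelihood ratio over the last k steps only,
            # rather than the full history (model-free OPE) or a single
            # step (MDP OPE).
            w = 1.0
            for obs, action, _ in traj[t - k + 1 : t + 1]:
                w *= pi_target(obs, action) / pi_behavior(obs, action)
            num += w * traj[t][2]  # reward at time t
            den += w
    # Self-normalized estimate of the stationary mean reward under the target.
    return num / den
```

Heuristically, if each per-step likelihood ratio is bounded by some constant C (a strict-overlap assumption), the weights are at most C^k, so choosing k on the order of the mixing time trades the bias from truncating the history against polynomially growing variance; this is the kind of trade-off the upper bound above formalizes.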

Related research:

- 05/30/2023: Sharp high-probability sample complexities for policy evaluation with linear function approximation
- 05/19/2023: Off-policy evaluation beyond overlap: partial identification through smoothness
- 01/31/2022: Fundamental Performance Limits for Sensor-Based Robot Control and Policy Learning
- 10/28/2021: Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes
- 09/12/2014: On Minimax Optimal Offline Policy Evaluation
- 01/21/2017: Learning Policies for Markov Decision Processes from Data
- 12/19/2022: Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality
