Reliable Off-policy Evaluation for Reinforcement Learning

11/08/2020
by   Jie Wang, et al.

In a sequential decision-making problem, off-policy evaluation (OPE) estimates the expected cumulative reward of a target policy from logged transition data generated by a different behavior policy, without ever executing the target policy. Reinforcement learning in high-stakes environments, such as healthcare and education, is often limited to off-policy settings due to safety or ethical concerns, or because exploration is infeasible. It is therefore imperative to quantify the uncertainty of the off-policy estimate before the target policy is deployed. In this paper, we propose a novel framework that provides robust and optimistic cumulative reward estimates with statistical guarantees, and we develop non-asymptotic as well as asymptotic confidence intervals for OPE by leveraging methodologies from distributionally robust optimization. Our theoretical results are further supported by empirical analysis.
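For readers unfamiliar with the OPE setup, the sketch below shows a standard per-trajectory importance-sampling estimator: logged returns are reweighted by the likelihood ratio between the target and behavior policies. This is a common baseline for illustration only, not the distributionally robust estimator proposed in the paper; the trajectory format and policy interfaces are assumptions made for the example.

```python
import numpy as np

def importance_sampling_ope(trajectories, behavior_policy, target_policy, gamma=0.99):
    """Estimate the expected cumulative reward of `target_policy` from
    trajectories logged under `behavior_policy` (per-trajectory IS baseline).

    Assumed interfaces (illustrative, not from the paper):
      - each trajectory is a list of (state, action, reward) tuples
      - behavior_policy(s, a) and target_policy(s, a) return the
        probability of taking action `a` in state `s`
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Reweight by the likelihood ratio between target and behavior policies.
            weight *= target_policy(s, a) / behavior_policy(s, a)
            # Accumulate the discounted return of the logged trajectory.
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    # Average of reweighted returns is an unbiased estimate of the target
    # policy's expected cumulative reward (given full support of the behavior policy).
    return float(np.mean(estimates))
```

Quantifying the uncertainty of such point estimates, via non-asymptotic and asymptotic confidence intervals, is the focus of the paper.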
