Accountable Off-Policy Evaluation With Kernel Bellman Statistics

08/15/2020
by Yihao Feng, et al.

We consider off-policy evaluation (OPE), which estimates the performance of a new policy from data collected in previous experiments, without requiring the new policy to be executed. OPE has important applications in areas where executing a policy is costly or unsafe, such as medical diagnosis, recommendation systems, and robotics. In practice, because off-policy data carries limited information, it is highly desirable to construct rigorous confidence intervals, rather than just point estimates, for the policy's performance. In this work, we propose a new variational framework that reduces the problem of computing tight confidence bounds in OPE to an optimization problem over a feasible set that contains the true state-action value function with high probability. The feasible set is constructed by leveraging statistical properties of the recently proposed kernel Bellman loss (Feng et al., 2019). We design an efficient computational approach for calculating our bounds, and extend it to perform post-hoc diagnosis and correction of existing estimators. Empirical results show that our method yields tight confidence intervals across a range of settings.
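To make the variational framework concrete: an upper confidence bound takes the form of maximizing the estimated initial-state value E[q(s0, a0)] over all candidate value functions q whose empirical kernel Bellman loss stays below a concentration threshold eps (and a lower bound by minimizing). Below is a minimal sketch, not the authors' implementation, of the empirical kernel Bellman loss that defines this feasible set; the RBF kernel, the concatenated state-action features, and all function names are illustrative assumptions.

```python
# Minimal sketch of the empirical kernel Bellman loss from
# Feng et al. (2019), which underlies the feasible set
# F_eps = {q : L_K(q) <= eps} used for the confidence bounds.
# The RBF kernel and feature choices here are assumptions.
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    """RBF kernel matrix between two batches of state-action features."""
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_bellman_loss(q, transitions, gamma=0.99):
    """Empirical (V-statistic) kernel Bellman loss of a candidate value
    function q on off-policy transitions (s, a, r, s_next, a_next)."""
    s, a, r, s_next, a_next = transitions
    # Bellman residuals R_q(s, a) = q(s, a) - r - gamma * q(s', a').
    residual = q(s, a) - r - gamma * q(s_next, a_next)
    # Kernel over state-action pairs, here on concatenated features.
    sa = np.concatenate([s, a], axis=-1)
    K = rbf_kernel(sa, sa)
    n = len(r)
    # L_K(q) = (1/n^2) * sum_{i,j} R_q(x_i) K(x_i, x_j) R_q(x_j) >= 0,
    # and equals zero (in expectation) iff q satisfies the Bellman equation.
    return residual @ K @ residual / (n * n)
```

Given such a loss, the paper's bounds correspond, at a high level, to optimizing the value estimate subject to kernel_bellman_loss(q, data) <= eps, where eps comes from a high-probability concentration bound on the statistic, so the resulting interval covers the true policy value with the prescribed probability.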

Related research

03/09/2021 · Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds
Off-policy evaluation (OPE) is the task of estimating the expected rewar...

10/22/2020 · CoinDICE: Off-Policy Confidence Interval Estimation
We study high-confidence behavior-agnostic off-policy evaluation in rein...

12/07/2019 · Tighter Confidence Intervals for Rating Systems
Rating systems are ubiquitous, with applications ranging from product re...

07/27/2020 · Statistical Bootstrapping for Uncertainty Estimation in Off-Policy Evaluation
In reinforcement learning, it is typical to use the empirically observed...

06/20/2016 · Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation
For an autonomous agent, executing a poor policy may be costly or even d...

04/17/2019 · Robust Exploration with Tight Bayesian Plausibility Sets
Optimism about the poorly understood states and actions is the main driv...

01/25/2021 · High-Confidence Off-Policy (or Counterfactual) Variance Estimation
Many sequential decision-making systems leverage data collected using pr...
