From Importance Sampling to Doubly Robust Policy Gradient

10/20/2019
by Jiawei Huang et al.

We show that policy gradient (PG) and its variance reduction variants can be derived by taking finite differences of function evaluations supplied by estimators from the importance sampling (IS) family for off-policy evaluation (OPE). Starting from the doubly robust (DR) estimator [Jiang and Li, 2016], we give a simple derivation of a very general and flexible form of PG, which subsumes the state-of-the-art variance reduction technique [Cheng et al., 2019] as a special case and immediately hints at further variance reduction opportunities overlooked by the existing literature.
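
To make the finite-difference view concrete, below is a minimal sketch (not the authors' code) of the pipeline the abstract describes: a step-wise doubly robust OPE estimator in the style of Jiang and Li [2016] evaluates a target policy from behavior-policy trajectories, and a policy-gradient estimate is then obtained by finite differences of those value estimates under small parameter perturbations. The tabular softmax policy, the names softmax_policy, dr_estimate, finite_difference_pg, Q_hat, and the central-difference scheme are illustrative assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

def softmax_policy(theta, s):
    """Tabular softmax policy pi_theta(a|s); theta has shape (n_states, n_actions)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def dr_estimate(trajectory, theta, behavior_probs, Q_hat, gamma=0.99):
    """Step-wise DR estimate of V^{pi_theta}(s_0) from one behavior-policy trajectory.

    trajectory:     list of (s, a, r) tuples collected by the behavior policy mu
    behavior_probs: list of mu(a_t | s_t) for the actions actually taken
    Q_hat:          approximate Q-function, array of shape (n_states, n_actions)

    Recursion (Jiang and Li, 2016):
        V_DR^(t) = V_hat(s_t) + rho_t * (r_t + gamma * V_DR^(t+1) - Q_hat(s_t, a_t))
    """
    v_dr = 0.0
    for (s, a, r), mu in zip(reversed(trajectory), reversed(behavior_probs)):
        pi = softmax_policy(theta, s)
        rho = pi[a] / mu                  # per-step importance weight
        v_hat = np.dot(pi, Q_hat[s])      # V_hat(s_t) = E_{a ~ pi_theta}[Q_hat(s_t, a)]
        v_dr = v_hat + rho * (r + gamma * v_dr - Q_hat[s, a])
    return v_dr

def finite_difference_pg(trajectories, behavior_probs, theta, Q_hat, eps=1e-4):
    """Policy-gradient estimate via central finite differences of DR value estimates."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        for sign in (+1.0, -1.0):
            theta_pert = theta.copy()
            theta_pert[idx] += sign * eps
            v = np.mean([dr_estimate(tau, theta_pert, mu, Q_hat)
                         for tau, mu in zip(trajectories, behavior_probs)])
            grad[idx] += sign * v / (2.0 * eps)
    return grad

# Toy usage: 3 states, 2 actions, one short behavior-policy trajectory.
theta = np.zeros((3, 2))
Q_hat = np.random.rand(3, 2)
trajs = [[(0, 1, 1.0), (2, 0, 0.5)]]
mus   = [[0.5, 0.5]]
print(finite_difference_pg(trajs, mus, theta, Q_hat))
```

In practice one would differentiate the DR estimate analytically rather than perturb each coordinate; the sketch only illustrates the abstract's point that the derivative of an IS-family OPE estimator with respect to the target-policy parameters is itself a policy-gradient estimator.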

Related research

10/15/2019 · Understanding the Curse of Horizon in Off-Policy Evaluation via Conditional Importance Sampling
We establish a connection between the importance sampling estimators typ...

09/13/2021 · State Relevance for Off-Policy Evaluation
Importance sampling-based estimators for off-policy evaluation (OPE) are...

01/31/2022 · Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration
Policy gradient (PG) estimation becomes a challenge when we are not allo...

05/03/2023 · Enhancing Precision with the Local Pivotal Method: A General Variance Reduction Approach
The local pivotal method (LPM) is a successful sampling method for takin...

10/16/2019 · Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation
Infinite horizon off-policy policy evaluation is a highly challenging ta...

11/20/2015 · Variance Reduction in SGD by Distributed Importance Sampling
Humans are able to accelerate their learning by selecting training mater...

06/09/2019 · Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning
Off-policy evaluation (OPE) in both contextual bandits and reinforcement...