Model-free Policy Learning with Reward Gradients

03/09/2021
by Qingfeng Lan, et al.

Policy gradient methods estimate the gradient of a policy objective using either the likelihood ratio (LR) estimator or the reparameterization (RP) estimator. Many policy gradient methods based on the LR estimator can be unified under the policy gradient theorem (Sutton et al., 2000). However, no such unifying theorem exists for policy gradient methods based on the RP estimator. Moreover, no existing method requires and uses both estimators beyond a trivial interpolation between them. In this paper, we provide a theoretical framework that unifies several existing policy gradient methods based on the RP estimator. Utilizing our framework, we introduce a novel strategy to compute the policy gradient that, for the first time, incorporates both the LR and RP estimators and can be unbiased only when both estimators are present. Based on this strategy, we develop a new on-policy algorithm called the Reward Policy Gradient algorithm, which is the first model-free policy gradient method to utilize reward gradients. Using an idealized environment, we show that policy gradients based solely on the RP estimator for rewards are biased even with true rewards, whereas our combined estimator is not. Finally, we show that our method performs comparably with or outperforms Proximal Policy Optimization (PPO), an LR-based on-policy method, on several continuous control tasks.
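To make the two estimators named in the abstract concrete, the sketch below compares them on a hypothetical one-step problem. It is not the Reward Policy Gradient algorithm, and none of its names come from the paper. It assumes a Gaussian policy a ~ N(mu, sigma^2) and an illustrative quadratic reward r(a) = -(a - 2)^2, for which the true gradient of the expected reward with respect to mu is -2(mu - 2). The LR estimator weights the reward by the score function, while the RP estimator writes a = mu + sigma * eps and differentiates through the reward.

```python
import numpy as np

# Hypothetical one-step setting (an assumption for illustration only):
# Gaussian policy a ~ N(mu, sigma^2) and reward r(a) = -(a - 2)^2.

def reward(a):
    return -(a - 2.0) ** 2

def reward_grad(a):
    # Analytic derivative dr/da of the illustrative reward.
    return -2.0 * (a - 2.0)

def lr_gradient_samples(mu, sigma, n, rng):
    """Likelihood ratio (score function) samples of dJ/dmu."""
    a = rng.normal(mu, sigma, size=n)
    score = (a - mu) / sigma**2  # d/dmu of log N(a; mu, sigma^2)
    return reward(a) * score

def rp_gradient_samples(mu, sigma, n, rng):
    """Reparameterization samples of dJ/dmu using a = mu + sigma * eps."""
    eps = rng.standard_normal(n)
    a = mu + sigma * eps
    return reward_grad(a)  # chain rule: da/dmu = 1

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 1.0, 200_000

lr = lr_gradient_samples(mu, sigma, n, rng)
rp = rp_gradient_samples(mu, sigma, n, rng)
true_grad = -2.0 * (mu - 2.0)  # closed form for this reward

print("true gradient:", true_grad)
print("LR mean / variance:", lr.mean(), lr.var())
print("RP mean / variance:", rp.mean(), rp.var())
```

In this toy setting both sample means approach the true gradient, and the RP samples have much lower variance. The LR estimator needs only reward values, whereas the RP estimator needs the reward gradient.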


