Trust-PCL: An Off-Policy Trust Region Method for Continuous Control

07/06/2017
by Ofir Nachum, et al.

Trust region methods, such as TRPO, are often used to stabilize policy optimization algorithms in reinforcement learning (RL). While current trust region strategies are effective for continuous control, they typically require a prohibitively large amount of on-policy interaction with the environment. To address this problem, we propose an off-policy trust region method, Trust-PCL. The algorithm is the result of observing that the optimal policy and state values of a maximum reward objective with a relative-entropy regularizer satisfy a set of multi-step pathwise consistencies along any path. Thus, Trust-PCL is able to maintain optimization stability while exploiting off-policy data to improve sample efficiency. When evaluated on a number of continuous control tasks, Trust-PCL improves upon TRPO in both solution quality and sample efficiency.
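
The pathwise consistency mentioned in the abstract can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a small tabular MDP with deterministic transitions, a fixed prior policy, and a reward objective regularized only by the relative entropy to that prior with a hypothetical coefficient `lam`. All names, sizes, and values are illustrative assumptions. Under those assumptions, the optimal values `v` and optimal log-policy `log_pi` satisfy -V(s_0) + gamma^d V(s_d) + sum_i gamma^i (r_i - lam (log pi(a_i|s_i) - log prior(a_i|s_i))) = 0 along any action sequence, including ones the optimal policy would rarely choose, which is what makes off-policy data usable.

```python
import numpy as np

def soft_value_iteration(rewards, next_state, prior, lam, gamma, iters=2000):
    """Iterate V(s) = lam * log sum_a prior(a|s) * exp((r(s,a) + gamma*V(s')) / lam)."""
    v = np.zeros(rewards.shape[0])
    for _ in range(iters):
        z = (rewards + gamma * v[next_state]) / lam + np.log(prior)
        m = z.max(axis=1, keepdims=True)  # subtract the max for a stable log-sum-exp
        v = lam * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))
    q = rewards + gamma * v[next_state]
    # Optimal policy under this objective: pi(a|s) = prior(a|s) * exp((Q - V) / lam)
    log_pi = np.log(prior) + (q - v[:, None]) / lam
    return v, log_pi

def path_consistency_error(v, log_pi, log_prior, rewards, next_state,
                           s0, actions, lam, gamma):
    """Consistency error along the path that starts at s0 and follows `actions`."""
    s, total, disc = s0, 0.0, 1.0
    for a in actions:
        total += disc * (rewards[s, a] - lam * (log_pi[s, a] - log_prior[s, a]))
        s = next_state[s, a]
        disc *= gamma
    return -v[s0] + disc * v[s] + total

# Toy problem (assumed for illustration): 4 states, 3 actions.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
rewards = rng.normal(size=(n_states, n_actions))
next_state = rng.integers(0, n_states, size=(n_states, n_actions))
prior = rng.dirichlet(np.ones(n_actions), size=n_states)
lam, gamma = 0.5, 0.9

v, log_pi = soft_value_iteration(rewards, next_state, prior, lam, gamma)
path = [2, 0, 1]  # an arbitrary action sequence, not sampled from the policy
err = path_consistency_error(v, log_pi, np.log(prior), rewards, next_state,
                             0, path, lam, gamma)
# Shifting the values away from the optimum breaks the consistency.
err_bad = path_consistency_error(v + 1.0, log_pi, np.log(prior), rewards,
                                 next_state, 0, path, lam, gamma)
print("error with optimal values:", err)
print("error with shifted values:", err_bad)
```

In a learning setting, the squared value of such an error, evaluated with parameterized value and policy estimates on sub-trajectories drawn from a replay buffer, could serve as a training loss. With stochastic transitions the consistency would hold only in expectation over next states rather than exactly on each path.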

research
06/14/2020

Optimistic Distributionally Robust Policy Optimization

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization...
research
12/26/2019

Quasi-Newton Trust Region Policy Optimization

We propose a trust region method for policy optimization that employs Qu...
research
06/25/2023

Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

Trust-region methods based on Kullback-Leibler divergence are pervasivel...
research
09/27/2018

Boosting Trust Region Policy Optimization by Normalizing Flows Policy

We propose to improve trust region policy search with normalizing flows ...
research
02/07/2019

Compatible Natural Gradient Policy Search

Trust-region methods have yielded state-of-the-art results in policy sea...
research
12/19/2020

Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach

In order for reinforcement learning techniques to be useful in real-worl...
research
03/29/2021

Distributionally Robust Trajectory Optimization Under Uncertain Dynamics via Relative-Entropy Trust Regions

Trajectory optimization and model predictive control are essential techn...
