POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning

05/23/2022
by   Frederik Schubert, et al.

The goal of Unsupervised Reinforcement Learning (URL) is to find a reward-agnostic prior policy on a task domain, such that the sample efficiency on supervised downstream tasks is improved. Although agents initialized with such a prior policy can achieve a significantly higher reward with fewer samples when finetuned on the downstream task, it is still an open question how an optimal pretrained prior policy can be achieved in practice. In this work, we present POLTER (Policy Trajectory Ensemble Regularization), a general method for regularizing the pretraining that can be applied to any URL algorithm and is especially useful for data- and knowledge-based URL algorithms. It uses an ensemble of policies that are discovered during pretraining and moves the policy of the URL algorithm closer to its optimal prior. Our method is theoretically justified, and we analyze its practical effects on a white-box benchmark, which allows us to study POLTER with full control. In our main experiments, we evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19 best case. Under a fair comparison with tuned baselines and tuned POLTER, we establish a new state-of-the-art on the URLB.
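
The idea of pulling a policy toward an ensemble of earlier policies can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the policies are fixed discrete action distributions, the ensemble is a uniform average of stored snapshots, the regularizer is a KL divergence, and the names `ensemble_prior`, `kl_divergence`, `regularized_loss`, and `alpha` are hypothetical.

```python
import numpy as np


def ensemble_prior(snapshots):
    """Uniformly average the action distributions of stored policy snapshots."""
    return np.mean(np.stack(snapshots), axis=0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))


def regularized_loss(url_loss, policy, snapshots, alpha=1.0):
    """Add a penalty that grows as the policy drifts from the ensemble average.

    `url_loss` stands in for the loss of any URL algorithm; `alpha` is a
    hypothetical regularization weight.
    """
    return url_loss + alpha * kl_divergence(policy, ensemble_prior(snapshots))


# Two hypothetical snapshots collected during pretraining (3 discrete actions).
snapshots = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.2, 0.7])]
prior = ensemble_prior(snapshots)  # uniform average: [0.4, 0.2, 0.4]

# A policy equal to the ensemble average incurs no penalty.
loss_same = regularized_loss(1.0, np.array([0.4, 0.2, 0.4]), snapshots)

# A policy far from the ensemble average is penalized.
loss_far = regularized_loss(1.0, np.array([0.9, 0.05, 0.05]), snapshots)
```

In this toy setting the penalty vanishes only when the current policy matches the ensemble average, so minimizing the combined loss trades off the URL objective against staying close to the ensemble.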


