Positive-Unlabeled Reward Learning

11/01/2019
by Danfei Xu, et al.

Learning reward functions from data is a promising path towards achieving scalable Reinforcement Learning (RL) for robotics. However, a major challenge in training agents from learned reward models is that the agent can learn to exploit errors in the reward model to achieve high reward with behaviors that do not correspond to the intended task. These reward delusions can lead to unintended and even dangerous behaviors. On the other hand, adversarial imitation learning frameworks tend to suffer from the opposite problem: the discriminator learns to trivially distinguish agent and expert behavior, resulting in reward models that produce a low reward signal regardless of the input state. In this paper, we connect these two classes of reward learning methods to positive-unlabeled (PU) learning, and we show that by applying a large-scale PU learning algorithm to the reward learning problem, we can address both the reward under- and over-estimation problems simultaneously. Our approach drastically improves both GAIL and supervised reward learning, without any additional assumptions.
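
To give a rough sense of how PU learning can be applied to a GAIL-style discriminator, the sketch below uses the non-negative PU risk estimator of Kiryo et al. (2017), treating expert transitions as labeled positives and agent transitions as unlabeled rather than as guaranteed negatives. This is a minimal illustration, not the paper's exact implementation: the function name, the use of PyTorch, and the default class prior are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def pu_discriminator_loss(expert_logits, agent_logits, prior=0.5):
    """Non-negative PU risk estimator (Kiryo et al., 2017) applied to a
    GAIL-style discriminator (illustrative sketch, not the paper's code).

    Expert transitions are treated as labeled positives; agent transitions
    are treated as unlabeled data (a mix of expert-like and non-expert-like
    behavior) instead of being blindly labeled as negatives.
    """
    # Sigmoid cross-entropy written with softplus:
    # loss for label +1 is softplus(-logit); for label -1 it is softplus(+logit).
    risk_pos = F.softplus(-expert_logits).mean()        # positives labeled positive
    risk_pos_as_neg = F.softplus(expert_logits).mean()  # positives labeled negative
    risk_unl_as_neg = F.softplus(agent_logits).mean()   # unlabeled labeled negative

    # Estimated risk on the (unseen) negative class; clipped at zero so the
    # discriminator cannot drive it negative by overfitting the expert data.
    risk_neg = torch.clamp(risk_unl_as_neg - prior * risk_pos_as_neg, min=0.0)

    return prior * risk_pos + risk_neg
```

In this kind of setup, the discriminator's output on agent transitions would then be reused as the learned reward signal, with `prior` encoding the assumed fraction of agent behavior that is already expert-like.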

research · 07/20/2022
Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations
We study the problem of offline Imitation Learning (IL) where an agent a...

research · 12/09/2018
Dialogue Generation: From Imitation Learning to Inverse Reinforcement Learning
The performance of adversarial dialogue generation models relies on the ...

research · 05/23/2023
Video Prediction Models as Rewards for Reinforcement Learning
Specifying reward signals that allow agents to learn complex behaviors i...

research · 09/25/2022
Unsupervised Reward Shaping for a Robotic Sequential Picking Task from Visual Observations in a Logistics Scenario
We focus on an unloading problem, typical of the logistics sector, model...

research · 05/17/2021
Learning to Win, Lose and Cooperate through Reward Signal Evolution
Solving a reinforcement learning problem typically involves correctly pr...

research · 01/10/2022
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Reward hacking – where RL agents exploit gaps in misspecified reward fun...

research · 12/09/2022
On the Sensitivity of Reward Inference to Misspecified Human Models
Inferring reward functions from human behavior is at the center of value...