Zero-shot Preference Learning for Offline RL via Optimal Transport

by Runze Liu et al.

Preference-based Reinforcement Learning (PbRL) has demonstrated remarkable efficacy in aligning rewards with human intentions. However, a significant challenge is its need for substantial human labels, which are costly and time-consuming to collect. Moreover, the expensive preference data obtained on prior tasks is typically not reusable for subsequent task learning, so each new task requires extensive labeling. In this paper, we propose a novel zero-shot preference-based RL algorithm that leverages labeled preference data from source tasks to infer labels for target tasks, eliminating the need for human queries. Our approach uses the Gromov-Wasserstein distance to align trajectory distributions between the source and target tasks. The solved optimal transport matrix serves as a correspondence between the trajectories of the two tasks, making it possible to identify corresponding trajectory pairs and transfer preference labels across tasks. However, learning directly from inferred labels, a fraction of which are noisy, yields an inaccurate reward function and subsequently degrades policy performance. To this end, we introduce the Robust Preference Transformer, which models rewards as Gaussian distributions and incorporates reward uncertainty in addition to the reward mean. Empirical results on robotic manipulation tasks from Meta-World and Robomimic show that our method transfers preferences between tasks effectively and learns reward functions robustly from noisy labels. Furthermore, our method attains near-oracle performance with only a small proportion of scripted labels.
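The cross-task alignment step described in the abstract can be sketched as follows: compute intra-task distance matrices over trajectory embeddings for each task, solve an entropic Gromov-Wasserstein problem to obtain a transport matrix, and map each labeled source pair to the target pair receiving the most transport mass. This is a minimal illustrative sketch, not the authors' implementation; the solver (a simplified Peyré-et-al.-style projected-gradient scheme), all function names, and the toy trajectory embeddings are assumptions, and a real system would likely use a dedicated OT library and task-appropriate trajectory features.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iter=300):
    """Entropic OT between marginals a, b for a given cost matrix."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def gromov_wasserstein(C1, C2, eps=0.1, n_outer=30):
    """Entropic GW (square loss): alternate a linearized cost with a
    Sinkhorn projection. Row/column-constant terms of the square-loss
    cost are absorbed by the Sinkhorn scalings, so only the cross term
    -2 * C1 @ T @ C2.T is needed here."""
    n, m = C1.shape[0], C2.shape[0]
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    T = np.outer(a, b)
    for _ in range(n_outer):
        cost = -2.0 * C1 @ T @ C2.T
        cost -= cost.min()  # shift for numerical stability of exp(-cost/eps)
        T = sinkhorn(cost, a, b, eps)
    return T

def transfer_label(T, i_src, j_src, label):
    """Map a labeled source pair (i_src, j_src) to the target pair
    with the highest transport mass, carrying the label over."""
    return int(np.argmax(T[i_src])), int(np.argmax(T[j_src])), label

# Toy setup (illustrative assumption): 6 source trajectory embeddings and
# a permuted, slightly noisy copy standing in for the target task.
rng = np.random.default_rng(0)
src = rng.normal(size=(6, 4))
tgt = src[::-1] + 0.01 * rng.normal(size=(6, 4))
C1 = np.linalg.norm(src[:, None] - src[None, :], axis=-1)
C2 = np.linalg.norm(tgt[:, None] - tgt[None, :], axis=-1)
C1, C2 = C1 / C1.max(), C2 / C2.max()  # normalize intra-task distances

T = gromov_wasserstein(C1, C2)
print(transfer_label(T, i_src=0, j_src=1, label=0))
```

Because GW compares only intra-task distance structure, no shared feature space between the two tasks is required, which is what makes the transfer zero-shot with respect to target-task labels.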

