STRAPPER: Preference-based Reinforcement Learning via Self-training Augmentation and Peer Regularization

by   Yachen Kang, et al.

Preference-based reinforcement learning (PbRL) promises to learn a complex reward function with binary human preference. However, such human-in-the-loop formulation requires considerable human effort to assign preference labels to segment pairs, hindering its large-scale applications. Recent approache has tried to reuse unlabeled segments, which implicitly elucidates the distribution of segments and thereby alleviates the human effort. And consistency regularization is further considered to improve the performance of semi-supervised learning. However, we notice that, unlike general classification tasks, in PbRL there exits a unique phenomenon that we defined as similarity trap in this paper. Intuitively, human can have diametrically opposite preferredness for similar segment pairs, but such similarity may trap consistency regularization fail in PbRL. Due to the existence of similarity trap, such consistency regularization improperly enhances the consistency possiblity of the model's predictions between segment pairs, and thus reduces the confidence in reward learning, since the augmented distribution does not match with the original one in PbRL. To overcome such issue, we present a self-training method along with our proposed peer regularization, which penalizes the reward model memorizing uninformative labels and acquires confident predictions. Empirically, we demonstrate that our approach is capable of learning well a variety of locomotion and robotic manipulation behaviors using different semi-supervised alternatives and peer regularization.


SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning

Preference-based reinforcement learning (RL) has shown potential for tea...

A State Augmentation based approach to Reinforcement Learning from Human Preferences

Reinforcement Learning has suffered from poor reward specification, and ...

Reinforcement Learning from Diverse Human Preferences

The complexity of designing reward functions has been a major obstacle t...

Data Driven Reward Initialization for Preference based Reinforcement Learning

Preference-based Reinforcement Learning (PbRL) methods utilize binary fe...

Exploiting Unlabeled Data for Feedback Efficient Human Preference based Reinforcement Learning

Preference Based Reinforcement Learning has shown much promise for utili...

Learning from Label Proportions with Consistency Regularization

The problem of learning from label proportions (LLP) involves training c...

Zero-shot Preference Learning for Offline RL via Optimal Transport

Preference-based Reinforcement Learning (PbRL) has demonstrated remarkab...

Please sign up or login with your details

Forgot password? Click here to reset