Fine-Tuning Language Models with Advantage-Induced Policy Alignment

06/04/2023
by   Banghua Zhu, et al.

Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is one of the most widely used methods. Despite its popularity, however, PPO may suffer from mode collapse, instability, and poor sample efficiency. We show that these issues can be alleviated by a novel algorithm that we refer to as Advantage-Induced Policy Alignment (APA), which leverages a squared error loss function based on the estimated advantages. We demonstrate empirically that APA consistently outperforms PPO in language tasks by a large margin when a separate reward model is employed as the evaluator. In addition, compared with PPO, APA offers a more stable form of control over the deviation from the model's initial policy, ensuring that the model improves its performance without collapsing to deterministic output. Beyond the empirical results, we also provide a theoretical justification supporting the design of our loss function.
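To make the abstract's central idea concrete, below is a minimal sketch of what a squared-error, advantage-based alignment loss could look like. This is an illustrative reading of the abstract, not the paper's exact objective: the target formed by shifting the frozen initial policy's log-probabilities by scaled advantages, the regularization weight `lam`, and all tensor names (`logp_theta`, `logp_init`, `advantages`) are assumptions introduced here for illustration.

```python
# Illustrative sketch (assumption: not the paper's exact APA objective) of a
# squared-error policy-alignment loss driven by estimated advantages.
import torch


def apa_style_loss(logp_theta: torch.Tensor,
                   logp_init: torch.Tensor,
                   advantages: torch.Tensor,
                   lam: float = 1.0) -> torch.Tensor:
    """Squared error between the current policy's log-probabilities and a
    target built from the frozen initial policy's log-probabilities shifted
    by the scaled estimated advantages. A larger `lam` keeps the fine-tuned
    policy closer to its initialization."""
    target = logp_init + advantages / lam          # advantage-induced target (sketch)
    return torch.mean((logp_theta - target.detach()) ** 2)


# Usage sketch with random stand-in values (hypothetical shapes and data):
if __name__ == "__main__":
    torch.manual_seed(0)
    logp_theta = torch.randn(8, requires_grad=True)  # log pi_theta(a|s) for sampled tokens
    logp_init = torch.randn(8)                       # log pi_init(a|s) from the frozen reference model
    advantages = torch.randn(8)                      # advantage estimates from a critic / reward model
    loss = apa_style_loss(logp_theta, logp_init, advantages, lam=0.1)
    loss.backward()
    print(float(loss))
```

Compared with PPO's clipped surrogate, a squared-error form like this regresses the policy toward an advantage-weighted target anchored at the initial policy, which is one way to read the abstract's claim of more stable control over deviation from the initialization.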


Related research

08/23/2023 · Aligning Language Models with Offline Reinforcement Learning from Human Feedback
Learning from human preferences is crucial for language models (LMs) to ...

04/11/2023 · RRHF: Rank Responses to Align Language Models with Human Feedback without tears
Reinforcement Learning from Human Feedback (RLHF) facilitates the alignm...

09/13/2023 · Statistical Rejection Sampling Improves Preference Optimization
Improving the alignment of language models with human preferences remain...

08/10/2023 · Proximal Policy Optimization Actual Combat: Manipulating Output Tokenizer Length
The Reinforcement Learning from Human Feedback (RLHF) plays a pivotal ro...

03/02/2021 · Minimax Model Learning
We present a novel off-policy loss function for learning a transition mo...

09/01/2023 · Efficient RLHF: Reducing the Memory Usage of PPO
Reinforcement Learning with Human Feedback (RLHF) has revolutionized lan...

04/25/2023 · Stable and low-precision training for large-scale vision-language models
We introduce new methods for 1) accelerating and 2) stabilizing training...
