RRHF: Rank Responses to Align Language Models with Human Feedback without tears

04/11/2023
by Zheng Yuan, et al.

Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and these models. InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). PPO, however, is sensitive to hyperparameters and requires a minimum of four models in its standard implementation, which makes it hard to train. In contrast, we propose a novel learning paradigm called RRHF, which scores responses generated by different sampling policies and learns to align them with human preferences through a ranking loss. RRHF can efficiently align language model output probabilities with human preferences as robustly as fine-tuning, while requiring only one or two models during tuning. In addition, RRHF can be considered an extension of SFT and reward model training, while being simpler than PPO in terms of coding, number of models, and hyperparameters. The entire alignment process can be accomplished within a single RRHF training session. We evaluate RRHF using LLaMA and Alpaca on Helpful and Harmless data, demonstrating performance comparable to PPO.
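To make the ranking objective concrete, the sketch below (a minimal PyTorch illustration, not the authors' released code) shows one way such a loss can be written: each candidate response receives a scalar score, its length-normalized log-probability under the tuned model is compared pairwise against the others, mis-ordered pairs incur a hinge penalty, and a cross-entropy-style term keeps the best-scored response likely. The function name, argument names, and toy numbers are assumptions made for illustration.

```python
import torch

def rrhf_loss(logprobs, reward_scores, lengths):
    """Minimal sketch of an RRHF-style objective (illustrative only).

    logprobs      -- (k,) summed token log-probabilities of k candidate responses
                     under the model being tuned
    reward_scores -- (k,) scalar scores for the same responses (e.g. human
                     preference labels or a reward model)
    lengths       -- (k,) token counts, used for length normalization
    """
    # Length-normalized conditional log-probability of each response.
    p = logprobs / lengths

    # rank_mask[i, j] is True when response j is scored higher than response i.
    rank_mask = reward_scores.unsqueeze(1) < reward_scores.unsqueeze(0)

    # Hinge penalty whenever a lower-scored response gets a higher
    # normalized probability than a higher-scored one.
    diff = p.unsqueeze(1) - p.unsqueeze(0)          # diff[i, j] = p_i - p_j
    rank_loss = torch.relu(diff)[rank_mask].sum()

    # Cross-entropy-style term on the best-scored response (SFT-like anchor).
    ft_loss = -logprobs[reward_scores.argmax()]

    return rank_loss + ft_loss


# Toy usage with made-up numbers: four sampled responses for one prompt.
logprobs = torch.tensor([-42.0, -35.0, -58.0, -40.0])
scores   = torch.tensor([0.1, 0.8, -0.3, 0.5])
lengths  = torch.tensor([30.0, 28.0, 45.0, 33.0])
print(rrhf_loss(logprobs, scores, lengths))
```

Because the objective only compares log-probabilities the model already computes for sampled responses, no separate value network or PPO-style rollout machinery is involved, which is where the one-to-two-model count in the abstract comes from.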
