Training language models to follow instructions with human feedback

by Long Ouyang et al.

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
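The abstract describes training a reward model on human rankings of model outputs and then fine-tuning the policy against it. As a minimal illustration of the comparison-based step, here is a sketch of the standard pairwise ranking loss used to fit a reward model to human preferences, -log σ(r_chosen − r_rejected); the function name and scalar inputs are hypothetical stand-ins for the paper's learned reward scores.

```python
import math

def reward_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss for reward-model training:
    -log(sigmoid(r_chosen - r_rejected)).

    The loss shrinks as the model assigns a higher score to the
    output the human labeler preferred (r_chosen) relative to the
    one they rejected (r_rejected)."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wider margin in favor of the preferred output yields a smaller loss.
print(reward_ranking_loss(2.0, 0.5) < reward_ranking_loss(0.5, 2.0))  # True
```

In practice this loss is summed over many labeled comparisons per prompt, and the fitted reward model then supplies the scalar reward that the reinforcement-learning stage optimizes.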

Related papers:

- Fine-tuning Language Models with Generative Adversarial Feedback
- Aligning Large Language Models through Synthetic Feedback
- Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets
- The Capacity for Moral Self-Correction in Large Language Models
- Languages are Rewards: Chain of Hindsight Finetuning using Human Feedback
- Task Ambiguity in Humans and Language Models