Making Large Language Models Better Reasoners with Alignment

by Peiyi Wang, et al.

Reasoning is the cognitive process of using evidence to reach a sound conclusion. Reasoning capability is essential for large language models (LLMs) to serve as the brain of an artificial general intelligence agent. Recent studies reveal that fine-tuning LLMs on data containing chain of thought (COT) reasoning processes can significantly enhance their reasoning capabilities. However, we find that the fine-tuned LLMs suffer from an Assessment Misalignment problem, i.e., they frequently assign higher scores to subpar COTs, which can limit their reasoning abilities. To address this problem, we introduce an Alignment Fine-Tuning (AFT) paradigm, which involves three steps: 1) fine-tuning LLMs with COT training data; 2) generating multiple COT responses for each question and categorizing them as positive or negative based on whether they reach the correct answer; 3) calibrating the scores that the LLM assigns to positive and negative responses with a novel constraint alignment loss. Specifically, the constraint alignment loss has two objectives: a) Alignment, which guarantees that positive scores surpass negative scores, encouraging answers with high-quality COTs; b) Constraint, which keeps the negative scores confined to a reasonable range to prevent model degradation. Beyond binary positive and negative feedback, the constraint alignment loss can be seamlessly adapted to ranking settings when ranking feedback is accessible. Furthermore, we examine recent ranking-based alignment methods, such as DPO, RRHF, and PRO, and discover that the constraint, which these approaches overlook, is also crucial for their performance. Extensive experiments on four reasoning benchmarks with both binary and ranking feedback demonstrate the effectiveness of AFT.
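To make steps 2 and 3 concrete, the following is a minimal sketch of the two objectives, not the paper's exact formulation. It assumes each sampled response already carries a scalar score from the model (for example, a length-normalized log-probability). The function names, the pairwise log-sum-exp form of the alignment term, the hinge form of the constraint term, and the `floor` and `weight` parameters are all illustrative assumptions.

```python
import math

def split_by_answer(responses, gold_answer):
    """Step 2 (sketch): partition scored responses by final-answer correctness.

    `responses` is a list of (final_answer, score) pairs; this layout is a
    hypothetical choice made for the example.
    """
    pos = [score for answer, score in responses if answer == gold_answer]
    neg = [score for answer, score in responses if answer != gold_answer]
    return pos, neg

def alignment_term(pos_scores, neg_scores):
    """Assumed alignment objective: log(1 + sum over pairs of exp(s_neg - s_pos)).

    The value shrinks toward 0 as every positive score exceeds every
    negative score by a growing margin.
    """
    total = sum(math.exp(n - p) for p in pos_scores for n in neg_scores)
    return math.log1p(total)

def constraint_term(neg_scores, floor):
    """Assumed constraint objective: a hinge penalty that is nonzero only
    when a negative score drops below `floor`, so negatives are not pushed
    down without bound."""
    return sum(max(0.0, floor - n) for n in neg_scores)

def constrained_alignment_loss(pos_scores, neg_scores, floor, weight=1.0):
    """Step 3 (sketch): alignment term plus a weighted constraint term."""
    return (alignment_term(pos_scores, neg_scores)
            + weight * constraint_term(neg_scores, floor))

# Toy usage with made-up answers and scores.
responses = [("42", -0.2), ("42", -0.3), ("41", -1.5), ("40", -1.2)]
pos, neg = split_by_answer(responses, gold_answer="42")
aligned = constrained_alignment_loss(pos, neg, floor=-2.0)
# Swapping the roles simulates a model that scores wrong answers higher.
misaligned = constrained_alignment_loss(neg, pos, floor=-2.0)
```

In this toy setup `aligned` is smaller than `misaligned`, because the correct responses already score higher than the incorrect ones. The hinge stays at zero as long as the negative scores remain above `floor`, which is one simple way to express "confined to a reasonable range".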


