Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting

by Rylan Schaeffer, et al.

Language models can be prompted to reason through problems in a manner that significantly improves performance. However, why such prompting improves performance is unclear. Recent work showed that using logically invalid Chain-of-Thought (CoT) prompting improves performance almost as much as logically valid CoT prompting, and that editing CoT prompts to replace problem-specific information with abstract or out-of-distribution information typically does not harm performance. Critics have responded that these findings are based on too few and too easily solved tasks to support meaningful conclusions. To resolve this dispute, we test whether logically invalid CoT prompts offer the same level of performance gains as logically valid prompts on the hardest tasks in the BIG-Bench benchmark, termed BIG-Bench Hard (BBH). We find that logically invalid reasoning prompts do indeed achieve performance gains on BBH tasks similar to those of logically valid reasoning prompts. We also discover that some CoT prompts used by previous works contain logical errors. This suggests that covariates beyond logically valid reasoning are responsible for the performance improvements.
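To make the comparison concrete, here is a minimal illustrative sketch (not the authors' code; the example task and reasoning strings are invented) of the kind of manipulation the abstract describes: two few-shot CoT prompts that share the same question and the same correct final answer, but whose intermediate reasoning is logically valid in one case and logically invalid in the other.

```python
# Hypothetical example: building a valid-CoT and an invalid-CoT few-shot
# prompt for the same exemplar question. Both exemplars end with the
# correct answer; only the intermediate reasoning differs.

EXEMPLAR_QUESTION = "Q: Tom has 3 apples and buys 2 more. How many apples does he have?"

VALID_COT = (
    "A: Tom starts with 3 apples. Buying 2 more gives 3 + 2 = 5. "
    "The answer is 5."
)

# The invalid variant keeps the surface form of step-by-step reasoning,
# but its steps do not logically support the conclusion.
INVALID_COT = (
    "A: Tom starts with 2 apples. Apples are red, so 3 - 2 = 5. "
    "The answer is 5."
)

def build_prompt(exemplar_reasoning: str, test_question: str) -> str:
    """Prepend one few-shot exemplar (question + reasoning) to a test question."""
    return f"{EXEMPLAR_QUESTION}\n{exemplar_reasoning}\n\n{test_question}\nA:"

test_q = "Q: Sara has 4 pens and loses 1. How many pens remain?"
valid_prompt = build_prompt(VALID_COT, test_q)
invalid_prompt = build_prompt(INVALID_COT, test_q)
```

Under the paper's finding, feeding either prompt to a model would yield similar accuracy gains over a no-reasoning baseline, which is what motivates looking for explanations other than logical validity.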

