When do you need Chain-of-Thought Prompting for ChatGPT?

by Jiuhai Chen, et al.

Chain-of-Thought (CoT) prompting can effectively elicit complex multi-step reasoning from Large Language Models (LLMs). For example, simply adding the CoT instruction “Let's think step-by-step” to each input query of the MultiArith dataset improves GPT-3's accuracy from 17.7% to 78.7%. However, it is not clear whether CoT is still effective on more recent instruction-finetuned (IFT) LLMs such as ChatGPT. Surprisingly, on ChatGPT, CoT is no longer effective for certain tasks such as arithmetic reasoning, while it remains effective on other reasoning tasks. Moreover, on the former tasks, ChatGPT usually achieves the best performance and can generate CoT even without being instructed to do so. Hence, it is plausible that ChatGPT has already been trained on these tasks with CoT and has thus memorized the instruction, so that it implicitly follows the instruction when applied to the same queries, even without a CoT prompt. Our analysis reflects a potential risk of overfitting or bias toward the instructions introduced in IFT, which is becoming more common in training LLMs. In addition, it indicates possible leakage of the pretraining recipe: for example, one can verify whether a dataset and instruction were used in training ChatGPT. Our experiments report new baseline results for ChatGPT on a variety of reasoning tasks and offer new insights into LLM profiling, instruction memorization, and pretraining dataset leakage.
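The comparison described in the abstract, the same query with and without a CoT instruction, can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names, the newline-joined prompt template, and the sample question are all assumptions made for this example, and no model API is called.

```python
# Minimal sketch: build the two prompt variants (with and without the CoT
# instruction) for a single query. The template and names are hypothetical.

COT_INSTRUCTION = "Let's think step-by-step."


def build_prompt(query: str, use_cot: bool) -> str:
    """Return the query, optionally followed by the CoT instruction."""
    query = query.strip()
    if use_cot:
        # Assumed template: instruction appended on its own line.
        return f"{query}\n{COT_INSTRUCTION}"
    return query


def build_prompt_pair(query: str) -> dict:
    """Return both variants so the same query can be evaluated under each condition."""
    return {
        "no_cot": build_prompt(query, use_cot=False),
        "cot": build_prompt(query, use_cot=True),
    }


# Hypothetical arithmetic question, not taken from MultiArith.
example_query = (
    "Tom has 3 boxes with 4 pencils each. He gives away 5 pencils. "
    "How many pencils does he have left?"
)
prompts = build_prompt_pair(example_query)
```

Sending both variants to a model and comparing accuracy per task is what would reveal whether the explicit instruction still helps; if the model already produces step-by-step reasoning for the `no_cot` variant, that matches the behaviour the abstract reports for ChatGPT on arithmetic tasks.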




LogiCoT: Logical Chain-of-Thought Instruction-Tuning Data Collection with GPT-4

Generative Pre-trained Transformer 4 (GPT-4) demonstrates impressive cha...

The Wisdom of Hindsight Makes Language Models Better Instruction Followers

Reinforcement learning has seen wide success in finetuning large languag...

Dissecting Chain-of-Thought: A Study on Compositional In-Context Learning of MLPs

Chain-of-thought (CoT) is a method that enables language models to handl...

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

We introduce MAmmoTH, a series of open-source large language models (LLM...

Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction

Despite constituting 65 underrepresented in generative AI research. Mean...

Prompt Space Optimizing Few-shot Reasoning Success with Large Language Models

Prompt engineering is an essential technique for enhancing the abilities...

Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions

In recent years, progress in NLU has been driven by benchmarks. These be...
