Specializing Smaller Language Models towards Multi-Step Reasoning

by Yao Fu, et al.
Allen Institute for Artificial Intelligence

The surprising ability of Large Language Models (LLMs) to perform well on complex reasoning with only few-shot chain-of-thought prompts is believed to emerge only in very large-scale models (100+ billion parameters). We show that such abilities can, in fact, be distilled down from GPT-3.5 (≥ 175B) to T5 variants (≤ 11B). We propose model specialization: concentrating a model's ability on a target task. Our hypothesis is that large models (commonly viewed as larger than 100B) have strong modeling power, but that power is spread across a large spectrum of tasks. Small models (commonly viewed as smaller than 10B) have limited capacity, but if we concentrate that capacity on a specific target task, they can achieve decently improved performance. We use multi-step math reasoning as our testbed because it is a very typical emergent ability. We show two important aspects of model abilities: (1) there exists a very complex balance/tradeoff between language models' multi-dimensional abilities; (2) by paying the price of decreased generic ability, we can clearly lift the scaling curve of models smaller than 10B towards a specialized multi-step math reasoning ability. We further give comprehensive discussions of important design choices for better generalization, including the tuning data format, the starting model checkpoint, and a new model selection method. We hope our practice and discoveries can serve as an important attempt towards specialized smaller models in the new research paradigm set by LLMs.




Learning Multi-Step Reasoning by Solving Arithmetic Tasks

Mathematical reasoning is regarded as a necessary ability for Language M...

The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code

Causal reasoning, the ability to identify cause-and-effect relationship,...

Are Emergent Abilities of Large Language Models a Mirage?

Recent work claims that large language models display emergent abilities...

Transcending Scaling Laws with 0.1

Scaling language models improves performance but comes with significant ...

Beyond Classification: Financial Reasoning in State-of-the-Art Language Models

Large Language Models (LLMs), consisting of 100 billion or more paramete...

Textbooks Are All You Need II: phi-1.5 technical report

We continue the investigation into the power of smaller Transformer-base...
