Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale

by Hritik Bansal, et al.

Language models have been shown to perform better with an increase in scale on a wide variety of tasks via the in-context learning paradigm. In this paper, we investigate the hypothesis that the ability of a large language model to perform a task via in-context learning is not uniformly spread across all of its underlying components. Using a 66 billion parameter language model (OPT-66B) across a diverse set of 14 downstream tasks, we find this is indeed the case: ∼70% of attention heads and ∼20% of feed-forward networks can be removed with minimal decline in task performance. We find substantial overlap in the set of attention heads (un)important for in-context learning across tasks and across numbers of in-context examples. We also address our hypothesis through a task-agnostic lens, finding that a small set of attention heads in OPT-66B score highly on their ability to perform primitive induction operations associated with in-context learning, namely, prefix matching and copying. These induction heads overlap with task-specific important heads, suggesting that induction heads are among the heads capable of more sophisticated behaviors associated with in-context learning. Overall, our study provides several insights indicating that large language models may be under-trained to perform in-context learning, and it opens up questions on how to pre-train language models to perform in-context learning more effectively.
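As a rough illustration of the prefix-matching operation mentioned above, the toy sketch below scores a single attention head on a repeated token sequence: for each token that has appeared before, it sums the attention placed on the positions immediately following the earlier occurrences, then averages over such tokens. This is not the paper's implementation; the function name `prefix_matching_score`, the list-of-lists attention format, and the example matrices are all assumptions made for illustration.

```python
def prefix_matching_score(tokens, attention):
    """Toy prefix-matching score for one attention head (illustrative only).

    tokens: list of token ids of length n.
    attention: n x n nested list; attention[i][j] is the weight that
        query position i places on key position j (causal, so j <= i).
    """
    per_position = []
    for i, tok in enumerate(tokens):
        # Positions right after earlier occurrences of the current token.
        targets = [j + 1 for j in range(i) if tokens[j] == tok]
        if targets:
            per_position.append(sum(attention[i][t] for t in targets))
    # Average over positions whose token has appeared before.
    return sum(per_position) / len(per_position) if per_position else 0.0


# A short sequence repeated twice (hypothetical token ids).
tokens = [1, 2, 3, 1, 2, 3]
n = len(tokens)

# Hypothetical "ideal" induction head: in the second repeat, each token
# attends fully to the token that followed its first occurrence;
# elsewhere it attends to itself.
ideal = [[0.0] * n for _ in range(n)]
for i in range(n):
    ideal[i][i - 2 if i >= 3 else i] = 1.0

# Baseline: uniform causal attention over all positions up to i.
uniform = [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]

print(prefix_matching_score(tokens, ideal))
print(prefix_matching_score(tokens, uniform))
```

Under these assumptions, the idealized head scores much higher than the uniform baseline, which is the kind of gap a prefix-matching analysis looks for when ranking heads.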




MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models

Large-scale language models have shown the ability to adapt to a new tas...

In-context Learning and Induction Heads

"Induction heads" are attention heads that implement a simple algorithm ...

Instruction Induction: From Few Examples to Natural Language Task Descriptions

Large language models are able to perform a task by conditioning on a fe...

Scaling Laws and Interpretability of Learning from Repeated Data

Recent large language models have been trained on vast datasets, but als...

Augmenting Large Language Model Translators via Translation Memories

Using translation memories (TMs) as prompts is a promising approach to i...

In-Context Learning of Large Language Models Explained as Kernel Regression

Large language models (LLMs) have initiated a paradigm shift in transfer...

A Mechanism for Sample-Efficient In-Context Learning for Sparse Retrieval Tasks

We study the phenomenon of in-context learning (ICL) exhibited by large ...
