Learning by Distilling Context

by Charlie Snell, et al.

Language models significantly benefit from context tokens, such as prompts or scratchpads. They perform better when prompted with informative instructions, and they acquire new reasoning capabilities by generating a scratchpad before predicting the final answer. However, they do not internalize these performance gains, which disappear once the context tokens are gone. Our work proposes to apply context distillation so that a language model can improve itself by internalizing these gains. Concretely, given a synthetic unlabeled input for the target task, we condition the model on “[instructions] + [task-input]” to predict “[scratchpad] + [final answer]”; then we fine-tune the same model to predict its own “[final answer]” conditioned on the “[task-input]”, without seeing the “[instructions]” or using the “[scratchpad]”. We show that context distillation is a general method to train language models, and it can effectively internalize three types of training signals. First, it can internalize abstract task instructions and explanations, so we can iteratively update the model parameters with new instructions and overwrite old ones. Second, it can internalize step-by-step reasoning for complex tasks (e.g., 8-digit addition), and such a newly acquired capability proves to be useful for other downstream tasks. Finally, it can internalize concrete training examples, and it outperforms directly learning with gradient descent by 9% on the SPIDER Text-to-SQL dataset; furthermore, combining context distillation operations can internalize more training examples than the context window size allows.
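The data-construction step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the prompt layout and the "Final answer:" delimiter are assumptions for the example, and the actual teacher generation and fine-tuning loop are omitted.

```python
# Sketch of the context-distillation data pipeline: the teacher is prompted
# with instructions + task input, and the student is trained to map the bare
# task input to the final answer, with instructions and scratchpad stripped.

def build_teacher_prompt(instructions: str, task_input: str) -> str:
    """Teacher-side prompt: "[instructions] + [task-input]"."""
    return f"{instructions}\n{task_input}"

def build_student_example(task_input: str, teacher_output: str) -> tuple:
    """Student-side training pair: "[task-input]" -> "[final answer]".

    Assumes (for illustration only) that the teacher's output has the form
    "[scratchpad]\nFinal answer: <answer>"; the scratchpad is discarded.
    """
    answer = teacher_output.rsplit("Final answer:", 1)[-1].strip()
    return task_input, answer

# Toy usage with a hand-written "teacher output":
prompt = build_teacher_prompt("Add the two numbers step by step.", "42 + 58 = ?")
teacher_out = "42 + 58: 2+8=10, carry 1; 4+5+1=10.\nFinal answer: 100"
x, y = build_student_example("42 + 58 = ?", teacher_out)
print((x, y))  # -> ('42 + 58 = ?', '100')
```

The student pair `(x, y)` would then be used for ordinary fine-tuning, so the instructions and scratchpad never appear in the student's context.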




Related papers:

Probing in Context: Toward Building Robust Classifiers via Probing Large Language Models

Improving Cross-Task Generalization with Step-by-Step Instructions

Large Language Models Are Reasoning Teachers

Can language models learn from explanations in context?

Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA

Meta-learning via Language Model In-context Tuning

Learning to Compress Prompts with Gist Tokens
