Scaling Laws for Neural Language Models

01/23/2020
by Jared Kaplan, et al.

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
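To make the power-law dependence concrete, the sketch below fits a law of the form L(N) = (N_c / N)^alpha_N to a handful of (parameter count, loss) pairs by linear regression in log-log space. The data points are hypothetical and serve only to exercise the fitting procedure, not to reproduce the paper's measurements; the paper's own fits (reporting an exponent of roughly 0.076 for model size) were obtained from much larger training sweeps.

import numpy as np

def fit_power_law(sizes, losses):
    """Fit L(N) = (N_c / N)**alpha by least squares in log-log space.

    log L = alpha * log N_c - alpha * log N, i.e. a line in log N
    with slope -alpha and intercept alpha * log N_c.
    """
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    slope, intercept = np.polyfit(log_n, log_l, deg=1)
    alpha = -slope
    n_c = np.exp(intercept / alpha)
    return alpha, n_c

# Hypothetical (non-embedding parameter count, converged test loss) pairs,
# used only to illustrate the fit; they are not measurements from the paper.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0, 4.2, 3.6, 3.0]

alpha_n, n_c = fit_power_law(sizes, losses)
print(f"alpha_N ~ {alpha_n:.3f}, N_c ~ {n_c:.3e}")

# Once fitted, the same form extrapolates loss to larger models,
# which is how such laws inform compute-budget allocation.
predicted = (n_c / 1e10) ** alpha_n
print(f"extrapolated loss at 1e10 params: {predicted:.2f}")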


Related research

Scaling Laws for Acoustic Models (06/11/2021)
There is a recent trend in machine learning to increase model quality by...

Is the Number of Trainable Parameters All That Actually Matters? (09/24/2021)
Recent work has identified simple empirical scaling laws for language mo...

A Neural Scaling Law from the Dimension of the Data Manifold (04/22/2020)
When data is plentiful, the loss achieved by well-trained neural network...

On the Predictability of Pruning Across Scales (06/18/2020)
We show that the error of magnitude-pruned networks follows a scaling la...

Scaling Laws for Autoregressive Generative Modeling (10/28/2020)
We identify empirical scaling laws for the cross-entropy loss in four do...

Scaling Laws for Transfer (02/02/2021)
We study empirical scaling laws for transfer learning between distributi...

Machine Learning Model Sizes and the Parameter Gap (07/05/2022)
We study trends in model size of notable machine learning systems over t...
