The EOS Decision and Length Extrapolation

by   Benjamin Newman, et al.

Extrapolation to unseen sequence lengths is a challenge for neural generative models of language. In this work, we characterize the effect on length extrapolation of a modeling decision often overlooked: predicting the end of the generative process through the use of a special end-of-sequence (EOS) vocabulary item. We study an oracle setting - forcing models to generate to the correct sequence length at test time - to compare the length-extrapolative behavior of networks trained to predict EOS (+EOS) with networks not trained to (-EOS). We find that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen at training time in a bracket closing task, as well as achieving a 40 the difficult SCAN dataset length generalization task. By comparing the hidden states and dynamics of -EOS and +EOS models, we observe that +EOS models fail to generalize because they (1) unnecessarily stratify their hidden states by their linear position is a sequence (structures we call length manifolds) or (2) get stuck in clusters (which we refer to as length attractors) once the EOS token is the highest-probability prediction.


Dynamic Prediction Length for Time Series with Sequence to Sequence Networks

Recurrent neural networks and sequence to sequence models require a pred...

Giraffe: Adventures in Expanding Context Lengths in LLMs

Modern large language models (LLMs) that rely on attention mechanisms ar...

Constructing sparse Davenport-Schinzel sequences by hypergraph edge coloring

A sequence is called r-sparse if every contiguous subsequence of length ...

Length Generalization in Arithmetic Transformers

We examine how transformers cope with two challenges: learning basic int...

State-Reification Networks: Improving Generalization by Modeling the Distribution of Hidden Representations

Machine learning promises methods that generalize well from finite label...

Auto-Regressive Next-Token Predictors are Universal Learners

Large language models display remarkable capabilities in logical and mat...

How Many Pages? Paper Length Prediction from the Metadata

Being able to predict the length of a scientific paper may be helpful in...

Please sign up or login with your details

Forgot password? Click here to reset