Pretraining on the Test Set Is All You Need

by Rylan Schaeffer, et al.

Inspired by recent work demonstrating the promise of smaller Transformer-based language models pretrained on carefully curated data, we supercharge such approaches by investing heavily in curating a novel, high-quality, non-synthetic data mixture based solely on evaluation benchmarks. Using our novel dataset mixture consisting of fewer than 100 thousand tokens, we pretrain a 1 million parameter Transformer-based LLM, phi-CTNL (pronounced “fictional”), that achieves perfect results across diverse academic benchmarks, strictly outperforming all known foundation models. phi-CTNL also beats power-law scaling and exhibits a never-before-seen grokking-like ability to accurately predict downstream evaluation benchmarks' canaries.


EarthPT: a foundation model for Earth Observation

We introduce EarthPT – an Earth Observation (EO) pretrained transformer....

The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute

The Languini Kitchen serves as both a research collective and codebase d...

Duluth at SemEval-2020 Task 7: Using Surprise as a Key to Unlock Humorous Headlines

We use pretrained transformer-based language models in SemEval-2020 Task...

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Large language models are commonly trained on a mixture of filtered web ...

HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

Genomic (DNA) sequences encode an enormous amount of information for gen...

A Family of Pretrained Transformer Language Models for Russian

Nowadays, Transformer language models (LMs) represent a fundamental comp...

Uncontrolled Lexical Exposure Leads to Overestimation of Compositional Generalization in Pretrained Models

Human linguistic capacity is often characterized by compositionality and...
