INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Large Language Models

A salient characteristic of large pre-trained language models (PTLMs) is the remarkable improvement in their generalization capability, and the emergence of new capabilities, with increasing model capacity and pre-training dataset size. Consequently, we are witnessing the development of enormous models that push the state of the art. It is, however, imperative to realize that this inevitably leads to prohibitively long training times, exorbitant computing costs, and a detrimental environmental impact. Significant efforts are underway to make PTLM training more efficient through innovations in model architectures, training pipelines, and loss function design, while scant attention is paid to optimizing the utility of the training data. The key question we ask is whether it is possible to train PTLMs using only highly informative subsets of the training data while maintaining downstream performance. Building upon recent progress in informative data subset selection, we show how submodular optimization can be employed to select highly representative subsets of the training corpora. Our results demonstrate that the proposed framework can be applied to efficiently train multiple PTLMs (BERT, BioBERT, GPT-2) using only a fraction of the data while retaining up to ∼99% of the performance of the fully trained models.
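
As a rough illustration of the kind of subset selection the abstract refers to, the sketch below greedily maximizes a facility-location objective (a standard monotone submodular function) over document embeddings. This is a minimal, assumption-laden example, not the authors' implementation; the function names, embedding source, and budget are hypothetical.

```python
# Minimal sketch (assumptions, not the INGENIOUS code): greedy
# facility-location subset selection over document embeddings.
import numpy as np


def facility_location_greedy(embeddings: np.ndarray, budget: int) -> list[int]:
    """Greedily pick `budget` indices maximizing the facility-location
    objective sum_i max_{j in S} sim(i, j), a monotone submodular function."""
    # Cosine similarity between all candidate examples (O(n^2) memory;
    # fine for a sketch, not for a full pre-training corpus).
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T

    n = sim.shape[0]
    selected: list[int] = []
    # best_cover[i] = similarity of example i to its closest selected example.
    best_cover = np.zeros(n)

    for _ in range(budget):
        # Marginal gain of adding each candidate j to the current subset.
        gains = np.maximum(sim - best_cover[:, None], 0.0).sum(axis=0)
        gains[selected] = -np.inf  # never re-select an example
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected


if __name__ == "__main__":
    # Random vectors standing in for encoded training documents.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 64))
    subset = facility_location_greedy(X, budget=100)
    print(len(subset), subset[:10])
```

Greedy maximization of a monotone submodular objective carries the classic (1 − 1/e) approximation guarantee, which is what makes this style of subset selection tractable; at corpus scale, lazy or stochastic greedy variants are typically used instead of the plain loop shown here.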

Related research

- The MiniPile Challenge for Data-Efficient Language Models (04/17/2023): The ever-growing diversity of pre-training text corpora has equipped lan...
- Koala: An Index for Quantifying Overlaps with Pre-training Corpora (03/26/2023): In very recent years more attention has been placed on probing the role ...
- Continual Pre-Training of Large Language Models: How to (re)warm your model? (08/08/2023): Large language models (LLMs) are routinely pre-trained on billions of to...
- On the importance of pre-training data volume for compact language models (10/08/2020): Recent advances in language modeling have led to computationally intensi...
- Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models (06/05/2023): Large pre-trained neural language models have brought immense progress t...
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification (08/23/2023): Over recent years, an increasing amount of compute and data has been pou...
- Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP (08/10/2022): Web-crawled datasets have enabled remarkable generalization capabilities...