Trained on 100 million words and still in shape: BERT meets British National Corpus

03/17/2023
by David Samuel et al.

While modern masked language models (LMs) are trained on ever larger corpora, here we explore the effects of down-scaling training to a modestly sized but representative, well-balanced, and publicly available English text source: the British National Corpus. We show that pre-training on this carefully curated corpus can yield better performance than the original BERT model. We argue that this type of corpus has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible, and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.
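The abstract compares several masked-LM training objectives against the standard one. As a reference point for readers unfamiliar with that baseline, below is a minimal PyTorch sketch of BERT's original token-masking scheme (select 15% of tokens; replace 80% of those with [MASK], 10% with a random token, and leave 10% unchanged). This is an illustrative reimplementation, not the authors' code; the token ids in the usage lines assume the bert-base-uncased vocabulary ([CLS]=101, [SEP]=102, [MASK]=103, vocab size 30522).

    import torch

    def mask_for_mlm(input_ids, mask_token_id, vocab_size, special_ids=(), mlm_prob=0.15):
        # Labels keep the original tokens; positions that are not
        # selected for prediction are set to -100, which is ignored
        # by torch.nn.CrossEntropyLoss.
        labels = input_ids.clone()
        probs = torch.full(input_ids.shape, mlm_prob)
        for sid in special_ids:  # never corrupt special tokens
            probs[input_ids == sid] = 0.0
        selected = torch.bernoulli(probs).bool()
        labels[~selected] = -100

        corrupted = input_ids.clone()
        # 80% of the selected positions become [MASK].
        replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
        corrupted[replaced] = mask_token_id
        # Half of the remaining 20% (i.e. 10% overall) become a random token.
        randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~replaced
        random_tokens = torch.randint(vocab_size, input_ids.shape, dtype=input_ids.dtype)
        corrupted[randomized] = random_tokens[randomized]
        # The last 10% of selected positions stay unchanged.
        return corrupted, labels

    # Usage with toy bert-base-uncased token ids:
    ids = torch.tensor([[101, 2023, 2003, 1037, 3231, 102]])
    corrupted, labels = mask_for_mlm(ids, mask_token_id=103, vocab_size=30522, special_ids=(101, 102))

Because the masking is sampled on the fly, each epoch sees a different corruption of the same sentence (dynamic masking), which is one of the training-objective variations papers in this space commonly evaluate.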


Related research

01/26/2017
emLam -- a Hungarian Language Modeling baseline
This paper aims to make up for the lack of documented baselines for Hung...

04/17/2023
The MiniPile Challenge for Data-Efficient Language Models
The ever-growing diversity of pre-training text corpora has equipped lan...

12/16/2021
Does Pre-training Induce Systematic Inference? How Masked Language Models Acquire Commonsense Knowledge
Transformer models pre-trained with a masked-language-modeling objective...

04/20/2020
MPNet: Masked and Permuted Pre-training for Language Understanding
BERT adopts masked language modeling (MLM) for pre-training and is one o...

04/04/2023
San-BERT: Extractive Summarization for Sanskrit Documents using BERT and its variants
In this work, we develop language models for the Sanskrit language, name...

08/31/2019
Quantity doesn't buy quality syntax with neural language models
Recurrent neural networks can learn to predict upcoming words remarkably...

07/11/2023
Vacaspati: A Diverse Corpus of Bangla Literature
Bangla (or Bengali) is the fifth most spoken language globally; yet, the...
