Call for Papers – The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

01/27/2023
by Alex Warstadt, et al.

We present the call for papers for the BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus. This shared task is intended for participants with an interest in small-scale language modeling, human language acquisition, low-resource NLP, and cognitive modeling. In partnership with CoNLL and CMCL, we provide a platform for approaches to pretraining with a limited-size corpus sourced from data inspired by the input to children. The task has three tracks, two of which restrict the training data to pre-released datasets of 10M and 100M words and are dedicated to explorations of approaches such as architectural variations, self-supervised objectives, or curriculum learning. The final track restricts only the amount of text used, allowing innovation in the choice of the data, its domain, and even its modality (i.e., data from sources other than text is welcome). We will release a shared evaluation pipeline that scores models on a variety of benchmarks and tasks, including targeted syntactic evaluations and natural language understanding.
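To make "targeted syntactic evaluations" concrete, the sketch below scores a minimal pair the way benchmarks such as BLiMP do: a model passes an item if it assigns higher probability to the grammatical sentence than to its ungrammatical twin. This is a minimal sketch only, not the shared pipeline the organizers will release; the "gpt2" checkpoint and the example sentence pair are illustrative assumptions standing in for a BabyLM submission and a real benchmark item.

```python
# Minimal sketch of a BLiMP-style targeted syntactic evaluation.
# Assumptions: a HuggingFace causal LM checkpoint ("gpt2" stands in for a
# BabyLM submission) and one illustrative minimal pair; the official shared
# evaluation pipeline is separate and will be released by the organizers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Total log-probability the model assigns to a sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean negative
        # log-likelihood over the seq_len - 1 predicted tokens; multiply
        # back by that count to recover the total log-probability.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

# Illustrative minimal pair (subject-verb agreement).
good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."

# The model "passes" this item if it prefers the grammatical sentence.
print(sentence_log_prob(good) > sentence_log_prob(bad))
```

A benchmark like BLiMP aggregates this pass/fail judgment over thousands of such pairs per syntactic phenomenon; the shared pipeline will additionally cover natural language understanding tasks.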


