NoCoLA: The Norwegian Corpus of Linguistic Acceptability

by Matias Jentoft, et al.

While there has been a surge of large language models for Norwegian in recent years, we lack any tool to evaluate their understanding of grammaticality. We present two new Norwegian datasets for this task. NoCoLA_class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences. NoCoLA_zero, by contrast, is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner, i.e. without any further training. In this paper, we describe both datasets in detail, show how to use them with different flavors of language models, and conduct a comparative study of the existing Norwegian language models.
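As an illustration of what zero-shot evaluation of grammatical judgement can look like, here is a minimal sketch. It assumes a pairwise setup that the abstract does not confirm: each test item is an (acceptable, unacceptable) sentence pair, and a model is counted as correct when it assigns the acceptable sentence the higher score. The names `pairwise_accuracy`, `toy_score`, and `toy_pairs` are hypothetical, and the toy scorer is only a stand-in for a real language-model score such as a sentence log-probability.

```python
from typing import Callable, Iterable, Tuple


def pairwise_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the
    acceptable sentence receives the strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no sentence pairs given")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for a model score: penalises each
    immediately repeated word. A real setup would use a language model."""
    words = sentence.lower().split()
    repeats = sum(1 for a, b in zip(words, words[1:]) if a == b)
    return -float(repeats)


# Illustrative pairs made up for this sketch; not taken from the dataset.
toy_pairs = [
    ("Jeg liker kaffe .", "Jeg liker liker kaffe ."),
    ("Hun leser en bok .", "Hun leser en en bok ."),
]

accuracy = pairwise_accuracy(toy_pairs, toy_score)
```

No training step is involved: the scorer is applied as-is, which is what makes the evaluation zero-shot. Swapping `toy_score` for a function that returns a real model's sentence score would leave the rest of the sketch unchanged.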




