GottBERT: a pure German Language Model

by Raphael Scheible, et al.

In recent years, pre-trained language models have advanced the field of natural language processing (NLP). The introduction of Bidirectional Encoder Representations from Transformers (BERT) and its optimized version RoBERTa has had a significant impact and increased the relevance of pre-trained models. Research in this field initially focused on English data, followed by models trained on multilingual text corpora. However, current research shows that multilingual models are inferior to monolingual models. To date, no German monolingual RoBERTa model has been published; we introduce one in this work (GottBERT). The German portion of the OSCAR data set was used as the text corpus. In an evaluation, we compare its performance on the two named entity recognition (NER) tasks CoNLL 2003 and GermEval 2014, as well as on the text classification tasks GermEval 2018 (fine and coarse) and GNAD, with existing German monolingual BERT models and two multilingual ones. GottBERT was pre-trained analogously to the original RoBERTa model using fairseq. All downstream tasks were trained using hyperparameter presets taken from the benchmark of German BERT. The experiments were set up using FARM. Performance was measured by the F1 score. GottBERT was successfully pre-trained on a 256-core TPU pod using the RoBERTa BASE architecture. Even without extensive hyperparameter optimization, GottBERT already outperformed all other tested German and multilingual models in all NER tasks and in one text classification task. To support the German NLP field, we publish GottBERT under the AGPLv3 license.
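As a rough illustration of the F1 metric used for the evaluation, the following minimal sketch computes F1 from raw counts and a macro-averaged F1 over class labels. The abstract does not state which averaging scheme was applied to each task, so the macro average here is an assumption, and the function names and toy labels are hypothetical rather than taken from the paper or from FARM.

```python
from typing import Sequence


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 scores (assumed averaging scheme)."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    gold = ["pos", "pos", "neg", "neg"]
    pred = ["pos", "neg", "neg", "neg"]
    print(f"macro F1 = {macro_f1(gold, pred):.4f}")
```

For NER tasks, F1 is commonly computed over whole entity spans rather than single tokens; the count-based `f1_score` helper applies in that setting once true positives, false positives, and false negatives are counted per entity.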




