Give your Text Representation Models some Love: the Case for Basque

03/31/2020
by Rodrigo Agerri, et al.

Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties rather than building their own. This is suboptimal because, for many languages, the models have been trained on smaller (or lower-quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares its quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state of the art in those tasks for Basque. All benchmarks and models used in this work are publicly available.
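For a sense of how such a monolingual model is consumed downstream, the sketch below loads a Basque BERT checkpoint through the Hugging Face transformers library and runs masked-token prediction as a quick sanity check. The checkpoint name ixa-ehu/berteus-base-cased is an assumption about where the released model is hosted, not something stated in the abstract; the downstream tasks mentioned above (topic classification, sentiment classification, PoS tagging, NER) would start from the same loading step and add a task-specific head for fine-tuning.

```python
# Minimal sketch: querying a monolingual Basque BERT model with the
# Hugging Face transformers library. The checkpoint name below is an
# assumption about where the released model is published; swap in
# whichever Basque checkpoint you actually use.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "ixa-ehu/berteus-base-cased"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Masked-token prediction as a quick sanity check of the pretrained model.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
sentence = f"Donostia {tokenizer.mask_token} hiriburua da."  # "Donostia is the capital of [MASK]."
for candidate in fill_mask(sentence):
    print(candidate["token_str"], round(candidate["score"], 3))
```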
