KinyaBERT: a Morphology-aware Kinyarwanda Language Model

03/16/2022
by Antoine Nzeyimana, et al.

Pre-trained language models such as BERT have been successful at tackling many natural language processing tasks. However, the unsupervised sub-word tokenization methods commonly used in these models (e.g., byte-pair encoding, BPE) are sub-optimal at handling morphologically rich languages. Even given a morphological analyzer, naively feeding a flat sequence of morphemes into a standard BERT architecture is inefficient at capturing morphological compositionality and expressing word-relative syntactic regularities. We address these challenges by proposing a simple yet effective two-tier BERT architecture that leverages a morphological analyzer and explicitly represents morphological compositionality. Despite the success of BERT, most of its evaluations have been conducted on high-resource languages, obscuring its applicability to low-resource languages. We evaluate our proposed method on Kinyarwanda, a low-resource, morphologically rich language, and name the proposed model architecture KinyaBERT. A robust set of experimental results reveals that KinyaBERT outperforms solid baselines by 2% in F1 score on a named entity recognition task and by 4.3% in average score on a machine-translated GLUE benchmark. KinyaBERT fine-tuning converges better and achieves more robust results on multiple tasks, even in the presence of translation noise.
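
The two-tier idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the hash-based `toy_embedding`, the mean-pooling in both tiers, the width `DIM`, and the placeholder morpheme labels are all assumptions made for illustration, standing in for learned embeddings, transformer encoders, and the output of a real morphological analyzer. The point it shows is structural: the first tier composes each word's morphemes into a single word vector, and the second tier works on a sequence whose length is the number of words, not the number of morphemes.

```python
import hashlib

DIM = 4  # toy embedding width (assumption, chosen only for readability)


def toy_embedding(token):
    """Deterministic stand-in for a learned embedding lookup (hypothetical)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def encode_word(morphemes):
    """Tier 1: compose one word vector from that word's morphemes."""
    return mean([toy_embedding(m) for m in morphemes])


def encode_sentence(words):
    """Tier 2: mix each word vector with a sentence-level context vector."""
    word_vecs = [encode_word(morphemes) for morphemes in words]
    context = mean(word_vecs)
    return [[(w + c) / 2 for w, c in zip(vec, context)] for vec in word_vecs]


# Placeholder morpheme labels, not real analyzer output.
sentence = [
    ["SUBJ", "TENSE", "stem_a", "ASPECT"],  # a word with four morphemes
    ["stem_b"],                             # a word with a single morpheme
    ["CLASS", "stem_c"],                    # a word with two morphemes
]

out = encode_sentence(sentence)
print(len(out), len(out[0]))  # 3 4: one vector per word, each of width DIM
```

A flat morpheme sequence would instead have seven positions for this sentence, which is the "naive sequencing" the abstract contrasts against.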

research
03/16/2022

Morphological Processing of Low-Resource Languages: Where We Are and What's Next

Automatic morphological processing can aid downstream natural language p...
research
10/12/2022

Subword Segmental Language Modelling for Nguni Languages

Subwords have become the standard units of text in NLP, enabling efficie...
research
06/13/2023

Tokenization with Factorized Subword Encoding

In recent years, language models have become increasingly larger and mor...
research
08/15/2019

What's Wrong with Hebrew NLP? And How to Make it Right

For languages with simple morphology, such as English, automatic annotat...
research
11/24/2020

Enhancing deep neural networks with morphological information

Currently, deep learning approaches are superior in natural language pro...
research
06/15/2021

Knowledge-Rich BERT Embeddings for Readability Assessment

Automatic readability assessment (ARA) is the task of evaluating the lev...
research
08/18/2015

Probabilistic Modelling of Morphologically Rich Languages

This thesis investigates how the sub-structure of words can be accounted...
