KoreALBERT: Pretraining a Lite BERT Model for Korean Language Understanding

01/27/2021
by Hyunjae Lee, et al.

A Lite BERT (ALBERT) has been introduced to scale up deep bidirectional representation learning for natural languages. Due to the lack of pretrained ALBERT models for the Korean language, the best available practice has been to use the multilingual model or to fall back on other BERT-based models. In this paper, we develop and pretrain KoreALBERT, a monolingual ALBERT model specifically for Korean language understanding. We introduce a new training objective, Word Order Prediction (WOP), and use it alongside the existing MLM and SOP objectives with the same architecture and model parameters. Despite having significantly fewer model parameters (and thus being quicker to train), our pretrained KoreALBERT outperforms its BERT counterpart on six different NLU tasks. Consistent with the empirical results in English reported by Lan et al., KoreALBERT appears to improve downstream task performance involving multi-sentence encoding for the Korean language. The pretrained KoreALBERT is publicly available to encourage research and application development for Korean NLP.

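The exact formulation of Word Order Prediction is given in the full text; as a rough illustration only, the minimal Python sketch below shows one way a WOP-style training example could be constructed, assuming the objective permutes a subset of token positions and asks the model to recover each token's original position (complementing MLM's token-identity prediction and SOP's sentence-order prediction). The function name make_wop_example, the shuffle_ratio parameter, and the position-index labels are our own illustrative choices, not details taken from the paper.

import random

def make_wop_example(tokens, shuffle_ratio=0.15, seed=None):
    """Build one WOP-style training example (illustrative sketch, not the paper's code).

    A randomly chosen subset of token positions is permuted; the label at each
    permuted position is the index that token originally occupied, so a model
    can be trained to recover the original word order.
    """
    rng = random.Random(seed)
    n = len(tokens)
    k = max(2, int(n * shuffle_ratio))           # permute at least two tokens
    positions = sorted(rng.sample(range(n), k))  # positions selected for permutation
    permuted = positions[:]
    rng.shuffle(permuted)

    shuffled = list(tokens)
    labels = [-100] * n                          # -100: conventional "ignore" index for the loss
    for tgt, src in zip(positions, permuted):
        shuffled[tgt] = tokens[src]
        labels[tgt] = src                        # original position to be predicted
    return shuffled, labels

if __name__ == "__main__":
    toks = ["나는", "어제", "친구와", "영화를", "보았다"]
    shuffled, labels = make_wop_example(toks, shuffle_ratio=0.5, seed=0)
    print(shuffled)
    print(labels)

Run as a script, this selects two of the five example tokens for permutation and leaves -100 at all other positions, so only permuted positions would contribute to such an auxiliary loss.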