IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

11/02/2020
by   Fajri Koto, et al.
11

Although the Indonesian language is spoken by almost 200 million people and the 10th most spoken language in the world, it is under-represented in NLP research. Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization. In this work, we release the IndoLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse. We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM, in addition to benchmarking it against existing resources. Our experiments show that IndoBERT achieves state-of-the-art performance over most of the tasks in IndoLEM.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/12/2021

LaoPLM: Pre-trained Language Models for Lao

Trained on the large corpus, pre-trained language models (PLMs) can capt...
research
02/02/2023

idT5: Indonesian Version of Multilingual T5 Transformer

Indonesian language is spoken by almost 200 million people and is the 10...
research
09/25/2021

DziriBERT: a Pre-trained Language Model for the Algerian Dialect

Pre-trained transformers are now the de facto models in Natural Language...
research
07/28/2022

An Interpretability Evaluation Benchmark for Pre-trained Language Models

While pre-trained language models (LMs) have brought great improvements ...
research
12/15/2022

You were saying? – Spoken Language in the V3C Dataset

This paper presents an analysis of the distribution of spoken language i...
research
07/07/2023

A Side-by-side Comparison of Transformers for English Implicit Discourse Relation Classification

Though discourse parsing can help multiple NLP fields, there has been no...
research
07/10/2023

KU-DMIS-MSRA at RadSum23: Pre-trained Vision-Language Model for Radiology Report Summarization

In this paper, we introduce CheXOFA, a new pre-trained vision-language m...

Please sign up or login with your details

Forgot password? Click here to reset