An open access NLP dataset for Arabic dialects: Data collection, labeling, and model construction

02/07/2021
by ElMehdi Boujou, et al.

Natural Language Processing (NLP) is today a very active field of research and innovation. Many applications, however, need large data sets for supervised learning, suitably labeled for training. This includes applications for the Arabic language and its national dialects. Yet open access labeled data sets in Arabic and its dialects are lacking in the Data Science ecosystem, and this lack can hinder innovation and research in this field. In this work, we present an open data set of social media content in several Arabic dialects. The data was collected from the Twitter social network and consists of more than 50K tweets in five (5) national dialects. Furthermore, the data was labeled for several applications, namely dialect detection, topic detection, and sentiment analysis. We publish this data as open access to encourage innovation and further work in the field of NLP for Arabic dialects and social media. A selection of models was built using this data set and is presented in this paper along with their performance.

