GRDD: A Dataset for Greek Dialectal NLP

In this paper, we present a dataset for the computational study of a number of Modern Greek dialects. It consists of raw text data from four dialects of Modern Greek: Cretan, Pontic, Northern Greek, and Cypriot Greek. The dataset is of considerable size, albeit imbalanced, and represents the first attempt to create large-scale dialectal resources of this type for Modern Greek dialects. We then use the dataset to perform dialect identification, experimenting with traditional ML algorithms as well as simple DL architectures. The results show very good performance on the task, which suggests that the dialects in question have characteristics distinct enough for even simple ML models to perform well. Error analysis of the top-performing algorithms shows that in a number of cases the errors are due to insufficient dataset cleaning.
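To make the dialect identification task concrete, the following is a minimal sketch of one traditional ML approach: a multinomial Naive Bayes classifier over character trigrams. The abstract does not name the algorithms or features actually used, so the classifier choice, the `NgramNaiveBayes` class, and the `char_ngrams` helper are illustrative assumptions. The training strings are synthetic placeholders, not samples from the dataset.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a space-padded, lowercased string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing.

    Hypothetical illustration only; not the method used in the paper.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of training texts
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Synthetic placeholder strings standing in for dialectal text.
train_texts = ["zaza zazu zaza", "zuza zaza zazu", "koko kiko koko", "kiko koko kiki"]
train_labels = ["dialect_a", "dialect_a", "dialect_b", "dialect_b"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
print(model.predict("zazu zaza"))
print(model.predict("kiki koko"))
```

Character n-grams are a common feature choice for distinguishing closely related language varieties, because they pick up spelling and morphological patterns without needing a tokenizer or lexicon. With an imbalanced dataset like the one described, the class prior term would favor the larger dialects, so per-class metrics would be more informative than overall accuracy.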


