GECTurk: Grammatical Error Correction and Detection Dataset for Turkish

09/20/2023
by Atakan Kara, et al.

Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common way to overcome this scarcity. However, it is not straightforward for morphologically rich languages like Turkish, whose complex writing rules require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a. writing rules) implemented through complex transformation functions. Using this pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic test set by manually annotating a set of movie reviews. We implement three baselines, formulating the task as i) neural machine translation, ii) sequence tagging, and iii) prefix tuning with a pretrained decoder-only model, and achieve strong results. Furthermore, we perform exhaustive experiments on out-of-domain datasets to gain insight into the transferability and robustness of the proposed approaches. Our results suggest that our corpus, GECTurk, is of high quality and allows knowledge transfer in the out-of-domain setting. To encourage further research on Turkish GEC, we release our datasets, baseline models, and the synthetic data generation pipeline at https://github.com/GGLAB-KU/gecturk.
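To make the idea of a transformation function concrete, the following is a minimal sketch of how a single writing rule could be turned into a corruption step that yields a parallel (erroneous, correct) sentence pair. It is not the released pipeline: the names `corrupt_de_da`, `make_pair`, and `RULES` are hypothetical, and the rule is deliberately simplified to a regular expression, whereas the abstract notes that real rules need phonological, morphological, and syntactic information. The rule assumed here is that the conjunction "de/da" is written as a separate word, so the corruption attaches it to the preceding word.

```python
import re

# Assumed, simplified rule: the conjunction "de"/"da" is a separate word.
# Python's str patterns are Unicode-aware, so \w also matches Turkish letters.
CONJ_DE_DA = re.compile(r"(\w+) (de|da)\b")


def corrupt_de_da(sentence):
    """Attach the first standalone "de"/"da" to the word before it.

    Returns the corrupted sentence, or None if the rule does not apply.
    """
    match = CONJ_DE_DA.search(sentence)
    if match is None:
        return None
    joined = match.group(1) + match.group(2)
    return sentence[:match.start()] + joined + sentence[match.end():]


# Hypothetical rule registry: (rule name, transformation function).
RULES = [("CONJ_DE_SEP", corrupt_de_da)]


def make_pair(sentence, rules=RULES):
    """Apply the first applicable rule to a correct sentence.

    Returns a dict with the erroneous source, the correct target, and the
    name of the rule that produced the error, or None if no rule applies.
    """
    for name, rule in rules:
        corrupted = rule(sentence)
        if corrupted is not None:
            return {"source": corrupted, "target": sentence, "rule": name}
    return None


if __name__ == "__main__":
    print(make_pair("Ben de geldim."))
    print(make_pair("Geldim."))
```

Keeping each rule as an independent function in a registry is one way to get the extensibility the abstract describes: adding a rule means adding one entry, and the rule name recorded with each pair can double as a detection label.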


