Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction

by Shirong Ma, et al.

Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in daily life. Recently, many data-driven approaches have been proposed to advance CGEC research. However, the field has two major limitations. First, the lack of high-quality annotated training corpora prevents existing CGEC models from improving significantly. Second, the grammatical errors in widely used test sets are not made by native Chinese speakers, which leaves a significant gap between CGEC models and real-world applications. In this paper, we propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. Additionally, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive experiments and detailed analyses demonstrate that the training data constructed by our method effectively improves the performance of CGEC models, and indicate that our benchmark is an excellent resource for further development of the CGEC field.
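The abstract does not spell out the linguistic rules themselves, so the following is only a minimal sketch of what rule-based error generation can look like in general, not the authors' method. It assumes pre-tokenized input and three illustrative corruption rules (dropping a function word, duplicating a token, swapping neighbors); the rule names, the `FUNCTION_WORDS` set, and `make_pair` are all hypothetical.

```python
import random

# Hypothetical function-word list; the paper's actual rules are not
# described in the abstract.
FUNCTION_WORDS = {"的", "了", "在", "被", "把"}


def drop_function_word(tokens, rng):
    """Missing-component error: delete one function word, if present."""
    idx = [i for i, t in enumerate(tokens) if t in FUNCTION_WORDS]
    if not idx:
        return None
    i = rng.choice(idx)
    return tokens[:i] + tokens[i + 1:]


def duplicate_token(tokens, rng):
    """Redundant-component error: repeat one token in place."""
    if not tokens:
        return None
    i = rng.randrange(len(tokens))
    return tokens[:i + 1] + tokens[i:]


def swap_adjacent(tokens, rng):
    """Word-order error: swap two neighboring tokens that differ."""
    idx = [i for i in range(len(tokens) - 1) if tokens[i] != tokens[i + 1]]
    if not idx:
        return None
    i = rng.choice(idx)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


RULES = [drop_function_word, duplicate_token, swap_adjacent]


def make_pair(tokens, rng):
    """Return (erroneous, correct) token lists, or None if no rule applies."""
    rules = list(RULES)
    rng.shuffle(rules)  # try the rules in random order
    for rule in rules:
        corrupted = rule(tokens, rng)
        if corrupted is not None and corrupted != tokens:
            return corrupted, list(tokens)
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = ["我", "在", "北京", "吃", "了", "饭"]
    erroneous, correct = make_pair(sentence, rng)
    print("".join(erroneous), "->", "".join(correct))
```

Each call yields one (erroneous, correct) pair from a clean sentence, which is the shape of training data a correction model consumes. A realistic rule set would presumably condition on syntactic or part-of-speech information rather than on a fixed word list.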




FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction

Grammatical Error Correction (GEC) has been broadly applied in automatic...

Overview of CTC 2021: Chinese Text Correction for Native Speakers

In this paper, we present an overview of the CTC 2021, a Chinese text co...

CSCD-IME: Correcting Spelling Errors Generated by Pinyin IME

Chinese Spelling Correction (CSC) is a task to detect and correct spelli...

NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts

We introduce NaSGEC, a new dataset to facilitate research on Chinese gra...

READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises

For many real-world applications, the user-generated inputs usually cont...

Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models

Synthetic data generation is widely known to boost the accuracy of neura...

Adaptable Filtering using Hierarchical Embeddings for Chinese Spell Check

Spell check is a useful application which involves processing noisy huma...
