A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages

by   Nikita Martynov, et al.

Modern large language models demonstrate impressive capabilities in text generation and generalization. However, they often struggle with text editing tasks, particularly with correcting spelling errors and mistypings. In this paper, we present a methodology for generative spelling correction (SC), tested on English and Russian and potentially extensible to any language with minor changes. Our research mainly focuses on exploring natural spelling errors and mistypings in texts and on the ways those errors can be emulated in correct sentences to effectively enrich generative models' pre-training procedure. We investigate the impact of such emulations and the models' abilities across different text domains. In this work, we study two spelling corruption techniques: 1) the first mimics human behavior when making a mistake by leveraging error statistics from a particular dataset, and 2) the second injects the most common spelling errors, keyboard miss-clicks, and some heuristics into the texts. We conducted experiments with various corruption strategies, model architectures, and sizes at the pre-training and fine-tuning stages, and evaluated the models on single-domain and multi-domain test sets. As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation), a library for automatic generative SC that includes a family of pre-trained generative models and built-in augmentation algorithms.
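The second corruption technique described above (common errors, keyboard miss-clicks, and heuristics) can be sketched as follows. This is a minimal illustrative implementation, not SAGE's actual API: the adjacency map and function names are assumptions for the sake of the example.

```python
import random

# Hypothetical QWERTY adjacency map used to emulate keyboard miss-clicks
# (an assumption for illustration; SAGE's own map may differ).
KEYBOARD_NEIGHBORS = {
    "a": "qwsz", "b": "vghn", "c": "xdfv", "d": "serfcx", "e": "wsdr",
    "f": "drtgvc", "g": "ftyhbv", "h": "gyujnb", "i": "ujko", "j": "huikmn",
    "k": "jiolm", "l": "kop", "m": "njk", "n": "bhjm", "o": "iklp",
    "p": "ol", "q": "wa", "r": "edft", "s": "awedxz", "t": "rfgy",
    "u": "yhji", "v": "cfgb", "w": "qase", "x": "zsdc", "y": "tghu",
    "z": "asx",
}

def corrupt(text: str, error_rate: float = 0.1, seed: int = 0) -> str:
    """Emulate mistypings in a correct sentence: with probability
    `error_rate`, replace a character with a keyboard neighbor,
    drop it, or duplicate it."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        low = ch.lower()
        if low in KEYBOARD_NEIGHBORS and rng.random() < error_rate:
            op = rng.choice(["swap", "drop", "double"])
            if op == "swap":
                out.append(rng.choice(KEYBOARD_NEIGHBORS[low]))
            elif op == "double":
                out.append(ch + ch)
            # "drop": emit nothing for this character
        else:
            out.append(ch)
    return "".join(out)
```

Applying such a function to clean sentences yields (corrupted, correct) pairs that can be used to enrich a generative model's pre-training data; the statistic-based technique differs only in that the error operations and their probabilities are estimated from a dataset of real human errors.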




