Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-Gram Embeddings

by   Pieter Fivez, et al.

We present an unsupervised context-sensitive spelling correction method for clinical free-text that uses word and character n-gram embeddings. Our method generates misspelling replacement candidates and ranks them according to their semantic fit, by calculating a weighted cosine similarity between the vectorized representation of a candidate and the misspelling context. To tune the parameters of this model, we generate self-induced spelling error corpora. We perform our experiments for two languages. For English, we greatly outperform off-the-shelf spelling correction tools on a manually annotated MIMIC-III test set, and counter the frequency bias of a noisy channel model, showing that neural embeddings can be successfully exploited to improve upon the state-of-the-art. For Dutch, we also outperform an off-the-shelf spelling correction tool on manually annotated clinical records from the Antwerp University Hospital, but can offer no empirical evidence that our method counters the frequency bias of a noisy channel model in this case as well. However, both our context-sensitive model and our implementation of the noisy channel model obtain high scores on the test set, establishing a state-of-the-art for Dutch clinical spelling correction with the noisy channel model.


page 9

page 10


Vartani Spellcheck – Automatic Context-Sensitive Spelling Correction of OCR-generated Hindi Text Using BERT and Levenshtein Distance

Traditional Optical Character Recognition (OCR) systems that generate te...

Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models

Text normalization is an important enabling technology for several NLP t...

Combining Self-Training and Self-Supervised Learning for Unsupervised Disfluency Detection

Most existing approaches to disfluency detection heavily rely on human-a...

A context sensitive real-time Spell Checker with language adaptability

We present a novel language adaptable spell checking system which detect...

Learning From How Human Correct

In industry NLP application, our manually labeled data has a certain num...

From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction

A great deal of historical corpora suffer from errors introduced by the ...

Phonetic and Visual Priors for Decipherment of Informal Romanization

Informal romanization is an idiosyncratic process used by humans in info...

Please sign up or login with your details

Forgot password? Click here to reset