Neural OCR Post-Hoc Correction of Historical Corpora

02/01/2021
by Lijun Lyu, et al.

Optical character recognition (OCR) is crucial for deeper access to historical collections. OCR must account for orthographic variation, typefaces, and language evolution (e.g., new letters and word spellings), which are the main sources of character, word, and word segmentation transcription errors. For digital corpora of historical prints, these errors are further exacerbated by low scan quality and a lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach that combines a recurrent network (RNN) with a deep convolutional network (ConvNet) to correct OCR transcription errors. Operating at the character level, we flexibly capture errors and decode the corrected output using a novel attention mechanism. To account for the similarity between input and output, we propose a new loss function that rewards the model's correcting behavior. Evaluation on a historical book corpus in German shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3
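The abstract reports results as word error rate (WER). As a rough illustration of that metric only, and not of the authors' evaluation code, the sketch below computes WER as the word-level edit distance between a reference transcription and an OCR output, divided by the number of reference words. The function names `edit_distance` and `word_error_rate`, the whitespace tokenization, and the German example strings are assumptions made for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # drop a reference token
                            curr[j - 1] + 1,      # extra hypothesis token
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words.

    Assumption: words are split on whitespace, so a word segmentation
    error ("alte" read as "al te") counts as two word-level edits.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    reference = "der alte Mann"
    print(word_error_rate(reference, "der alle Mann"))   # one substituted word
    print(word_error_rate(reference, "der al te Mann"))  # one split word
```

Because the same `edit_distance` function works on any sequence, passing the raw strings instead of word lists gives a character-level distance, which is one way to compare character and word error rates on the same pair of texts.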

research
07/30/2023

Optimizing the Neural Network Training for OCR Error Correction of Historical Hebrew Texts

Over the past few decades, large archives of paper-based documents such ...
research
04/23/2020

A Tool for Facilitating OCR Postediting in Historical Documents

Optical character recognition (OCR) for historical documents is a comple...
research
05/28/2019

A Cost Efficient Approach to Correct OCR Errors in Large Document Collections

Word error rate of an OCR is often higher than its character error rate....
research
06/12/2021

Toward the Optimized Crowdsourcing Strategy for OCR Post-Correction

Digitization of historical documents is a challenging task in many digit...
research
10/22/2021

Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts

Substantial amounts of work are required to clean large collections of d...
research
02/28/2022

Inkorrect: Online Handwriting Spelling Correction

We introduce Inkorrect, a data- and label-efficient approach for online ...
research
10/28/2020

Character Entropy in Modern and Historical Texts: Comparison Metrics for an Undeciphered Manuscript

This paper outlines the creation of three corpora for multilingual compa...
