Handwritten and Printed Text Segmentation: A Signature Case Study

07/15/2023
by   Sina Gholamian, et al.
0

While analyzing scanned documents, handwritten text can overlay printed text. This causes difficulties during the optical character recognition (OCR) and digitization process of documents, and subsequently, hurts downstream NLP tasks. Prior research either focuses only on the binary classification of handwritten text, or performs a three-class segmentation of the document, i.e., recognition of handwritten, printed, and background pixels. This results in the assignment of the handwritten and printed overlapping pixels to only one of the classes, and thus, they are not accounted for in the other class. Thus, in this research, we develop novel approaches for addressing the challenges of handwritten and printed text segmentation with the goal of recovering text in different classes in whole, especially improving the segmentation performance on the overlapping parts. As such, to facilitate with this task, we introduce a new dataset, SignaTR6K, collected from real legal documents, as well as a new model architecture for handwritten and printed text segmentation task. Our best configuration outperforms the prior work on two different datasets by 17.9 7.3

READ FULL TEXT

page 3

page 8

page 12

page 16

page 17

research
08/26/2021

StackMix and Blot Augmentations for Handwritten Text Recognition

This paper proposes a handwritten text recognition(HTR) system that outp...
research
07/07/2021

Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches

Although abbreviations are fairly common in handwritten sources, particu...
research
06/22/2021

Evaluation of a Region Proposal Architecture for Multi-task Document Layout Analysis

Automatically recognizing the layout of handwritten documents is an impo...
research
04/28/2019

TMIXT: A process flow for Transcribing MIXed handwritten and machine-printed Text

Handling large corpuses of documents is of significant importance in man...
research
06/03/2023

TransDocAnalyser: A Framework for Offline Semi-structured Handwritten Document Analysis in the Legal Domain

State-of-the-art offline Optical Character Recognition (OCR) frameworks ...
research
08/18/2020

EASTER: Efficient and Scalable Text Recognizer

Recent progress in deep learning has led to the development of Optical C...
research
06/20/2022

Open Set Classification of Untranscribed Handwritten Documents

Huge amounts of digital page images of important manuscripts are preserv...

Please sign up or login with your details

Forgot password? Click here to reset