Robust Text Line Detection in Historical Documents: Learning and Evaluation Methods

03/23/2022
by Mélodie Boillet, et al.

Text line segmentation is one of the key steps in historical document understanding. It is challenging due to the variety of fonts, contents and writing styles, and to the quality of documents that have degraded over the years. In this paper, we address the limitations that currently prevent people from building line segmentation models with a high generalization capacity. We present a study conducted using three state-of-the-art systems (Doc-UFCN, dhSegment and ARU-Net) and show that it is possible to build generic models, trained on a wide variety of historical document datasets, that can correctly segment diverse unseen pages. This paper also highlights the importance of the annotations used during training: each existing dataset is annotated differently. We present a unification of the annotations and show its positive impact on the final text recognition results. To this end, we present a complete evaluation strategy that combines standard pixel-level metrics, object-level metrics and newly introduced goal-oriented metrics.
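As a rough illustration of what a standard pixel-level metric looks like, the sketch below computes intersection-over-union, precision and recall between a predicted and a ground-truth binary text-line mask. The function name `pixel_metrics`, the NumPy mask representation and the convention of returning 1.0 when a denominator is zero are assumptions made for this example; the abstract does not specify how the paper defines or implements its metrics.

```python
import numpy as np


def pixel_metrics(pred, gt):
    """Pixel-level IoU, precision and recall for two binary masks.

    Hypothetical helper for illustration only; not the paper's code.
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    if pred.shape != gt.shape:
        raise ValueError("masks must have the same shape")

    tp = int(np.logical_and(pred, gt).sum())    # text pixels found
    fp = int(np.logical_and(pred, ~gt).sum())   # background marked as text
    fn = int(np.logical_and(~pred, gt).sum())   # text pixels missed

    # Assumption: an empty denominator counts as a perfect score.
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return {"iou": iou, "precision": precision, "recall": recall}


# Toy 4x4 page: the ground truth has one text line on row 1.
gt_mask = np.zeros((4, 4), dtype=bool)
gt_mask[1, :] = True

# The prediction misses one pixel of that line and adds one spurious pixel.
pred_mask = np.zeros((4, 4), dtype=bool)
pred_mask[1, :3] = True
pred_mask[2, 0] = True

scores = pixel_metrics(pred_mask, gt_mask)
print(scores)
```

Pixel-level scores of this kind do not show whether neighbouring lines were merged or a single line was split, which is one reason to complement them with object-level and goal-oriented metrics.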


