An Evaluation of DNN Architectures for Page Segmentation of Historical Newspapers

04/15/2020
by Bernhard Liebl, et al.

One important and particularly challenging step in the optical character recognition (OCR) of historical documents with complex layouts, such as newspapers, is the separation of text from non-text content (e.g. page borders or illustrations). This step is commonly referred to as page segmentation. While various rule-based algorithms have been proposed, the applicability of Deep Neural Networks (DNNs) to this task has recently gained a lot of attention. In this paper, we perform a systematic evaluation of 11 different published DNN backbone architectures and 9 different tiling and scaling configurations for separating text, tables, or table column lines. We also show how the number of labels and the number of training pages influence the segmentation quality, which we measure using the Matthews Correlation Coefficient. Our results show that, depending on the task, Inception-ResNet-v2 and EfficientNet backbones work best, that vertical tiling is generally preferable to other tiling approaches, and that training data comprising 30 to 40 pages is sufficient in most cases.
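Since the abstract names the Matthews Correlation Coefficient (MCC) as the quality measure, a minimal sketch of how it could be computed for a single label may help. It assumes a binary, per-pixel comparison of a predicted mask against a ground-truth mask; the function name `mcc_binary` and the zero-denominator convention are illustrative assumptions, and the abstract does not state how the paper aggregates the score across labels or pages.

```python
import numpy as np


def mcc_binary(pred, truth):
    """Matthews Correlation Coefficient for two binary masks.

    Assumption: both inputs are arrays of the same shape in which
    nonzero marks a pixel as belonging to the label (e.g. "text").
    """
    pred = np.asarray(pred, dtype=bool).ravel()
    truth = np.asarray(truth, dtype=bool).ravel()

    # Confusion-matrix counts over all pixels.
    tp = float(np.sum(pred & truth))
    tn = float(np.sum(~pred & ~truth))
    fp = float(np.sum(pred & ~truth))
    fn = float(np.sum(~pred & truth))

    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0.0:
        # Assumed convention: return 0 when the score is undefined,
        # e.g. when one of the masks is entirely empty or entirely full.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    truth = np.array([[1, 1, 0, 0]])
    pred = np.array([[1, 0, 0, 0]])
    print(round(mcc_binary(pred, truth), 4))
```

MCC ranges from -1 (fully inverted prediction) through 0 (no better than chance) to 1 (perfect agreement). Unlike plain pixel accuracy, it stays informative when one class dominates, which is common on pages where most pixels are background.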


research
05/06/2020

Automated Transcription for Pre-Modern Japanese Kuzushiji Documents by Random Lines Erasure and Curriculum Learning

Recognizing the full-page of Japanese historical documents is a challeng...
research
07/02/2020

Automatic Page Segmentation Without Decompressing the Run-Length Compressed Text Documents

Page segmentation is considered to be the crucial stage for the automati...
research
12/09/2020

Page Tables: Keeping them Flat and Hot (Cached)

As memory capacity has outstripped TLB coverage, applications that use l...
research
09/22/2020

Whole page recognition of historical handwriting

Historical handwritten documents guard an important part of human knowle...
research
02/20/2020

Processing topical queries on images of historical newspaper pages

Historical newspapers are a source of research for the human and social ...
research
12/23/2021

Digital Editions as Distant Supervision for Layout Analysis of Printed Books

Archivists, textual scholars, and historians often produce digital editi...
research
07/02/2022

Sequence-aware multimodal page classification of Brazilian legal documents

The Brazilian Supreme Court receives tens of thousands of cases each sem...
