Case Study of a highly automated Layout Analysis and OCR of an incunabulum: 'Der Heiligen Leben' (1488)

01/20/2017
by Christian Reul, et al.

This paper provides the first thorough documentation of a high-quality digitization process applied to an early printed book from the incunabulum period (1450-1500). The entire OCR-related workflow, including preprocessing, layout analysis, and text recognition, is illustrated in detail using the example of 'Der Heiligen Leben', printed in Nuremberg in 1488. For each step, the required time expenditure was recorded. The character recognition yielded excellent results at both the character level (97.57%) and the word level. Furthermore, a highly automated method (LAREX) and a manual method (Aletheia) for layout analysis were compared. Considerably automating the segmentation reduced the required human effort from over 100 hours to less than six hours, with only a slight drop in OCR accuracy. Realistic estimates of the human effort necessary for full-text extraction from incunabula can be derived from this study. The printed pages of the complete work, together with the OCR results, are available online to be inspected and downloaded.


