A Probabilistic Framework for Lexicon-based Keyword Spotting in Handwritten Text Images

04/09/2021
by   E. Vidal, et al.
0

Query by String Keyword Spotting (KWS) is here considered as a key technology for indexing large collections of handwritten text images to allow fast textual access to the contents of these collections. Under this perspective, a probabilistic framework for lexicon-based KWS in text images is presented. The presentation aims at providing a tutorial view that helps to understand the relations between classical statements of KWS and the relative challenges entailed by these statements. More specifically, the development of the proposed framework makes it self-evident that word recognition or classification implicitly or explicitly underlies any formulation of KWS. Moreover, it clearly suggests that the same statistical models and training methods successfully used for handwriting text recognition can advantageously be used also for KWS, even though KWS does not generally require or rely on any kind of previously produced image transcripts. These ideas are developed into a specific, probabilistically sound approach for segmentation-free, lexicon-based, query-by-string KWS. Experiments carried out using this approach are presented, which support the consistency and general interest of the proposed framework. Several datasets, traditionally used for KWS benchmarking are considered, with results significantly better than those previously published for these datasets. In addition, results on two new, larger handwritten text image datasets are reported, showing the great potential of the methods proposed in this paper for indexing and textual search in large collections of handwritten documents.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/20/2022

Open Set Classification of Untranscribed Handwritten Documents

Huge amounts of digital page images of important manuscripts are preserv...
research
10/02/2021

Asking questions on handwritten document collections

This work addresses the problem of Question Answering (QA) on handwritte...
research
03/22/2017

Neural Ctrl-F: Segmentation-free Query-by-String Word Spotting in Handwritten Manuscript Collections

In this paper, we approach the problem of segmentation-free query-by-str...
research
04/06/2020

Indexing Highly Repetitive String Collections

Two decades ago, a breakthrough in indexing string collections made it p...
research
10/14/2020

Contextual Pattern Matching

The research on indexing repetitive string collections has focused on th...
research
09/21/2022

A Few Shot Multi-Representation Approach for N-gram Spotting in Historical Manuscripts

Despite recent advances in automatic text recognition, the performance r...
research
07/07/2021

Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches

Although abbreviations are fairly common in handwritten sources, particu...

Please sign up or login with your details

Forgot password? Click here to reset