OCR quality affects perceived usefulness of historical newspaper clippings – a user study

03/04/2022
by Kimmo Kettunen, et al.

The effects of Optical Character Recognition (OCR) quality on historical information retrieval have so far been studied in data-oriented scenarios regarding the effectiveness of retrieval results. Such studies have either focused on the effects of artificially degraded OCR quality (see, e.g., [1-2]) or utilized test collections containing texts based on authentic low-quality OCR data (see, e.g., [3]). In this paper the effects of OCR quality are studied in a user-oriented information retrieval setting. Thirty-two users subjectively evaluated the query results for six topics each (out of 30 topics), based on pre-formulated queries, in a simulated work task setting. To the best of our knowledge, our simulated work task experiment is the first to show empirically that users' subjective relevance assessments of retrieved documents are affected by a change in the quality of optically read text. Users of historical newspaper collections have so far commented on the effects of OCR'ed data quality mainly in impressionistic ways, and controlled user environments for studying the effects of OCR quality on users' relevance assessments of retrieval results have been missing. To remedy this, the National Library of Finland (NLF) set up an experimental query environment for the contents of one Finnish historical newspaper, Uusi Suometar 1869-1918, in order to compare users' evaluations of search results at two different OCR qualities for digitized newspaper articles. The query interface could present the same underlying document to the user in one of two versions, based on either the lower or the higher OCR quality, and the choice between them was randomized. The users did not know about the quality differences in the article texts they evaluated. The main result of the study is that improved optical character recognition quality significantly affects the perceived usefulness of historical newspaper articles.
The mean average evaluation score for the improved OCR results was 7.94% higher than the mean average evaluation score of the old OCR results.

research
06/01/2022

Optical character recognition quality affects perceived usefulness of historical newspaper clippings

Introduction. We study effect of different quality optical character rec...
research
03/03/2022

Do Perceived Gender Biases in Retrieval Results Affect Relevance Judgements?

This work investigates the effect of gender-stereotypical biases in the ...
research
10/23/2017

Does it matter which search engine is used? A user study using post-task relevance judgments

The objective of this research was to find out how the two search engine...
research
01/19/2022

Validating Simulations of User Query Variants

System-oriented IR evaluations are limited to rather abstract understand...
research
11/20/2021

Effects of context, complexity, and clustering on evaluation for math formula retrieval

There are now several test collections for the formula retrieval task, i...
research
07/15/2022

Transcribing Medieval Manuscripts for Machine Learning

In the early twentieth century, many scholars focused on the preparation...
research
03/10/2021

"This Browser is Lightning Fast": The Effects of Message Content on Perceived Performance

With technical performance being similar for various web browsers, impro...