Batch Evaluation Metrics in Information Retrieval: Measures, Scales, and Meaning

07/07/2022
by Alistair Moffat et al.

A sequence of recent papers has considered the role of measurement scales in information retrieval (IR) experimentation, and has argued that only uniform-step interval scales should be used. On that view, well-known metrics such as reciprocal rank, expected reciprocal rank, normalized discounted cumulative gain, and average precision must either be discarded as measurement tools, or adapted so that their values lie at uniformly spaced points on the number line. These papers paint a rather bleak picture of past decades of IR evaluation, at odds with the community's overall emphasis on practical experimentation and measurable improvement.

Our purpose in this work is to challenge that position. In particular, we argue that mappings from categorical and ordinal data to sets of points on the number line are valid provided there is an external reason for each target point to have been selected. We first consider the general role of measurement scales, and of categorical, ordinal, interval, ratio, and absolute data collections. In connection with the first two of those categories we also give examples of the knowledge that is captured and represented by numeric mappings to the real number line.

Focusing then on information retrieval, we argue that document rankings are categorical data, and that the role of an effectiveness metric is to provide a single value representing the usefulness of any given ranking to a user or population of users, with usefulness able to be represented as a continuous variable on a ratio scale. That is, we argue that current IR metrics are well founded and, moreover, that those metrics are more meaningful in their current form than in the proposed "intervalized" versions.
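As a concrete illustration of the metrics the abstract names, the following is a minimal Python sketch (the function names and the example ranking are mine, not code or data from the paper) computing reciprocal rank, average precision, and nDCG over a ranking represented as a list of numeric gains, i.e. categorical relevance judgments mapped to points on the number line:

```python
import math

def reciprocal_rank(gains):
    """1/rank of the first relevant document; 0 if none is relevant."""
    for rank, g in enumerate(gains, start=1):
        if g > 0:
            return 1.0 / rank
    return 0.0

def average_precision(gains, num_relevant):
    """Mean of precision@k over the ranks k that hold a relevant document."""
    hits, total = 0, 0.0
    for k, g in enumerate(gains, start=1):
        if g > 0:
            hits += 1
            total += hits / k
    return total / num_relevant if num_relevant else 0.0

def dcg(gains):
    """Discounted cumulative gain with the usual log2 rank discount."""
    return sum(g / math.log2(k + 1) for k, g in enumerate(gains, start=1))

def ndcg(gains, ideal_gains):
    """DCG normalized by the DCG of an ideal (gain-sorted) ranking."""
    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

# A toy binary-relevance ranking: relevant documents at ranks 2 and 4.
ranking = [0, 1, 0, 1]
print(reciprocal_rank(ranking))                    # 0.5
print(average_precision(ranking, num_relevant=2))  # (1/2 + 2/4) / 2 = 0.5
```

Note that each metric maps a ranking to a point on the number line, and that nothing in these definitions forces those points to be uniformly spaced; whether that matters is exactly the question the abstract debates.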


research
01/07/2021

Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales

Recently, it was shown that most popular IR measures are not interval-sc...
research
04/02/2023

An Intrinsic Framework of Information Retrieval Evaluation Measures

Information retrieval (IR) evaluation measures are cornerstones for dete...
research
12/22/2022

Response to Moffat's Comment on "Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales"

Moffat recently commented on our previous work. Our work focused on how ...
research
09/07/2018

Data Requirements for Evaluation of Personalization of Information Retrieval - A Position Paper

Two key, but usually ignored, issues for the evaluation of methods of pe...
research
05/09/2022

Re-thinking Knowledge Graph Completion Evaluation from an Information Retrieval Perspective

Knowledge graph completion (KGC) aims to infer missing knowledge triples...
research
05/03/2021

SmoothI: Smooth Rank Indicators for Differentiable IR Metrics

Information retrieval (IR) systems traditionally aim to maximize metrics...
research
07/27/2023

On (Normalised) Discounted Cumulative Gain as an Offline Evaluation Metric for Top-n Recommendation

Approaches to recommendation are typically evaluated in one of two ways:...
