Meta-evaluation of Conversational Search Evaluation Metrics

04/27/2021
by Zeyang Liu, et al.

Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems in multiple rounds through natural language dialogues. Evaluating such systems is very challenging, given that arbitrary natural language responses can be generated and that users commonly interact over multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those measures effectively capture user preference remains to be investigated. In this paper, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect "actual" performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture any property deemed important: adequacy, informativeness, and fluency in the context of conversational search. Through experiments on two test collections, we find that the performance of different metrics varies significantly across scenarios, and that, consistent with prior studies, existing metrics achieve only a weak correlation with ultimate user preference and satisfaction. Considering all three perspectives, METEOR is, comparatively speaking, the best existing single-turn metric. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation of conversational search to date.
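For readers who want a concrete sense of the single-turn metrics being compared, the sketch below scores hypothetical system responses against references with NLTK's METEOR implementation, the metric the study finds strongest among single-turn measures. It is a minimal illustration, not the authors' evaluation pipeline: the response/reference pairs are invented.

```python
# Minimal sketch: scoring single-turn responses with METEOR via NLTK.
# The (hypothesis, reference) pairs are invented for illustration.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching

pairs = [
    ("the store opens at 9 am on weekdays",
     "on weekdays the store opens at 9 in the morning"),
    ("i could not find an answer to that",
     "sorry, no answer was found for your question"),
]

for hyp, ref in pairs:
    # NLTK >= 3.6.6 expects pre-tokenized hypotheses and references.
    score = meteor_score([ref.split()], hyp.split())
    print(f"METEOR = {score:.3f}  |  {hyp!r}")
```

Likewise, the multi-turn finding rests on session-based metrics adapted to conversations. The sketch below is in the spirit of session DCG (sDCG), which discounts each turn's gain by its position in the session; the discount base and the per-turn gains are assumptions for illustration, not the paper's exact adaptation.

```python
# Minimal sketch of a session-style metric for a multi-turn conversation:
# later turns contribute less via a logarithmic position discount.
import math

def sdcg(turn_gains, bq=4.0):
    """Sum each turn's gain divided by a log discount on its turn position."""
    return sum(
        gain / (1.0 + math.log(j, bq))
        for j, gain in enumerate(turn_gains, start=1)
    )

# Hypothetical per-turn relevance gains for a four-turn conversation.
print(f"sDCG = {sdcg([2.0, 1.0, 3.0, 0.0]):.3f}")
```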

Related research

09/07/2021
POSSCORE: A Simple Yet Effective Evaluation of Conversational Search with Part of Speech Labelling
Conversational search systems, such as Google Assistant and Microsoft Co...

02/01/2018
Correlation and Prediction of Evaluation Metrics in Information Retrieval
Because researchers typically do not have the time or space to present m...

04/17/2022
Evaluating Mixed-initiative Conversational Search Systems via User Simulation
Clarifying the underlying user information need by asking clarifying que...

05/28/2023
ConvGenVisMo: Evaluation of Conversational Generative Vision Models
Conversational generative vision models (CGVMs) like Visual ChatGPT (Wu ...

04/25/2022
Offline Retrieval Evaluation Without Evaluation Metrics
Offline evaluation of information retrieval and recommendation has tradi...

06/02/2020
Quantifying the Effects of Prosody Modulation on User Engagement and Satisfaction in Conversational Systems
As voice-based assistants such as Alexa, Siri, and Google Assistant beco...

07/20/2023
Learning and Evaluating Human Preferences for Conversational Head Generation
A reliable and comprehensive evaluation metric that aligns with manual p...
