MuLER: Detailed and Scalable Reference-based Evaluation

05/24/2023
by Taelin Karidi, et al.

We propose a novel methodology (namely, MuLER) that transforms any reference-based evaluation metric for text generation, such as machine translation (MT), into a fine-grained analysis tool. Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types (e.g., errors in translating names of locations). MuLER thus enables a detailed error analysis that can guide targeted improvement efforts for specific phenomena. We perform experiments in both synthetic and naturalistic settings to validate MuLER and to showcase its usability in MT evaluation as well as in other tasks, such as summarization. Analyzing all submissions to WMT in 2014-2020, we find consistent trends. For example, nouns and verbs are among the most frequent POS tags, yet they are among the hardest to translate. Performance on most POS tags improves with overall system performance, but a few are not thus correlated, and their identity changes from language to language. Preliminary experiments with summarization reveal similar trends.
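The abstract does not spell out MuLER's exact decomposition, so the following is only a rough, hypothetical illustration of the general idea of attributing a reference-based metric's penalty to one error category, not MuLER's actual algorithm. It uses a toy unigram-F1 metric and the (simplifying, assumed) one-to-one word alignment given by `zip`; the helper names `unigram_f1` and `penalty_for_category` are made up for this sketch.

```python
from collections import Counter

def unigram_f1(hyp: str, ref: str) -> float:
    """Toy reference-based metric: unigram F1 between hypothesis and reference."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def penalty_for_category(hyp: str, ref: str, category_tokens: set) -> float:
    """Illustrative decomposition: how much of the metric's penalty is
    attributable to tokens of one category (e.g., location names)?

    Score the hypothesis as-is, then again with category tokens "repaired"
    to their aligned reference counterparts; the score gap is the penalty
    the metric assigns to errors of that category.
    """
    # Assumes hypothesis and reference are word-aligned position by position
    # (a deliberate simplification for this toy example).
    repaired = [
        r if r in category_tokens else h
        for h, r in zip(hyp.split(), ref.split())
    ]
    base = unigram_f1(hyp, ref)
    fixed = unigram_f1(" ".join(repaired), ref)
    return fixed - base
```

For instance, with the misspelled location name in `"the train leaves Parris at noon"` against the reference `"the train leaves Paris at noon"` and `category_tokens={"Paris"}`, the entire metric shortfall is attributed to the location-name category, since repairing that one token recovers a perfect score.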


