Is it worth it? Budget-related evaluation metrics for model selection

07/18/2018
by Filip Klubička, et al.

Creating a linguistic resource is often done by using a machine learning model that filters the content that goes through to a human annotator before entering the final resource. However, budgets are often limited, and the amount of available data exceeds the amount of affordable annotation. In order to optimize the benefit from the invested human work, we argue that deciding on which model one should employ depends not only on generalized evaluation metrics such as F-score, but also on the gain metric. The model with the highest F-score may not necessarily rank its predicted classes best, which can lead to wasting funds on annotating false positives that yield zero improvement of the linguistic resource. We exemplify our point with a case study, using real data from the task of building a verb-noun idiom dictionary. We show that, given the choice of three systems with varying F-scores, the system with the highest F-score does not yield the highest profit. In other words, in our case the cost-benefit trade-off is more favorable for a system with a lower F-score.
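The gain argument lends itself to a concrete illustration. Below is a minimal Python sketch (the `gain` function, the rankings, and all numbers are hypothetical assumptions for illustration, not figures or code from the paper) of why, under a fixed annotation budget, a model that ranks true positives near the top of its output can add more entries to the resource than a model with a higher overall F-score but a worse ranking:

```python
# Hypothetical sketch: compare two classifiers under a fixed annotation budget.
# Rankings, budget, and costs are illustrative assumptions, not the paper's data.

def gain(ranked_predictions, budget, cost_per_item=1.0):
    """Count true positives the annotator confirms before the budget runs out.

    ranked_predictions: booleans (True = genuine idiom), ordered by the
    model's confidence; inspecting each candidate costs `cost_per_item`.
    """
    affordable = int(budget // cost_per_item)
    return sum(ranked_predictions[:affordable])

# Model A: higher F-score overall, but its top-ranked candidates contain
# many false positives, so annotation money confirms non-idioms.
model_a_ranking = [True, False, False, True, False, False, True, True, True, True]

# Model B: lower F-score, but it front-loads true positives in its ranking.
model_b_ranking = [True, True, True, True, False, True, False, False, False, False]

budget = 5  # we can only afford to annotate five candidates

print("Model A gain:", gain(model_a_ranking, budget))  # -> 2 new entries
print("Model B gain:", gain(model_b_ranking, budget))  # -> 4 new entries
```

With a budget of five annotations, model A's top-ranked candidates yield two confirmed idioms while model B's yield four, mirroring the paper's point that the cost-benefit trade-off can favor the system with the lower F-score.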

