LENS: A Learnable Evaluation Metric for Text Simplification

12/19/2022
by   Mounica Maddela, et al.

Training learnable metrics using modern language models has recently emerged as a promising method for the automatic evaluation of machine translation. However, existing human evaluation datasets in text simplification are unsuitable for this approach: they contain too few annotations, cover only a single type of simplification, and are based on outdated systems. To address these issues, we introduce the SIMPEVAL corpus, which contains: SIMPEVAL_ASSET, comprising 12K human ratings on 2.4K simplifications from 24 systems, and SIMPEVAL_2022, a challenging simplification benchmark consisting of over 1K human ratings of 360 simplifications, including generations from GPT-3.5. Training on SIMPEVAL_ASSET, we present LENS, a Learnable Evaluation Metric for Text Simplification. Extensive empirical results show that LENS correlates better with human judgment than existing metrics, paving the way for future progress in the evaluation of text simplification. To create the SIMPEVAL datasets, we introduce RANK & RATE, a human evaluation framework that rates simplifications from several models in a list-wise manner via an interactive interface, ensuring both consistency and accuracy in the evaluation process. Our metric, datasets, and annotation toolkit are available at https://github.com/Yao-Dou/LENS.
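Since the abstract points readers to the released toolkit, the sketch below shows how a learnable metric like LENS is typically applied and then meta-evaluated against human judgments. This is a minimal sketch, not the authors' verified interface: the `lens` package name, the `download_model` helper, the `LENS(...)` constructor with its `rescale` flag, and the `score(...)` signature are assumptions based on the repository's README, and all sentences and ratings are illustrative dummy data. Check https://github.com/Yao-Dou/LENS for the actual API.

```python
# Minimal sketch: score simplifications with a learnable metric and
# measure agreement with human ratings. The `lens` API used here
# (download_model, LENS, LENS.score) is an ASSUMPTION based on the
# project's README; verify against https://github.com/Yao-Dou/LENS.
from lens import LENS, download_model
from scipy.stats import kendalltau

model_path = download_model("davidheineman/lens")  # checkpoint name: assumption
metric = LENS(model_path, rescale=True)            # rescale flag: assumption

# One source sentence, three candidate simplifications from different
# systems, and a human-written reference (all illustrative data).
sources = ["The incident resulted in the hospitalization of three individuals."] * 3
outputs = [
    "The incident led to three people being hospitalized.",
    "Three people went to the hospital after the incident.",
    "The incident resulted in the hospitalization of three individuals.",  # unchanged
]
references = [["Three people were sent to the hospital."]] * 3

# LENS is reference-based: each output is scored against its source
# sentence and one or more human references.
scores = metric.score(sources, outputs, references)

# Meta-evaluation: correlate metric scores with human ratings
# (dummy values here) using Kendall's tau.
human_ratings = [70.0, 85.0, 20.0]
tau, p_value = kendalltau(scores, human_ratings)
print(f"LENS scores: {scores}, Kendall tau vs. humans: {tau:.2f}")
```

A rank correlation such as Kendall's tau is a natural choice here: list-wise human ratings of the kind collected with RANK & RATE induce a ranking over system outputs, so a metric is judged by how well its scores reproduce that ordering.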

Related research

08/21/2021 · CushLEPOR: Customised hLEPOR Metric Using LABSE Distilled Knowledge Model to Improve Agreement with Human Judgements
Human evaluation has always been expensive while researchers struggle to...

12/12/2022 · T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
Modern embedding-based metrics for evaluation of generated text generall...

01/25/2022 · The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization
We present a novel benchmark and associated evaluation metrics for asses...

05/23/2023 · Dancing Between Success and Failure: Edit-level Simplification Evaluation using SALSA
Large language models (e.g., GPT-3.5) are uniquely capable of producing ...

07/30/2023 · Do LLMs Possess a Personality? Making the MBTI Test an Amazing Evaluation for Large Language Models
The field of large language models (LLMs) has made significant progress,...

03/11/2022 · Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons
Recent studies have shown the advantages of evaluating NLG systems using...

03/14/2023 · Eliciting Latent Predictions from Transformers with the Tuned Lens
We analyze transformers from the perspective of iterative inference, see...
