Towards Massively Multi-domain Multilingual Readability Assessment

05/23/2023
by Tarek Naous, et al.

We present ReadMe++, a massively multi-domain multilingual dataset for automatic readability assessment. Prior work on readability assessment has been mostly restricted to English and to one or two text domains. Additionally, many previous datasets assign readability levels at the document level rather than the sentence level, so every sentence inherits the label of its source document, which casts doubt on the quality of previous evaluations. We address these gaps by providing an annotated dataset of 6,330 sentences in Arabic, English, and Hindi collected from 64 different text domains. Unlike previous datasets, ReadMe++ offers greater domain and language diversity and is manually annotated at the sentence level using the Common European Framework of Reference for Languages (CEFR), through a Rank-and-Rate annotation framework that reduces subjectivity in annotation. Our experiments demonstrate that models fine-tuned on ReadMe++ achieve strong cross-lingual transfer and generalize to unseen domains. ReadMe++ will be made publicly available to the research community.
