SART - Similarity, Analogies, and Relatedness for Tatar Language: New Benchmark Datasets for Word Embeddings Evaluation

03/31/2019
by   Albina Khusainova, et al.
0

There is a huge imbalance between languages currently spoken and corresponding resources to study them. Most of the attention naturally goes to the "big" languages: those which have the largest presence in terms of media and number of speakers. Other less represented languages sometimes do not even have a good quality corpus to study them. In this paper, we tackle this imbalance by presenting a new set of evaluation resources for Tatar, a language of the Turkic language family which is mainly spoken in Tatarstan Republic, Russia. We present three datasets: Similarity and Relatedness datasets that consist of human scored word pairs and can be used to evaluate semantic models; and Analogies dataset that comprises analogy questions and allows to explore semantic, syntactic, and morphological aspects of language modeling. All three datasets build upon existing datasets for the English language and follow the same structure. However, they are not mere translations. They take into account specifics of the Tatar language and expand beyond the original datasets. We evaluate state-of-the-art word embedding models for two languages using our proposed datasets for Tatar and the original datasets for English and report our findings on performance comparison.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/22/2021

Co-occurrences using Fasttext embeddings for word similarity tasks in Urdu

Urdu is a widely spoken language in South Asia. Though immoderate litera...
research
03/11/2021

Evaluation of Morphological Embeddings for English and Russian Languages

This paper evaluates morphology-based embeddings for English and Russian...
research
07/18/2016

Language classification from bilingual word embedding graphs

We study the role of the second language in bilingual word embeddings in...
research
04/15/2018

Introducing two Vietnamese Datasets for Evaluating Semantic Models of (Dis-)Similarity and Relatedness

We present two novel datasets for the low-resource language Vietnamese t...
research
03/04/2019

Russian Language Datasets in the Digitial Humanities Domain and Their Evaluation with Word Embeddings

In this paper, we present Russian language datasets in the digital human...
research
06/10/2022

Teacher Perception of Automatically Extracted Grammar Concepts for L2 Language Learning

One of the challenges of language teaching is how to organize the rules ...

Please sign up or login with your details

Forgot password? Click here to reset