How big is big enough? Unsupervised word sense disambiguation using a very large corpus

10/22/2017
by   Piotr Przybyła, et al.

In this paper, the problem of disambiguating a target word for Polish is approached by searching for related words with known meaning. These relatives are used to build a training corpus from unannotated text. This technique is improved by proposing new rich sources of replacements that replace the traditional requirement of monosemy with heuristics based on wordnet relations. The naïve Bayesian classifier has been modified to account for an unknown distribution of senses. A corpus of 600 million web documents (594 billion tokens), gathered by the NEKST search engine, allows us to assess the relationship between training set size and disambiguation accuracy. The classifier is evaluated using both a wordnet baseline and a corpus with 17,314 manually annotated occurrences of 54 ambiguous words.
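
The relatives-based approach summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the classifier was modified, so the sketch simply assumes a uniform prior over senses (no sense-frequency term) with Laplace smoothing. The English toy corpus, the sense labels, and the function names `build_training_set`, `train`, and `classify` are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_training_set(corpus, relatives):
    """Collect contexts of relatives as pseudo-labelled examples.

    corpus: list of tokenized sentences.
    relatives: dict mapping a sense label to a set of replacement words.
    """
    training = defaultdict(list)
    for sentence in corpus:
        for sense, words in relatives.items():
            for i, token in enumerate(sentence):
                if token in words:
                    # Keep the context, drop the relative itself.
                    training[sense].append(sentence[:i] + sentence[i + 1:])
    return training

def train(training):
    """Count context words per sense and build the shared vocabulary."""
    counts = {s: Counter(t for ctx in ctxs for t in ctx)
              for s, ctxs in training.items()}
    vocab = set().union(*counts.values())
    return counts, vocab

def classify(context, counts, vocab, alpha=1.0):
    """Naive Bayes scoring with a uniform sense prior (an assumption)."""
    best_sense, best_score = None, -math.inf
    for sense, c in counts.items():
        total = sum(c.values())
        score = 0.0  # no log P(sense) term: sense distribution treated as unknown
        for token in context:
            if token in vocab:
                score += math.log((c[token] + alpha) / (total + alpha * len(vocab)))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Hypothetical toy data: two senses of "bank", one relative each.
relatives = {"bank_finance": {"lender"}, "bank_river": {"shore"}}
corpus = [
    ["the", "lender", "approved", "the", "loan"],
    ["the", "lender", "raised", "interest", "rates"],
    ["we", "walked", "along", "the", "shore", "of", "the", "river"],
    ["the", "shore", "was", "muddy", "after", "the", "flood"],
]
counts, vocab = train(build_training_set(corpus, relatives))
print(classify(["loan", "interest"], counts, vocab))
print(classify(["river", "flood"], counts, vocab))
```

In this sketch no sentence containing the ambiguous word itself is ever labelled by hand; the labels come entirely from the relatives, which is why the size of the unannotated corpus matters.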


