Improving Lemmatization of Non-Standard Languages with Joint Learning

03/16/2019
by   Enrique Manjavacas, et al.
0

Lemmatization of standard languages is concerned with (i) abstracting over morphological differences and (ii) resolving token-lemma ambiguities of inflected words in order to map them to a dictionary headword. In the present paper we aim to improve lemmatization performance on a set of non-standard historical languages in which the difficulty is increased by an additional aspect (iii): spelling variation due to lacking orthographic standards. We approach lemmatization as a string-transduction task with an encoder-decoder architecture which we enrich with sentence context information using a hierarchical sentence encoder. We show significant improvements over the state-of-the-art when training the sentence encoder jointly for lemmatization and language modeling. Crucially, our architecture does not require POS or morphological annotations, which are not always available for historical corpora. Additionally, we also test the proposed model on a set of typologically diverse standard languages showing results on par or better than a model without enhanced sentence representations and previous state-of-the-art systems. Finally, to encourage future work on processing of non-standard varieties, we release the dataset of non-standard languages underlying the present study, based on openly accessible sources.

READ FULL TEXT
research
06/02/2016

Single-Model Encoder-Decoder with Explicit Morphological Representation for Reinflection

Morphological reinflection is the task of generating a target form given...
research
11/23/2016

Learning Generic Sentence Representations Using Convolutional Neural Networks

We propose a new encoder-decoder approach to learn distributed sentence ...
research
10/21/2020

LemMED: Fast and Effective Neural Morphological Analysis with Short Context Windows

We present LemMED, a character-level encoder-decoder for contextual morp...
research
10/28/2017

Exploring Asymmetric Encoder-Decoder Structure for Context-based Sentence Representation Learning

Context information plays an important role in human language understand...
research
06/13/2019

A Computational Analysis of Natural Languages to Build a Sentence Structure Aware Artificial Neural Network

Natural languages are complexly structured entities. They exhibit charac...
research
07/04/2023

Transformed Protoform Reconstruction

Protoform reconstruction is the task of inferring what morphemes or word...
research
01/28/2021

Enhancing Sequence-to-Sequence Neural Lemmatization with External Resources

We propose a novel hybrid approach to lemmatization that enhances the se...

Please sign up or login with your details

Forgot password? Click here to reset