Albanian Language Identification in Text Documents

01/14/2019
by Klesti Hoxha, et al.

In this work we investigate the accuracy of standard and state-of-the-art language identification methods in identifying Albanian in written text documents. A dataset consisting of news articles written in Albanian has been constructed for this purpose. We noticed a considerable decrease in accuracy when using test documents that lack the Albanian alphabet letters "Ë" and "Ç", and created a custom training corpus that solved this problem, achieving an accuracy of more than 99%. The best-performing language identification methods for Albanian use a naïve Bayes classifier and n-gram based classification features.
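As a rough illustration of the kind of method the abstract refers to, the following is a minimal sketch of a naïve Bayes classifier over character n-grams, written with the Python standard library only. It is not the authors' implementation. The class `NgramNaiveBayes`, the helper `strip_albanian_diacritics`, the toy sentences, the trigram size, the uniform class priors, and the idea of augmenting the training data with copies in which "ë" and "ç" are replaced by "e" and "c" are all assumptions made for this example; the abstract does not say how the custom training corpus was built.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def strip_albanian_diacritics(text):
    """Hypothetical helper: mimic documents typed without the letters ë and ç."""
    return text.translate(str.maketrans("ëçËÇ", "ecEC"))


class NgramNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing and uniform class priors."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}   # label -> Counter of n-gram frequencies
        self.totals = {}   # label -> total number of n-grams seen
        self.vocab = set()

    def fit(self, samples):
        for text, label in samples:
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, text):
        vocab_size = len(self.vocab)

        def log_likelihood(label):
            counts, total = self.counts[label], self.totals[label]
            return sum(
                math.log((counts[g] + 1) / (total + vocab_size))
                for g in char_ngrams(text, self.n)
            )

        return max(self.counts, key=log_likelihood)


# Toy training data (assumed for illustration, far too small for real use).
albanian = ["Unë jam nga Shqipëria dhe flas shqip", "Çfarë po bën sot në shtëpi"]
english = ["I am from Albania and I speak Albanian", "What are you doing at home today"]

samples = [(s, "sq") for s in albanian]
# Assumed augmentation: also train on copies without ë and ç.
samples += [(strip_albanian_diacritics(s), "sq") for s in albanian]
samples += [(s, "en") for s in english]

model = NgramNaiveBayes(n=3).fit(samples)
print(model.predict("Une flas shqip ne shtepi"))
print(model.predict("What are you doing today"))
```

On a realistic corpus the same structure applies, but the n-gram order, smoothing, and class priors would need to be tuned and evaluated on held-out documents.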
