Diacritics Restoration using BERT with Analysis on Czech language

05/24/2021
by Jakub Náplava, et al.

We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are actually not errors, but either plausible variants (19%) or corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We release the code at https://github.com/ufal/bert-diacritics-restoration.
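To make the task concrete, the following is a minimal sketch of how diacritics restoration data can be set up and scored. It is not taken from the released code. The helper names `strip_diacritics` and `word_accuracy` are hypothetical, and plain word-level accuracy is assumed here for illustration only; it may differ from the metric the paper reports. The sketch relies on Unicode NFD decomposition, which covers Czech but not letters that lack a decomposition (for example Polish "ł").

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks to produce the undiacritized model input."""
    # NFD splits a letter such as "č" into "c" plus a combining caron.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def word_accuracy(gold: str, predicted: str) -> float:
    """Fraction of whitespace-separated words restored exactly (assumed metric)."""
    gold_words = gold.split()
    pred_words = predicted.split()
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted must contain the same number of words")
    if not gold_words:
        return 1.0
    correct = sum(g == p for g, p in zip(gold_words, pred_words))
    return correct / len(gold_words)


# A (input, target) training pair is obtained by stripping the gold text.
target = "Příliš žluťoučký kůň"
source = strip_diacritics(target)
```

A restoration system would take `source` as input and be scored against `target`. As the error analysis in the abstract suggests, a mismatch under such a metric is not always a real error, since the prediction may be a plausible variant or a correction of faulty gold data.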
