Diacritics Restoration using BERT with Analysis on Czech language
We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are actually not errors, but either plausible variants (19%) or the system's corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We release the code at https://github.com/ufal/bert-diacritics-restoration.
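To make the task concrete, the following minimal sketch shows how an input/target pair for diacritics restoration could be constructed: the target is the original text, and the input is the same text with diacritics removed. This is an illustration only, not the authors' code; the function name `strip_diacritics` and the sample sentence are assumptions introduced here.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks from text (hypothetical helper, not from the paper).

    Letters with no canonical decomposition (for example Polish "ł")
    are left unchanged by this approach.
    """
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


# The restoration model would receive `source` and be trained to predict `target`.
target = "Příliš žluťoučký kůň úpěl ďábelské ódy."
source = strip_diacritics(target)
print(source)  # Prilis zlutoucky kun upel dabelske ody.
```

A restoration system maps `source` back to `target`. Because several diacritized forms can correspond to one stripped form, the model has to rely on context, which is why some mispredictions turn out to be plausible variants.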