Neural Machine Translation with Byte-Level Subwords

09/07/2019
by Changhan Wang, et al.

Almost all existing machine translation models are built on top of character-based vocabularies: characters, subwords or words. However, rare characters from noisy text or from character-rich languages such as Japanese and Chinese can unnecessarily take up vocabulary slots and limit the vocabulary's compactness. Representing text at the byte level and using the 256-byte set as the vocabulary is a potential solution to this issue, but its high computational cost has so far prevented it from being widely deployed or used in practice. In this paper, we investigate byte-level subwords, specifically byte-level BPE (BBPE), which is more compact than a character vocabulary and has no out-of-vocabulary tokens, yet is more efficient than using pure bytes alone. We claim that contextualizing BBPE embeddings is necessary, which can be implemented with a convolutional or recurrent layer. Our experiments show that BBPE performs comparably to BPE while its vocabulary is only 1/8 the size. In the multilingual setting, BBPE maximizes vocabulary sharing across many languages and achieves better translation quality. Moreover, we show that BBPE enables transferring models between languages with non-overlapping character sets.
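To make the core idea concrete, here is a minimal sketch (not the paper's implementation, which applies the standard BPE pipeline to UTF-8 byte sequences at scale): every string maps onto the fixed 256-byte base vocabulary, so there are no out-of-vocabulary tokens by construction, and BPE merges then build larger byte-level subword units. The corpus and single merge step below are illustrative toys.

```python
from collections import Counter

def to_bytes(text):
    # Any Unicode string decomposes into UTF-8 bytes, each in [0, 256):
    # the 256-byte set covers all text with zero OOV tokens.
    return list(text.encode("utf-8"))

def most_frequent_pair(sequences):
    # Count adjacent token pairs across the corpus, as in standard BPE.
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(seq, pair, new_id):
    # Replace every occurrence of `pair` with a single new token id,
    # growing the vocabulary beyond the 256 base bytes.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Toy corpus mixing a character-rich language with Latin script.
corpus = ["字节", "字符", "low lower"]
seqs = [to_bytes(t) for t in corpus]
pair = most_frequent_pair(seqs)        # most frequent adjacent byte pair
seqs = [merge_pair(s, pair, 256) for s in seqs]  # first merged subword: id 256
```

Note that a byte-level subword such as token 256 here can span a fragment of a multi-byte character, which is one motivation for the paper's claim that BBPE embeddings need contextualization (e.g. a convolutional or recurrent layer) before the transformer stack.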

Related research

Korean-to-Chinese Machine Translation using Chinese Character as Pivot Clue (11/25/2019)
Korean-Chinese is a low resource language pair, but Korean and Chinese h...

On the Importance of Word Boundaries in Character-level Neural Machine Translation (10/15/2019)
Neural Machine Translation (NMT) models generally perform translation us...

Noisy UGC Translation at the Character Level: Revisiting Open-Vocabulary Capabilities and Robustness of Char-Based Models (10/24/2021)
This work explores the capacities of character-based Neural Machine Tran...

Patching Leaks in the Charformer for Efficient Character-Level Generation (05/27/2022)
Character-based representations have important advantages over subword-b...

Revisiting Neural Language Modelling with Syllables (10/24/2020)
Language modelling is regularly analysed at word, subword or character u...

Beyond Shared Vocabulary: Increasing Representational Word Similarities across Languages for Multilingual Machine Translation (05/23/2023)
Using a shared vocabulary is common practice in Multilingual Neural Mach...

Robust Open-Vocabulary Translation from Visual Text Representations (04/16/2021)
Machine translation models have discrete vocabularies and commonly use s...
