BPE and CharCNNs for Translation of Morphology: A Cross-Lingual Comparison and Analysis

09/05/2018
by Pamela Shapiro, et al.

Neural Machine Translation (NMT) in low-resource settings, and of morphologically rich languages, is made difficult in part by vocabulary sparsity. Several methods have been used to reduce this sparsity, notably Byte-Pair Encoding (BPE) and a character-based CNN layer (charCNN). However, the charCNN has largely been neglected, possibly because it has only been compared to BPE rather than combined with it. We argue for a reconsideration of the charCNN, based on cross-lingual improvements on low-resource data. We translate from 8 languages into English, using a multi-way parallel collection of TED transcripts. We find that in most cases using both BPE and a charCNN performs best, while for Hebrew a charCNN over words performs best.
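
As a rough illustration of how BPE reduces vocabulary sparsity, the following is a minimal sketch of BPE merge learning and segmentation on a toy word-frequency table. The function names `learn_bpe` and `apply_bpe`, the `</w>` end-of-word marker, the tie-breaking rule, and the toy corpus are assumptions made for illustration; they are not taken from the paper and do not reflect its experimental setup.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: frequency} dict."""
    # Start from characters, plus an end-of-word marker (an illustrative convention).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken lexicographically
        # only to keep this sketch deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            key = tuple(merge_pair(list(symbols), best))
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical toy corpus: word -> frequency.
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_freqs, 10)
print(apply_bpe("lowest", toy_merges))
```

An unseen word such as "lowest" is segmented into subword units learned from other words, so it never falls outside the vocabulary. A charCNN instead builds a word (or subword) embedding from its characters, which is why the two approaches can be combined rather than only compared.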


