PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation

by   Alireza Salemi, et al.

Multilingual pre-training significantly improves many multilingual NLP tasks, including machine translation. Most existing methods are based on some variants of masked language modeling and text-denoising objectives on monolingual data. Multilingual pre-training on monolingual data ignores the availability of parallel data in many language pairs. Also, some other works integrate the available human-generated parallel translation data in their pre-training. This kind of parallel data is definitely helpful, but it is limited even in high-resource language pairs. This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training. First, a denoising model is pre-trained on monolingual data to reorder, add, remove, and substitute words, enhancing the pre-training documents' quality. Then, we generate different pseudo-translations for each pre-training document using dictionaries for word-by-word translation and applying the pre-trained denoising model. The resulting pseudo-parallel data is then used to pre-train our multilingual sequence-to-sequence model, PEACH. Our experiments show that PEACH outperforms existing approaches used in training mT5 and mBART on various translation tasks, including supervised, zero- and few-shot scenarios. Moreover, PEACH's ability to transfer knowledge between similar languages makes it particularly useful for low-resource languages. Our results demonstrate that with high-quality dictionaries for generating accurate pseudo-parallel, PEACH can be valuable for low-resource languages.


page 1

page 2

page 3

page 4


Multilingual Denoising Pre-training for Neural Machine Translation

This paper demonstrates that multilingual denoising pre-training produce...

Parameter and Data Efficient Continual Pre-training for Robustness to Dialectal Variance in Arabic

The use of multilingual language models for tasks in low and high-resour...

Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages

We show that unsupervised sequence-segmentation performance can be trans...

Denoising-based UNMT is more robust to word-order divergence than MASS-based UNMT

We aim to investigate whether UNMT approaches with self-supervised pre-t...

JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation

Neural machine translation (NMT) needs large parallel corpora for state-...

Meta Back-translation

Back-translation is an effective strategy to improve the performance of ...

mmT5: Modular Multilingual Pre-Training Solves Source Language Hallucinations

Multilingual sequence-to-sequence models perform poorly with increased l...

Please sign up or login with your details

Forgot password? Click here to reset