Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings

by Mikel Artetxe, et al.

Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we propose a new method for this task based on multilingual sentence embeddings. Our approach uses an encoder-decoder trained over an initial parallel corpus to build multilingual sentence representations, which are then incorporated into a new margin-based method to score, mine and filter parallel sentences. In contrast to previous approaches, which rely on nearest neighbor retrieval with a hard threshold over cosine similarity, our proposed method accounts for the scale inconsistencies of this measure, considering the margin between a given sentence pair and its closest candidates instead. Our experiments show large improvements over existing methods. We outperform the best published results on the BUCC shared task on parallel corpus mining by more than 10 F1 points. We also improve the precision from 48.9 to 83.3 on the reconstruction of 11.3M English-French sentence pairs of the UN corpus. Finally, filtering the English-German ParaCrawl corpus with our approach, we obtain 31.2 BLEU points on newstest2014, an improvement of more than one point over the best official filtered version.
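The margin idea can be illustrated with a minimal sketch. The abstract does not spell out the scoring function, so the form used here is an assumption: the cosine similarity of a candidate pair is divided by the average cosine similarity of each sentence to its k nearest neighbors in the other language. The function names, the value of k, the threshold, and the toy vectors are all hypothetical, and only forward (source-to-target) retrieval is shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_scores(src, tgt, k=2):
    """Score every (source, target) pair relative to its neighborhood.

    Assumed form: cos(x, y) divided by the mean of
    (average cosine of x to its k nearest targets) and
    (average cosine of y to its k nearest sources).
    Assumes those averages are positive.
    """
    sim = [[cosine(x, y) for y in tgt] for x in src]
    # Average similarity of each source sentence to its k closest targets.
    src_avg = [sum(sorted(row, reverse=True)[:k]) / k for row in sim]
    # Average similarity of each target sentence to its k closest sources.
    tgt_avg = [sum(sorted(col, reverse=True)[:k]) / k for col in zip(*sim)]
    return [
        [sim[i][j] / ((src_avg[i] + tgt_avg[j]) / 2) for j in range(len(tgt))]
        for i in range(len(src))
    ]


def mine(src, tgt, k=2, threshold=1.0):
    """For each source sentence, keep its best target if the margin score passes."""
    scores = margin_scores(src, tgt, k)
    pairs = []
    for i, row in enumerate(scores):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real sentence embeddings are much larger.
    src = [[1.0, 0.0], [0.0, 1.0]]
    tgt = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
    for i, j, score in mine(src, tgt):
        print(f"source {i} -> target {j} (margin score {score:.3f})")
```

Because each score is normalized by the local neighborhood, a single threshold behaves more consistently across sentences than a hard threshold on raw cosine similarity, which is the scale inconsistency the abstract refers to.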




Low-Resource Corpus Filtering using Multilingual Sentence Embeddings

In this paper, we describe our submission to the WMT19 low-resource para...

Filtering and Mining Parallel Data in a Joint Multilingual Space

We learn a joint multilingual sentence embedding and use the distance be...

Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings

We describe an unsupervised method to create pseudo-parallel corpora for...

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB

We show that margin-based bitext mining in a multilingual sentence space...

Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax

In this paper, we present an approach to learn multilingual sentence emb...

Hierarchical Document Encoder for Parallel Corpus Mining

We explore using multilingual document embeddings for nearest neighbor m...

Sentence Simplification Using Paraphrase Corpus for Initialization

Neural sentence simplification method based on sequence-to-sequence fram...
