Impact of Corpora Quality on Neural Machine Translation

10/19/2018
by Matīss Rikters, et al.

Large parallel corpora that are automatically obtained from the web, documents or elsewhere often contain many corrupted parts that are bound to negatively affect the quality of the systems and models that learn from these corpora. This paper describes frequent problems found in such data and how they affect neural machine translation systems, as well as how to identify and deal with them. The solutions are summarised in a set of scripts that remove problematic sentences from input corpora.
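
To make the idea of removing problematic sentences concrete, the following is a minimal sketch of a parallel-corpus filter. It is not the authors' scripts: the function names `keep_pair` and `filter_corpus`, the chosen checks (empty sides, identical source and target, overlong sentences, extreme length ratio, exact duplicates) and the thresholds `max_ratio` and `max_tokens` are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple


def keep_pair(src: str, tgt: str,
              max_ratio: float = 3.0, max_tokens: int = 100) -> bool:
    """Return True if a sentence pair passes a few illustrative checks.

    The checks and thresholds are hypothetical examples, not the
    filters described in the paper.
    """
    src, tgt = src.strip(), tgt.strip()
    # Drop pairs where either side is empty.
    if not src or not tgt:
        return False
    # Drop pairs where the "translation" is a copy of the source.
    if src == tgt:
        return False
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # Drop overly long sentences (whitespace tokens).
    if n_src > max_tokens or n_tgt > max_tokens:
        return False
    # Drop pairs whose lengths differ too much, a common sign of misalignment.
    if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
        return False
    return True


def filter_corpus(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield stripped pairs that pass keep_pair, skipping exact duplicates."""
    seen = set()
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:
            continue
        seen.add(key)
        if keep_pair(*key):
            yield key
```

A filter of this kind would typically be run once over the source and target files before training, with thresholds tuned to the language pair and the corpus at hand.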

research
04/08/2021

Extended Parallel Corpus for Amharic-English Machine Translation

This paper describes the acquisition, preprocessing, segmentation, and a...
research
05/31/2018

On the Impact of Various Types of Noise on Neural Machine Translation

We examine how various types of noise in the parallel training data impa...
research
07/11/2023

Neural Machine Translation Data Generation and Augmentation using ChatGPT

Neural models have revolutionized the field of machine translation, but ...
research
10/16/2018

Multi-Source Neural Machine Translation with Data Augmentation

Multi-source translation systems translate from multiple languages to a ...
research
04/05/2020

Incorporating Bilingual Dictionaries for Low Resource Semi-Supervised Neural Machine Translation

We explore ways of incorporating bilingual dictionaries to enable semi-s...
research
11/03/2017

Towards Neural Machine Translation with Partially Aligned Corpora

While neural machine translation (NMT) has become the new paradigm, the ...
research
05/24/2018

Fast Neural Machine Translation Implementation

This paper describes the submissions to the efficiency track for GPUs by...
