Harvesting comparable corpora and mining them for equivalent bilingual sentences using statistical classification and analogy-based heuristics

11/18/2015
by   Krzysztof Wołk, et al.

Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores our new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical because non-parallel multilingual data exist in far greater quantities than parallel corpora, yet parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned comparable corpora from sources such as Wikipedia dumps and the Euronews web page. The improvements in machine translation are shown on the Polish-English language pair for various text domains. We also tested another method of building parallel corpora based on comparable corpora data. It allows an existing corpus of sentences to be broadened automatically from the subject of the corpora, based on analogies between them.

