Exploiting Sentence Order in Document Alignment
In this work, we exploit the simple idea that a document and its translation should contain approximately the same information, in approximately the same order. We propose methods for both document pair candidate generation and candidate re-scoring which incorporate high-level order information. Our method results in a 61% relative reduction in error compared to the best previously published result on the WMT16 document alignment shared task. We also apply our method to web-scraped Sinhala-English documents from ParaCrawl and find that it improves MT performance by 1.2 BLEU over the current ParaCrawl document alignment method.