NEJM-enzh: A Parallel Corpus for English-Chinese Translation in the Biomedical Domain

05/18/2020
by Boxiang Liu, et al.

Machine translation requires large amounts of parallel text. While such datasets are abundant in domains such as newswire, they are less accessible in the biomedical domain. Chinese and English are two of the most widely spoken languages, yet to our knowledge no parallel corpus in the biomedical domain exists for this language pair. In this study, we develop an effective pipeline to acquire and process an English-Chinese parallel corpus from the New England Journal of Medicine (NEJM), consisting of about 100,000 sentence pairs and 3,000,000 tokens on each side. We show that training on out-of-domain data and then fine-tuning with as few as 4,000 NEJM sentence pairs improves translation quality by 25.3 BLEU for en→zh and 13.4 BLEU for zh→en. Translation quality continues to improve, at a slower pace, with larger in-domain datasets, reaching an increase of 33.0 BLEU for en→zh and 24.3 BLEU for zh→en on the full dataset.


