Toxicity in Multilingual Machine Translation at Scale

10/06/2022
by Marta R. Costa-Jussà et al.

Machine Translation systems can produce different types of errors, some of which are characterized as critical or catastrophic due to the specific negative impact they can have on users. Automatic or human evaluation metrics do not necessarily differentiate between such critical errors and more innocuous ones. In this paper we focus on one type of critical error: added toxicity. We evaluate and analyze added toxicity when translating a large evaluation dataset (HOLISTICBIAS, over 472k sentences, covering 13 demographic axes) from English into 164 languages. The automatic toxicity evaluation shows that the amount of added toxicity varies across languages, starting from 0%; the languages with the most added toxicity tend to be low-resource ones, and the demographic axes with the most added toxicity include sexual orientation, gender and sex, and ability. We also perform human evaluation on a subset of 8 translation directions, confirming the prevalence of true added toxicity. We use a measurement of the amount of source contribution to the translation, where a low source contribution implies hallucination, to interpret what causes toxicity. We observe that the source contribution is somewhat correlated with toxicity, but that 45.6% of the added toxicity cases have a high source contribution, suggesting that much of the added toxicity may be due to mistranslations. Combining the signal of source contribution level with a measurement of translation robustness allows us to flag 22.3% of the added toxicity cases, suggesting that added toxicity may be related to both hallucination and the stability of translations in different contexts. Given these findings, our recommendations to reduce added toxicity are to curate training data to avoid mistranslations, mitigate hallucination, and check unstable translations.
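To make the evaluation logic concrete, the Python sketch below shows one way to flag added toxicity with word lists and to interpret it with a source-contribution score. The word lists, the source_contribution value, and the 0.4 threshold are hypothetical placeholders introduced here for illustration; the paper relies on its own per-language toxicity lists and model-internal attribution measures, not on this exact code.

from typing import Set

def tokenize(text: str) -> Set[str]:
    # Lowercased whitespace tokenization; real systems need language-specific tokenizers.
    return set(text.lower().split())

def has_toxic_term(tokens: Set[str], toxic_terms: Set[str]) -> bool:
    # True if any token appears in the (hypothetical) toxicity word list.
    return bool(tokens & toxic_terms)

def is_added_toxicity(source: str, translation: str,
                      src_toxic_terms: Set[str], tgt_toxic_terms: Set[str]) -> bool:
    # "Added toxicity": the translation contains toxic terms although the source does not.
    return (not has_toxic_term(tokenize(source), src_toxic_terms)
            and has_toxic_term(tokenize(translation), tgt_toxic_terms))

def interpret_cause(source_contribution: float, low_threshold: float = 0.4) -> str:
    # Mirrors the abstract's reasoning: a low source contribution points to hallucination,
    # a high one points to a mistranslation. The threshold is an arbitrary illustration.
    if source_contribution < low_threshold:
        return "possible hallucination"
    return "possible mistranslation"

if __name__ == "__main__":
    # Toy usage with placeholder word lists and an assumed attribution score.
    src = "I have friends who are wheelchair users."
    hyp = "..."  # a toxic translation is left unspecified here
    if is_added_toxicity(src, hyp, src_toxic_terms={"badword_en"}, tgt_toxic_terms={"badword_xx"}):
        print(interpret_cause(source_contribution=0.7))

The sketch only reproduces the decision logic described in the abstract (toxic terms present in the output but not in the input, interpreted through the source-contribution level); it is not the authors' detector or attribution tooling.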

