Censorship of Online Encyclopedias: Implications for NLP Models

01/22/2021
by Eddie Yang, et al.

Artificial intelligence provides the backbone for many tools people use around the world, yet recent work has drawn attention to the fact that the algorithms powering AI are not free of politics, stereotypes, and bias. Most work in this area has focused on the ways in which AI can exacerbate existing inequalities and discrimination; very little has studied how governments actively shape training data. We describe how censorship has affected the development of Wikipedia corpora, text data that are regularly used as pre-trained inputs into NLP algorithms. We show that word embeddings trained on Baidu Baike, an online Chinese encyclopedia, have very different associations between adjectives and a range of concepts about democracy, freedom, collective action, equality, and people and historical events in China than embeddings trained on its regularly blocked but uncensored counterpart, Chinese-language Wikipedia. We examine the implications of these discrepancies by studying their use in downstream AI applications. Our paper shows how government repression, censorship, and self-censorship may impact training data and the applications that draw from them.
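
One way to picture the comparison described above is a simple association score: how much closer a concept word sits to a set of positive adjectives than to a set of negative ones, computed separately in each embedding space. The sketch below is an illustration only, not the authors' method. The function names, the word lists, and the two-dimensional toy vectors are all hypothetical placeholders, not data from Baidu Baike or Wikipedia.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(embeddings, target, positive, negative):
    """Mean similarity to positive adjectives minus mean similarity to negative ones."""
    vec = embeddings[target]
    pos = sum(cosine(vec, embeddings[w]) for w in positive) / len(positive)
    neg = sum(cosine(vec, embeddings[w]) for w in negative) / len(negative)
    return pos - neg


# Hypothetical toy embeddings standing in for two separately trained corpora.
corpus_a = {"concept": [1.0, 0.0], "good": [1.0, 0.0], "bad": [0.0, 1.0]}
corpus_b = {"concept": [0.0, 1.0], "good": [1.0, 0.0], "bad": [0.0, 1.0]}

score_a = association(corpus_a, "concept", ["good"], ["bad"])
score_b = association(corpus_b, "concept", ["good"], ["bad"])

# A large gap means the same concept carries a different valence in each space.
gap = score_a - score_b
print(score_a, score_b, gap)
```

With real embeddings, the dictionaries would be replaced by vectors looked up from each trained model, and the adjective lists would need to be chosen and validated with care.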

research
06/01/2021

Gender Bias Hidden Behind Chinese Word Embeddings: The Case of Chinese Adjectives

Gender bias in word embeddings gradually becomes a vivid research field ...
research
07/12/2021

How Could Equality and Data Protection Law Shape AI Fairness for People with Disabilities?

This article examines the concept of 'AI fairness' for people with disab...
research
08/29/2022

A systematic review of research on the use and impact of technology for learning Chinese

In light of technological development enforced by the pandemic, learning...
research
06/19/2019

Pre-Training with Whole Word Masking for Chinese BERT

Bidirectional Encoder Representations from Transformers (BERT) has shown...
research
12/02/2020

A Framework and Dataset for Abstract Art Generation via CalligraphyGAN

With the advancement of deep learning, artificial intelligence (AI) has ...
research
11/04/2022

Generation of Chinese classical poetry based on pre-trained model

In order to test whether artificial intelligence can create qualified cl...
research
12/06/2021

Analyzing a Carceral Algorithm used by the Pennsylvania Department of Corrections

Scholars have focused on algorithms used during sentencing, bail, and pa...
