SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining

06/01/2021
by Chenglei Si, et al.

Conventional tokenization methods for Chinese pretrained language models (PLMs) treat each character as an indivisible token (Devlin et al., 2019), ignoring the characteristics of the Chinese writing system. In this work, we comprehensively study the influence of three main factors on Chinese tokenization for PLMs: pronunciation, glyph (i.e., shape), and word boundary. Correspondingly, we propose three kinds of tokenizers: 1) SHUOWEN (meaning Talk Word), pronunciation-based tokenizers; 2) JIEZI (meaning Solve Character), glyph-based tokenizers; 3) word-segmented tokenizers, which apply Chinese word segmentation as a preprocessing step. To empirically compare the effectiveness of the studied tokenizers, we pretrain BERT-style language models with them and evaluate the models on various downstream NLU tasks. We find that SHUOWEN and JIEZI tokenizers generally outperform conventional single-character tokenizers, while Chinese word segmentation shows no benefit as a preprocessing step. Moreover, the proposed SHUOWEN and JIEZI tokenizers exhibit significantly better robustness in handling noisy texts. The code and pretrained models will be publicly released to facilitate linguistically informed Chinese NLP.
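To make the idea concrete, the following is a minimal sketch (not the authors' released code) of the intermediate representations such tokenizers operate on: a pronunciation view that maps each character to a toned pinyin syllable, and a glyph view that maps each character to a structural code. It assumes the third-party pypinyin package for the pronunciation mapping and uses a hypothetical CHAR2GLYPH table for glyph codes; the actual SHUOWEN/JIEZI tokenizers further train subword vocabularies over corpora transliterated this way.

    # Illustrative sketch: pronunciation-based (SHUOWEN-style) and
    # glyph-based (JIEZI-style) symbol sequences for Chinese text.
    # Requires the third-party `pypinyin` package.
    from pypinyin import Style, lazy_pinyin

    # Hypothetical toy glyph table for demonstration only; a real glyph
    # encoding (e.g., Wubi codes or stroke sequences) covers the full
    # character inventory.
    CHAR2GLYPH = {
        "中": "khk",
        "文": "yygy",
        "语": "ygkg",
    }

    def shuowen_symbols(text: str) -> list[str]:
        """Pronunciation view: one toned-pinyin syllable per character."""
        return lazy_pinyin(text, style=Style.TONE3, errors="default")

    def jiezi_symbols(text: str) -> list[str]:
        """Glyph view: one structural code per character (falls back to the character itself)."""
        return [CHAR2GLYPH.get(ch, ch) for ch in text]

    if __name__ == "__main__":
        sentence = "中文语言模型"
        print(shuowen_symbols(sentence))  # e.g. ['zhong1', 'wen2', 'yu3', ...]
        print(jiezi_symbols(sentence))
        # A subword learner (e.g., BPE) would then be trained on these symbol
        # sequences so that characters sharing pronunciation or components
        # can share sub-character tokens.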
