GujiBERT and GujiGPT: Construction of Intelligent Information Processing Foundation Language Models for Ancient Texts

by Dongbo Wang, et al.

In the context of the rapid development of large language models, we have trained and introduced GujiBERT and GujiGPT, foundation language models specifically designed for the intelligent information processing of ancient texts. These models were trained on an extensive corpus encompassing both simplified and traditional Chinese characters, enabling them to handle a range of natural language processing tasks for ancient books, including automatic sentence segmentation, punctuation, word segmentation, part-of-speech tagging, entity recognition, and automatic translation. Notably, the models achieved strong performance across a range of validation tasks on publicly available datasets. Our findings show that further self-supervised training on classical text corpora enhances the models' capability on downstream tasks. Moreover, the choice of character form (simplified or traditional), the scale of the corpus, and the initial model selection all exert significant influence on the final experimental outcomes. To accommodate the diverse text processing preferences of researchers in digital humanities and linguistics, we have developed three distinct categories comprising nine model variants in total. We believe that sharing these foundation language models specialized for ancient texts can facilitate the intelligent processing and scholarly exploration of ancient literary works and thereby contribute to the global dissemination of China's rich traditional culture in this new era.


