llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large Language Models and its Methodology

05/22/2023
by Masanori Hirano, et al.

This study constructed a Japanese chat dataset for tuning large language models (LLMs), consisting of about 8.4 million records. LLMs have recently been developed and are gaining popularity. However, high-performing LLMs are mainly built for English. There are two ways to make these LLMs support languages other than English: constructing an LLM from scratch or tuning an existing model. Both approaches require datasets. In this study, we focused on supporting Japanese and built a dataset for training or tuning LLMs in Japanese. The dataset covers various tasks, such as translation and knowledge tasks. In our experiment, we tuned an existing LLM using our dataset and evaluated its performance qualitatively. The results suggest that our dataset may be beneficial for LLMs. However, we also revealed some difficulties in constructing LLMs in languages other than English.


