#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models

08/14/2023
by Keming Lu, et al.

Foundation language models acquire instruction-following ability through supervised fine-tuning (SFT). Diversity and complexity are considered critical factors of a successful SFT dataset, yet their definitions remain obscure and lack quantitative analysis. In this work, we propose InsTag, an open-set fine-grained tagger that tags samples within SFT datasets based on semantics and intentions, and we define instruction diversity and complexity in terms of these tags. We obtain 6.6K tags to describe comprehensive user queries. We then analyze popular open-source SFT datasets and find that model ability grows with more diverse and complex data. Based on this observation, we propose an InsTag-based data selector that selects 6K diverse and complex samples from open-source datasets, and we fine-tune models on the InsTag-selected data. The resulting models, TagLM, outperform open-source models trained on considerably larger SFT datasets when evaluated on MT-Bench, echoing the importance of query diversity and complexity. We open-source InsTag at https://github.com/OFA-Sys/InsTag.
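The abstract does not spell out how tags are turned into metrics or how the selector works, so the following is only a minimal sketch under stated assumptions: complexity is taken to be the number of tags on a sample, diversity is taken to be the number of distinct tags a set of samples covers, and the selector is a simple greedy pass that prefers complex samples contributing unseen tags. The function names (`complexity`, `diversity`, `select_samples`) and the toy data are hypothetical and are not taken from the InsTag repository.

```python
from typing import Dict, List, Set


def complexity(tags: Set[str]) -> int:
    """Assumed complexity measure: how many tags a single sample carries."""
    return len(tags)


def diversity(tagged: Dict[str, Set[str]], sample_ids: List[str]) -> int:
    """Assumed diversity measure: number of distinct tags covered by the samples."""
    covered: Set[str] = set()
    for sid in sample_ids:
        covered |= tagged[sid]
    return len(covered)


def select_samples(tagged: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedy, complexity-first selection (an illustrative assumption).

    Samples are visited from most to least tags (ties broken by id). A sample
    is kept if it adds at least one tag not yet covered. If the budget is not
    exhausted after that pass, the remaining slots are filled with the most
    complex samples that were skipped.
    """
    ordered = sorted(tagged, key=lambda sid: (-complexity(tagged[sid]), sid))
    selected: List[str] = []
    covered: Set[str] = set()

    # First pass: keep samples that extend tag coverage.
    for sid in ordered:
        if len(selected) >= budget:
            break
        if tagged[sid] - covered:
            selected.append(sid)
            covered |= tagged[sid]

    # Second pass: fill any remaining budget by complexity order.
    for sid in ordered:
        if len(selected) >= budget:
            break
        if sid not in selected:
            selected.append(sid)

    return selected


# Toy tagged dataset (hypothetical): sample id -> set of tags.
EXAMPLE = {
    "q1": {"math", "python", "debugging"},
    "q2": {"math", "python"},
    "q3": {"translation"},
    "q4": {"math"},
    "q5": {"poetry", "translation"},
}

if __name__ == "__main__":
    chosen = select_samples(EXAMPLE, budget=2)
    print(chosen, diversity(EXAMPLE, chosen))
```

With a budget of two, this sketch keeps `q1` and `q5`, which together cover all five tags in the toy data. In practice the tags would come from the tagger and the budget would be in the thousands, as with the 6K samples mentioned above.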


