TeGit: Generating High-Quality Instruction-Tuning Data with Text-Grounded Task Design

09/11/2023
by Yongrui Chen, et al.

High-quality instruction-tuning data is critical to improving LLM capabilities. Existing data collection methods are limited either by the prohibitive cost of manual labeling or by the hallucinations that arise from relying solely on LLM generation. To address these problems, this paper presents a scalable method for automatically collecting high-quality instruction-tuning data by training language models to design tasks based on human-written texts. Intuitively, grounding task generation in human-written text helps the model reduce hallucinations. Unlike instruction back-translation-based methods, which directly take the given text as the response, we require the model to generate the instruction, input, and output simultaneously in order to filter out noise. The results of automated and manual evaluation experiments demonstrate the quality of our dataset.
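
The contrast drawn in the abstract between back-translation and text-grounded task design can be illustrated with a minimal sketch. This is not the authors' implementation: the `Task` record, the function names, and the toy stand-in models below are all hypothetical, and a real task designer would be a trained language model rather than a hand-written rule.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    instruction: str
    input: str
    output: str


def backtranslation_sample(text: str, write_instruction: Callable[[str], str]) -> Task:
    # Back-translation style: the raw text itself is used as the response,
    # so any noise in the text ends up in the training target.
    return Task(instruction=write_instruction(text), input="", output=text)


def text_grounded_sample(
    text: str, design_task: Callable[[str], Optional[Task]]
) -> Optional[Task]:
    # Text-grounded style: a task designer produces instruction, input, and
    # output together from the text; incomplete tasks are dropped.
    task = design_task(text)
    if task is None or not task.instruction.strip() or not task.output.strip():
        return None
    return task


# Hypothetical stand-ins for trained models, used only to make the sketch runnable.
def toy_instruction_writer(text: str) -> str:
    return "Write a passage about the following topic."


def toy_task_designer(text: str) -> Optional[Task]:
    if len(text.split()) < 5:
        return None  # too little content to ground a task
    first_sentence = text.split(".")[0].strip() + "."
    return Task(
        instruction="State the main claim of the passage in one sentence.",
        input=text,
        output=first_sentence,
    )


passage = "Instruction tuning adapts a model to follow requests. Click here to subscribe."
bt = backtranslation_sample(passage, toy_instruction_writer)
tg = text_grounded_sample(passage, toy_task_designer)
rejected = text_grounded_sample("Click here", toy_task_designer)
```

In this toy run the back-translation sample carries the "Click here to subscribe." residue into its output, while the text-grounded sample uses the passage only as input and produces a separate output; the short noisy text is rejected outright.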

