Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

by   Deepanway Ghosal, et al.
Singapore University of Technology and Design

The immense scale of the recent large language models (LLM) allows many interesting properties, such as, instruction- and chain-of-thought-based fine-tuning, that has significantly improved zero- and few-shot performance in many natural language processing (NLP) tasks. Inspired by such successes, we adopt such an instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio (TTA) generation – a task where the goal is to generate an audio from its textual description. The prior works on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model, such as, T5. Consequently, our latent diffusion model (LDM)-based approach TANGO outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on AudioCaps test set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. This improvement might also be attributed to the adoption of audio pressure level-based sound mixing for training set augmentation, whereas the prior methods take a random mix.


page 1

page 3

page 13


Evaluating the Zero-shot Robustness of Instruction-tuned Language Models

Instruction fine-tuning has recently emerged as a promising approach for...

Pengi: An Audio Language Model for Audio Tasks

In the domain of audio processing, Transfer Learning has facilitated the...

Improving Diffusion Models for Scene Text Editing with Dual Encoders

Scene text editing is a challenging task that involves modifying or inse...

Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics

Multi-modal large language models (MLLMs) are trained based on large lan...

GRASS: Unified Generation Model for Speech-to-Semantic Tasks

This paper explores the instruction fine-tuning technique for speech-to-...

LeTI: Learning to Generate from Textual Interactions

Finetuning pre-trained language models (LMs) enhances the models' capabi...

Please sign up or login with your details

Forgot password? Click here to reset