ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

05/22/2023
by Huadai Liu, et al.

Text-to-speech (TTS) has undergone remarkable improvements in performance, particularly with the advent of Denoising Diffusion Probabilistic Models (DDPMs). However, the perceived quality of audio depends not solely on its content, pitch, rhythm, and energy, but also on the physical environment. In this work, we propose ViT-TTS, the first visual TTS model with scalable diffusion transformers. ViT-TTS complements the phoneme sequence with visual information to generate audio of high perceived quality, opening up new avenues for practical applications of AR and VR to allow a more immersive and realistic audio experience. To mitigate data scarcity in learning visual acoustic information, we 1) introduce a self-supervised learning framework to enhance both the visual-text encoder and the denoiser decoder; 2) leverage a diffusion transformer that scales in parameters and capacity to learn visual scene information. Experimental results demonstrate that ViT-TTS achieves new state-of-the-art results, outperforming cascaded systems and other baselines regardless of the visibility of the scene. With low-resource data (1h, 2h, 5h), ViT-TTS achieves results comparable to rich-resource baselines. [Audio samples are available at <https://ViT-TTS.github.io/>.]

research
05/24/2023

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

Direct speech-to-speech translation (S2ST) aims to convert speech from o...
research
01/30/2023

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Large-scale multimodal generative modeling has created milestones in tex...
research
02/14/2022

Visual Acoustic Matching

We introduce the visual acoustic matching task, in which an audio clip i...
research
12/18/2022

BEATs: Audio Pre-Training with Acoustic Tokenizers

The massive growth of self-supervised learning (SSL) has been witnessed ...
research
05/20/2023

ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios

Text to Speech (TTS) models can generate natural and high-quality speech...
research
04/06/2021

NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling

In this work, we introduce NU-Wave, the first neural audio upsampling mo...
research
07/27/2021

The CORSMAL benchmark for the prediction of the properties of containers

Acoustic and visual sensing can support the contactless estimation of th...
