Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS

by Myeongjin Ko, et al.

Diffusion models can generate high-quality data through a probabilistic approach, but generation is slow because sampling requires a large number of time steps. To address this limitation, recent models such as denoising diffusion implicit models (DDIM) focus on generating samples without directly modeling the probability distribution, while models such as denoising diffusion generative adversarial networks (denoising diffusion GANs) combine the diffusion process with GANs. In speech synthesis, DiffGAN-TTS, a recent diffusion-based model built on a GAN structure, demonstrates superior performance in both speech quality and generation speed. In this paper, to further improve on DiffGAN-TTS, we propose a speech synthesis model with two discriminators: a diffusion discriminator that learns the distribution of the reverse process and a spectrogram discriminator that learns the distribution of the generated data. We evaluate the proposed model with objective metrics, namely structural similarity index measure (SSIM), mel-cepstral distortion (MCD), F0 root mean squared error (F0 RMSE), short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ), and with a subjective metric, mean opinion score (MOS). The evaluation results show that the proposed model outperforms recent state-of-the-art models such as FastSpeech2 and DiffGAN-TTS on various metrics. Our implementation and audio samples are located on GitHub.
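Two of the objective metrics named above, MCD and F0 RMSE, have simple closed forms, and a minimal sketch may help make them concrete. The code below uses the commonly cited definitions; it is not taken from the authors' implementation. The function names are hypothetical, and the choices made here are assumptions, since the abstract does not describe the evaluation setup: frames are taken to be already time-aligned, the 0th cepstral coefficient is excluded by the caller, and F0 error is measured only on frames voiced in both tracks.

```python
import math


def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB over time-aligned frames.

    Each frame is a sequence of mel-cepstral coefficients. Assumption:
    the caller has already dropped the 0th (energy) coefficient and
    aligned the two sequences frame by frame.
    """
    if len(ref) != len(syn) or not ref:
        raise ValueError("ref and syn must be non-empty and time-aligned")
    # Standard constant: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for r, s in zip(ref, syn):
        total += scale * math.sqrt(sum((a - b) ** 2 for a, b in zip(r, s)))
    return total / len(ref)


def f0_rmse(ref_f0, syn_f0):
    """RMSE in Hz over frames that are voiced (F0 > 0) in both tracks.

    Assumption: unvoiced frames are encoded as 0 and are skipped.
    """
    pairs = [(a, b) for a, b in zip(ref_f0, syn_f0) if a > 0 and b > 0]
    if not pairs:
        raise ValueError("no frames are voiced in both tracks")
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))
```

Lower values are better for both metrics. In practice, reference and synthesized utterances usually differ in length, so an alignment step such as dynamic time warping would typically precede these calls; that step is omitted here.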




