Mixture Density Network for Phone-Level Prosody Modelling in Speech Synthesis

02/01/2021
by   Chenpeng Du, et al.
0

Recent researches on both utterance-level and phone-level prosody modelling successfully improve the voice quality and naturalness in text-to-speech synthesis. However, most of them model the prosody with a unimodal distribution such like a single Gaussian, which is not reasonable enough. In this work, we focus on phone-level prosody modelling where we introduce a Gaussian mixture model(GMM) based mixture density network. Our experiments on the LJSpeech dataset demonstrate that GMM can better model the phone-level prosody than a single Gaussian. The subjective evaluations suggest that our method not only significantly improves the prosody diversity in synthetic speech without the need of manual control, but also achieves a better naturalness. We also find that using the additional mixture density network has only very limited influence on inference speed.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset