A Variational Prosody Model for the decomposition and synthesis of speech prosody

by Branislav Gerazov, et al.

The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge of speech communication research. More traditional intonation models have given way to the overwhelming performance of artificial intelligence (AI) techniques for training model-free, end-to-end mappings using millions of tunable parameters. The shift towards machine learning models has nonetheless posed the reverse problem: a compelling need to discover knowledge, and to explain, visualise and interpret. Our work bridges a comprehensive generative model of intonation and state-of-the-art AI techniques. We build upon the modelling paradigm of the Superposition of Functional Contours model and propose a Variational Prosody Model (VPM) that uses a network of deep variational contour generators to capture the context-sensitive variation of the constituent elementary prosodic clichés. We show that the VPM can give insight into the intrinsic variability of these prosodic prototypes by learning a meaningfully structured prosodic latent space. We also show that the VPM brings improved modelling performance, especially when such variability is prominent. In a speech synthesis scenario, we believe the model can be used to generate a dynamic and natural prosody contour largely devoid of averaging effects.




