Learning and controlling the source-filter representation of speech with a variational autoencoder

by Samir Sadok, et al.

Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency f_0 and the formants are of primary importance. In this work, we show that the source-filter model of speech production naturally arises in the latent space of a variational autoencoder (VAE) trained in an unsupervised manner on a dataset of natural speech signals. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we experimentally illustrate that f_0 and the formant frequencies are encoded in orthogonal subspaces of the VAE latent space, and we develop a weakly-supervised method to accurately and independently control these speech factors of variation within the learned latent subspaces. Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms that is conditioned on f_0 and the formant frequencies, and which is applied to the transformation of speech signals.
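The idea of locating a factor of variation in an orthogonal latent subspace and then editing a latent code within it can be sketched with a toy example. The sketch below is not the authors' method; it is a minimal, assumed illustration in which latent codes of synthetic examples that vary only in one factor (say, f_0) are analyzed with PCA to estimate the subspace encoding that factor, after which the subspace component of any latent code can be replaced to control the factor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in latents: 16-dim codes whose variation lies (up to small
# noise) in a hidden 2-D subspace, mimicking f_0-only variation.
latent_dim, n = 16, 200
basis = np.linalg.qr(rng.normal(size=(latent_dim, 2)))[0]  # true subspace
coords = rng.normal(size=(n, 2))
Z = coords @ basis.T + 0.01 * rng.normal(size=(n, latent_dim))

# PCA on mean-centered latent codes: the top principal directions span
# the estimated factor subspace.
Zc = Z - Z.mean(axis=0)
_, _, Vt = np.linalg.svd(Zc, full_matrices=False)
W = Vt[:2].T  # estimated orthonormal basis (latent_dim x 2)

def set_factor(z, target_coords, W):
    """Remove the component of z inside the factor subspace and
    re-insert the desired coordinates, leaving the orthogonal
    complement (other factors) untouched."""
    return z - W @ (W.T @ z) + W @ target_coords

z_new = set_factor(Z[0], np.array([1.0, -0.5]), W)
```

Decoding `z_new` with the VAE decoder would then, under this scheme, change only the targeted factor, since the complement of the subspace is left intact.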


A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling

The Variational Autoencoder (VAE) is a powerful deep generative model th...

Learning robust speech representation with an articulatory-regularized variational autoencoder

It is increasingly considered that human speech perception and productio...

Exploring TTS without T Using Biologically/Psychologically Motivated Neural Network Modules (ZeroSpeech 2020)

In this study, we reported our exploration of Text-To-Speech without Tex...

Disentangling Speech and Non-Speech Components for Building Robust Acoustic Models from Found Data

In order to build language technologies for majority of the languages, i...

Modeling neural dynamics during speech production using a state space variational autoencoder

Characterizing the neural encoding of behavior remains a challenging tas...

Scalable Factorized Hierarchical Variational Autoencoder Training

Deep generative models have achieved great success in unsupervised learn...

Deep Generative Networks for Heterogeneous Augmentation of Cranial Defects

The design of personalized cranial implants is a challenging and tremend...