A Multimodal Dynamical Variational Autoencoder for Audiovisual Speech Representation Learning

by Samir Sadok, et al.

In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists of learning the MDVAE model on the intermediate representations of the VQ-VAEs, before quantization. The disentanglement between static and dynamical information, and between modality-specific and modality-common information, occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with little labeled data, and with better accuracy than unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
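The latent-space structure described above can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: the dimensions, variable names, and the linear "decoders" are hypothetical stand-ins for the learned networks. It shows how each audio frame would depend on the static latent `w`, the shared dynamical latent `z_av[t]`, and the audio-specific latent `z_a[t]`, while each visual frame depends on `w`, `z_av[t]`, and the visual-specific latent `z_v[t]`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
T = 10                      # number of time steps in the sequence
d_w, d_av, d_a, d_v = 16, 8, 4, 4          # latent dimensions
d_audio, d_visual = 32, 64                 # per-frame feature dimensions

# MDVAE-style latent partition (sketch):
#   w       -- static, shared across time and modalities
#   z_av[t] -- dynamical, shared by audio and video at each time step
#   z_a[t]  -- dynamical, audio-specific
#   z_v[t]  -- dynamical, visual-specific
w = rng.standard_normal(d_w)
z_av = rng.standard_normal((T, d_av))
z_a = rng.standard_normal((T, d_a))
z_v = rng.standard_normal((T, d_v))

# Linear maps standing in for the learned decoder networks.
W_audio = rng.standard_normal((d_w + d_av + d_a, d_audio))
W_visual = rng.standard_normal((d_w + d_av + d_v, d_visual))

def decode_audio(t):
    # An audio frame is generated from the static, shared-dynamical,
    # and audio-specific latents -- never from the visual-specific one.
    h = np.concatenate([w, z_av[t], z_a[t]])
    return h @ W_audio

def decode_visual(t):
    # Symmetrically, a visual frame uses the visual-specific latent.
    h = np.concatenate([w, z_av[t], z_v[t]])
    return h @ W_visual

audio_seq = np.stack([decode_audio(t) for t in range(T)])
visual_seq = np.stack([decode_visual(t) for t in range(T)])
print(audio_seq.shape, visual_seq.shape)  # (10, 32) (10, 64)
```

Because `w` is held fixed over the whole sequence, editing it changes time-invariant attributes of both modalities at once, whereas editing `z_a` or `z_v` only affects one modality; this is the mechanism behind the manipulation experiments mentioned above.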


