Unsupervised Any-to-Many Audiovisual Synthesis via Exemplar Autoencoders

01/13/2020
by   Kangle Deng, et al.

We present an unsupervised approach that converts the speech of any individual into the voice of any of a potentially unbounded set of target speakers: one can stand in front of a microphone and make their favorite celebrity say the same words. Our approach builds on the observation that simple autoencoders project out-of-sample data onto the distribution of their training set (motivated by PCA and linear autoencoders). We train an exemplar autoencoder to capture the voice and specific style (emotions and ambiance) of a single target speaker. In contrast to existing methods, the proposed approach extends easily to an arbitrarily large number of speakers in very little time, using only two to three minutes of audio per speaker. We also demonstrate the usefulness of our approach for generating video from audio signals and vice versa. We encourage the reader to visit our project webpage for various synthesized examples: https://dunbar12138.github.io/projectpage/Audiovisual/
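To make the projection idea concrete, the sketch below shows the exemplar-autoencoder recipe in PyTorch: train a plain reconstruction autoencoder on the target speaker's mel-spectrogram frames only, then convert by running any source speaker's frames through it, so the output is pulled onto the target's distribution. The frame-level MLP architecture, layer sizes, and names (ExemplarAutoencoder, train_on_target) are illustrative assumptions for this sketch, not the paper's actual model, which is more elaborate and ultimately synthesizes a waveform with a vocoder.

```python
import torch
import torch.nn as nn

class ExemplarAutoencoder(nn.Module):
    """Per-speaker autoencoder over mel-spectrogram frames.

    Trained only on one target speaker's audio, the reconstruction
    objective makes the model act as a projection onto that speaker's
    data distribution (the PCA / linear-autoencoder intuition).
    """
    def __init__(self, n_mels: int = 80, bottleneck: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(),
            nn.Linear(256, bottleneck),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 256), nn.ReLU(),
            nn.Linear(256, n_mels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def train_on_target(model, target_mels, epochs=100, lr=1e-3):
    """Fit the autoencoder on a few minutes of one speaker's frames."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(target_mels), target_mels)
        loss.backward()
        opt.step()
    return model

if __name__ == "__main__":
    torch.manual_seed(0)
    model = ExemplarAutoencoder()
    # Stand-in tensors; real use would load mel frames from audio.
    target_mels = torch.randn(1000, 80)  # target speaker's training frames
    source_mels = torch.randn(200, 80)   # any out-of-sample input speaker
    train_on_target(model, target_mels, epochs=5)
    with torch.no_grad():
        converted = model(source_mels)   # projected onto target's voice/style
    print(converted.shape)  # torch.Size([200, 80])
```

Note the design point that makes the method unsupervised and any-to-many: training sees only the target speaker (no parallel data, no source-speaker labels), so adding a new target speaker is just fitting one more small autoencoder, and conversion is ordinary inference on out-of-sample input.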


Related research

05/09/2019 · Adversarially Trained Autoencoders for Parallel-Data-Free Voice Conversion
We present a method for converting the voices between a set of speakers...

11/17/2020 · Learn2Sing: Target Speaker Singing Voice Synthesis by Learning from a Singing Teacher
Singing voice synthesis has been paid rising attention with the rapid de...

04/30/2019 · Many-to-Many Voice Conversion with Out-of-Dataset Speaker Support
We present a Cycle-GAN based many-to-many voice conversion method that c...

05/31/2017 · Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers
In this paper, we present a system that associates faces with voices in ...

02/20/2018 · Fitting New Speakers Based on a Short Untranscribed Sample
Learning-based Text To Speech systems have the potential to generalize f...

01/26/2022 · J-MAC: Japanese Multi-Speaker Audiobook Corpus for Speech Synthesis
In this paper, we construct a Japanese audiobook speech corpus called "J...

08/31/2023 · Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis
Existing automated dubbing methods are usually designed for Professional...
