GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis

05/15/2022
by   Rongjie Huang, et al.

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with an unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference. It faces two challenges: 1) the highly dynamic style features of expressive voice are difficult to model and transfer; and 2) TTS models must be robust enough to handle diverse OOD conditions that differ from the source training data. This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes speech variation into style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor that efficiently models a wide range of style conditions, including global speaker and emotion characteristics as well as local (utterance-, phoneme-, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization, which eliminates style information in the linguistic content representation and thus improves model generalization. Evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses state-of-the-art models in both audio quality and style similarity. Extension studies on adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting. Audio samples are available at <https://GenerSpeech.github.io/>
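The style-conditioned normalization idea behind Mix-Style Layer Normalization can be illustrated with a minimal sketch: layer normalization whose scale and shift are predicted from a mixup of two style embeddings, so the content representation is never tied to one consistent style. The function name, shapes, and the linear scale/shift predictors below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def mix_style_layer_norm(x, style_a, style_b, lam, w_gamma, w_beta, eps=1e-5):
    """Illustrative sketch (not the paper's code) of style-mixed layer norm.

    x:        (T, H) hidden states (e.g., phoneme-level content features)
    style_a:  (S,) style embedding of one reference utterance
    style_b:  (S,) style embedding of another reference utterance
    lam:      mixing weight in [0, 1]
    w_gamma:  (S, H) hypothetical linear predictor for the scale
    w_beta:   (S, H) hypothetical linear predictor for the shift
    """
    # Mix the two style conditions (the "mix-style" step).
    style = lam * style_a + (1.0 - lam) * style_b
    # Predict a residual scale around 1 and a shift from the mixed style.
    gamma = 1.0 + style @ w_gamma
    beta = style @ w_beta
    # Standard layer normalization over the feature axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    return gamma * x_norm + beta
```

With zero style embeddings this reduces to plain layer normalization; randomizing the mixing weight during training is what perturbs the style condition and encourages a style-agnostic content representation.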


Related research

- 07/30/2023 - HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer
- 11/14/2021 - Meta-Voice: Fast Few-shot Style Transfer for Expressive Voice Cloning Using Meta Learning
- 06/16/2021 - Global Rhythm Style Transfer Without Text Transcriptions
- 09/23/2021 - Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning
- 11/04/2022 - NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS
- 10/25/2019 - Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency
- 08/09/2023 - VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer
