Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training

03/31/2021
by Kun Zhou, et al.

Emotional voice conversion (EVC) aims to change the emotional state of an utterance while preserving its linguistic content and speaker identity. In this paper, we propose a novel two-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data. The proposed EVC framework leverages text-to-speech (TTS), since the two tasks share a common goal: generating high-quality expressive voice. In stage 1, we perform style initialization with a multi-speaker TTS corpus to disentangle speaking style and linguistic content. In stage 2, we perform emotion training with a limited amount of emotional speech data to learn how to disentangle emotional style and linguistic information from the speech. The proposed framework can perform both spectrum and prosody conversion, and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluations.
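The two-stage schedule described in the abstract (initialize on a large corpus, then continue training on a small one) can be illustrated with a toy sketch. This is not the authors' model: a two-parameter linear regressor stands in for the sequence-to-sequence network, and the synthetic `large_corpus` and `small_corpus`, the learning rate, and the epoch counts are all assumptions chosen only to show the training order.

```python
def train(weights, data, lr, epochs):
    """Plain SGD on squared error for y = w0 + w1 * x; returns updated weights."""
    w0, w1 = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w0 + w1 * x) - y
            w0 -= lr * err
            w1 -= lr * err * x
    return (w0, w1)


def mse(weights, data):
    """Mean squared error of the linear model on a dataset."""
    w0, w1 = weights
    return sum(((w0 + w1 * x) - y) ** 2 for x, y in data) / len(data)


# Hypothetical stand-in for the large multi-speaker TTS corpus: y = 2x + 1.
large_corpus = [(x, 2.0 * x + 1.0)
                for x in (-1.0 + 2.0 * i / 199 for i in range(200))]

# Hypothetical stand-in for the limited emotional data: a related but
# shifted target, y = 2x + 1.5, observed at only five points.
small_corpus = [(x, 2.0 * x + 1.5) for x in (-0.8, -0.3, 0.1, 0.4, 0.9)]

# Held-out points from the stage-2 target, used only for comparison.
held_out = [(x, 2.0 * x + 1.5) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

# Stage 1: initialization on the large corpus.
stage1 = train((0.0, 0.0), large_corpus, lr=0.05, epochs=5)

# Stage 2: continue from the stage-1 weights on the small corpus.
two_stage = train(stage1, small_corpus, lr=0.05, epochs=5)

# Reference: the same small-corpus budget, but starting from scratch.
scratch = train((0.0, 0.0), small_corpus, lr=0.05, epochs=5)

print("two-stage held-out MSE:", mse(two_stage, held_out))
print("from-scratch held-out MSE:", mse(scratch, held_out))
```

In this toy setting the second stage only has to learn the small shift between the two targets, which is the intuition behind reusing a TTS-initialized model when emotional data is scarce. It says nothing about the style/content disentanglement itself, which depends on the architecture described in the full paper.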

research
10/20/2021

Identity Conversion for Emotional Speakers: A Study for Disentanglement of Emotion Style and Speaker Identity

Expressive voice conversion performs identity conversion for emotional s...
research
07/18/2021

An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation

Emotional Voice Conversion (EVC) aims to convert the emotional style of ...
research
11/11/2019

Emotional Voice Conversion using multitask learning with Text-to-speech

Voice conversion (VC) is a task to transform a person's voice to differe...
research
02/01/2020

Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

Emotional voice conversion is to convert the spectrum and prosody to cha...
research
03/29/2022

An Overview Analysis of Sequence-to-Sequence Emotional Voice Conversion

Emotional voice conversion (EVC) focuses on converting a speech utteranc...
research
11/28/2019

Using VAEs and Normalizing Flows for One-shot Text-To-Speech Synthesis of Expressive Speech

We propose a Text-to-Speech method to create an unseen expressive style ...
research
04/05/2021

StarGAN-based Emotional Voice Conversion for Japanese Phrases

This paper shows that StarGAN-VC, a spectral envelope transformation met...
