An analysis on the effects of speaker embedding choice in non auto-regressive TTS

07/19/2023
by Adriana Stan, et al.

In this paper we introduce a first attempt at understanding how a non-autoregressive, factorised multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets. We analyse whether jointly learning the representations, and initialising them from pretrained models, yield any quality improvements for target speaker identities. In a separate analysis, we investigate how the different sets of embeddings affect the network's core speech abstraction (i.e. zero conditioned) in terms of speaker identity and representation learning. We show that, regardless of the embedding set and learning strategy used, the network handles various speaker identities equally well, with barely noticeable variations in speech output quality, and that speaker leakage within the core structure of the synthesis system is inevitable under the standard training procedures adopted thus far.


