Mixture factorized auto-encoder for unsupervised hierarchical deep factorization of speech signal

10/30/2019
by   Zhiyuan Peng, et al.

A speech signal is shaped by several informative factors, such as linguistic content and speaker characteristics. Notable recent studies have attempted to factorize speech into these individual factors without requiring any annotation. These studies typically assume a continuous representation for linguistic content, which is at odds with general linguistic knowledge and may make the extraction of speaker information less successful. This paper proposes the mixture factorized auto-encoder (mFAE) for unsupervised deep factorization. The encoder part of mFAE comprises a frame tokenizer and an utterance embedder. The frame tokenizer models the linguistic content of input speech with a discrete categorical distribution; it performs frame clustering by assigning each frame a soft mixture label. The utterance embedder generates an utterance-level vector representation. A frame decoder reconstructs speech features from the encoders' outputs. The mFAE is evaluated on a speaker verification (SV) task and an unsupervised subword modeling (USM) task. The SV experiments on VoxCeleb 1 show that the utterance embedder can extract speaker-discriminative embeddings with performance comparable to an x-vector baseline. The USM experiments on the ZeroSpeech 2017 dataset verify that the frame tokenizer captures linguistic content and the utterance embedder acquires speaker-related information.
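The data flow described in the abstract (frame tokenizer, utterance embedder, frame decoder) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the network layers, so the single linear maps, the mean pooling in the embedder, the concatenation fed to the decoder, the mean-squared reconstruction error, and every name and dimension below (`mfae_forward`, `w_tok`, `w_emb`, `w_dec`, 4-dimensional features, 3 mixture components) are assumptions chosen only to keep the example small and dependency-free.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a categorical distribution that sums to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights; stands in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def mfae_forward(frames, w_tok, w_emb, w_dec):
    # Frame tokenizer: one soft mixture label (posterior) per frame.
    posteriors = [softmax(linear(w_tok, f)) for f in frames]

    # Utterance embedder: one vector for the whole utterance.
    # Mean pooling over frames is an assumption of this sketch.
    projected = [linear(w_emb, f) for f in frames]
    embedding = [sum(col) / len(frames) for col in zip(*projected)]

    # Frame decoder: rebuild each frame from its posterior plus the
    # shared utterance embedding (concatenation is an assumption).
    recon = [linear(w_dec, p + embedding) for p in posteriors]

    # Mean-squared reconstruction error (assumed training signal).
    squared = [(r - x) ** 2
               for rf, f in zip(recon, frames)
               for r, x in zip(rf, f)]
    loss = sum(squared) / len(squared)
    return posteriors, embedding, recon, loss


# Toy utterance: 5 frames of 4-dimensional features (hypothetical sizes).
rng = random.Random(0)
feat_dim, num_mix, emb_dim, num_frames = 4, 3, 2, 5
frames = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)]
          for _ in range(num_frames)]
w_tok = random_matrix(num_mix, feat_dim, rng)
w_emb = random_matrix(emb_dim, feat_dim, rng)
w_dec = random_matrix(feat_dim, num_mix + emb_dim, rng)

posteriors, embedding, recon, loss = mfae_forward(frames, w_tok, w_emb, w_dec)
```

In this sketch the per-frame posteriors play the role of the discrete linguistic representation, while the single pooled embedding carries whatever is constant across the utterance, which is the intuition behind using it for speaker verification.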


research
10/17/2019

H-VECTORS: Utterance-level Speaker Embedding Using A Hierarchical Attention Model

In this paper, a hierarchical attention network to generate utterance-le...
research
06/19/2021

Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

One-shot voice conversion has received significant attention since only ...
research
11/03/2019

Interpreting Verbal Irony: Linguistic Strategies and the Connection to the Type of Semantic Incongruity

Human communication often involves the use of verbal irony or sarcasm, w...
research
06/17/2019

Improving Unsupervised Subword Modeling via Disentangled Speech Representation Learning and Transformation

This study tackles unsupervised subword modeling in the zero-resource sc...
research
10/27/2020

Deep generative factorization for speech signal

Various information factors are blended in speech signals, which forms t...
research
10/29/2019

On Investigation of Unsupervised Speech Factorization Based on Normalization Flow

Speech signals are complex composites of various information, including ...
