SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

04/23/2023
by Jianzong Wang, et al.

In recent Text-to-Speech (TTS) systems, a neural vocoder often generates speech samples by conditioning solely on acoustic features predicted by an acoustic model. However, the predicted acoustic features are always distorted relative to the ground truth, especially in the common case where acoustic modeling is poor because of low-quality training data. To overcome this limitation, we propose a Self-supervised learning framework that learns an Anti-distortion acoustic Representation (SAR) to replace human-crafted acoustic features, by introducing a distortion prior into an auto-encoder pre-training process. Both objective and subjective evaluations show that the acoustic representation learned by the proposed framework is more robust to distortion than the most commonly used acoustic feature, the mel-spectrogram.
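The general idea of pre-training an auto-encoder on deliberately distorted inputs can be illustrated with a small denoising auto-encoder. The following is a minimal sketch under stated assumptions, not the SAR architecture: the abstract does not specify the distortion or the network, so the additive-noise-plus-masking `distort` function, the linear encoder and decoder, the synthetic stand-in for mel-spectrogram frames, and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np


def distort(features, rng, noise_std=0.1, mask_prob=0.1):
    """Hypothetical distortion: additive Gaussian noise plus random zero-masking."""
    noisy = features + rng.normal(0.0, noise_std, size=features.shape)
    keep = rng.random(features.shape) >= mask_prob
    return noisy * keep


def train_denoising_autoencoder(clean, latent_dim=8, steps=500, lr=0.05, seed=0):
    """Train a linear auto-encoder to rebuild clean frames from distorted ones."""
    rng = np.random.default_rng(seed)
    _, feat_dim = clean.shape
    w_enc = rng.normal(0.0, 0.1, size=(feat_dim, latent_dim))
    w_dec = rng.normal(0.0, 0.1, size=(latent_dim, feat_dim))
    losses = []
    for _ in range(steps):
        x = distort(clean, rng)      # fresh distortion at every step
        z = x @ w_enc                # latent representation of the distorted input
        recon = z @ w_dec            # reconstruction
        err = recon - clean          # the target is the clean input
        losses.append(float(np.mean(err ** 2)))
        grad_recon = 2.0 * err / err.size           # d(MSE)/d(recon)
        grad_dec = z.T @ grad_recon
        grad_enc = x.T @ (grad_recon @ w_dec.T)
        w_dec -= lr * grad_dec
        w_enc -= lr * grad_enc
    return w_enc, w_dec, losses


# Synthetic low-rank "frames" standing in for real acoustic features (assumption).
data_rng = np.random.default_rng(42)
clean = data_rng.normal(size=(256, 4)) @ data_rng.normal(size=(4, 20))
clean = clean / clean.std()

w_enc, w_dec, losses = train_denoising_autoencoder(clean)
print(f"reconstruction MSE: first step {losses[0]:.3f}, last step {losses[-1]:.3f}")
```

Because the encoder only ever sees distorted inputs while the loss compares against clean targets, the latent `z` is pushed toward features that survive the distortion. In a real system the encoder output would replace the mel-spectrogram as the vocoder's conditioning feature; that wiring is outside this sketch.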

research
11/24/2022

TESSP: Text-Enhanced Self-Supervised Speech Pre-training

Self-supervised speech pre-training empowers the model with the contextu...
research
06/15/2021

Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis

Text does not fully specify the spoken form, so text-to-speech models mu...
research
10/22/2020

MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation

End-to-end Speech-to-text Translation (E2E-ST), which directly translat...
research
06/06/2022

UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder

In this paper, we propose a novel unsupervised text-to-speech (UTTS) fra...
research
04/02/2022

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

The mainstream neural text-to-speech (TTS) pipeline is a cascade system, ...
research
06/21/2021

Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-Encoder For High Fidelity Flow-based Speech Synthesis

Current two-stage TTS framework typically integrates an acoustic model w...
research
01/25/2020

Multi-task self-supervised learning for Robust Speech Recognition

Despite the growing interest in unsupervised learning, extracting meanin...
