FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

06/08/2020
by   Yi Ren, et al.

Advanced text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of the FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated, and 2) the duration extracted from the teacher model is not accurate enough, and the target mel-spectrograms distilled from the teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with the ground-truth target instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, we extract duration, pitch and energy from the speech waveform and directly take them as conditional inputs during training, and use predicted values during inference. We further design FastSpeech 2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of fully end-to-end training and even faster inference than FastSpeech. Experimental results show that 1) FastSpeech 2 and 2s outperform FastSpeech in voice quality with a much simplified training pipeline and reduced training time; and 2) FastSpeech 2 and 2s can match the voice quality of autoregressive models while enjoying much faster inference speed.
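The abstract describes extracting frame-level variance information (duration, pitch, energy) from the speech waveform to use as conditional inputs. As a minimal illustration of one of these features, the sketch below computes frame-level energy as the L2 norm of the STFT magnitude of each frame — the frame and hop sizes here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def frame_energy(waveform, frame_length=1024, hop_length=256):
    """Frame-level energy: L2 norm of the STFT magnitude per frame.

    frame_length/hop_length are illustrative values, not the paper's
    exact preprocessing configuration.
    """
    n_frames = 1 + (len(waveform) - frame_length) // hop_length
    window = np.hanning(frame_length)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = waveform[i * hop_length : i * hop_length + frame_length] * window
        mag = np.abs(np.fft.rfft(frame))  # magnitude spectrum of this frame
        energies[i] = np.sqrt(np.sum(mag ** 2))
    return energies

# 1 second of a 220 Hz sine at 16 kHz as a stand-in for real speech
wav = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000).astype(np.float32)
energy = frame_energy(wav)
```

During training, ground-truth sequences like this condition the decoder; at inference time, a small predictor network supplies the values instead, since no reference audio exists.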

Related research:

- FastSpeech: Fast, Robust and Controllable Text to Speech (05/22/2019)
- Teacher-Student Training for Robust Tacotron-based TTS (11/07/2019)
- Revisiting Over-Smoothness in Text to Speech (02/26/2022)
- Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation (03/29/2022)
- Differentiable Duration Modeling for End-to-End Text-to-Speech (03/21/2022)
- ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech (07/19/2018)
- N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement (06/29/2021)
