EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

12/07/2020
by Chenfeng Miao, et al.

In this work, we address the Text-to-Speech (TTS) task by proposing a non-autoregressive architecture called EfficientTTS. Unlike the dominant non-autoregressive TTS models, which require external aligners for training, EfficientTTS optimizes all of its parameters with a stable, end-to-end training procedure while synthesizing high-quality speech quickly and efficiently. EfficientTTS is motivated by a new monotonic alignment modeling approach (also introduced in this work), which imposes monotonic constraints on the sequence alignment with almost no increase in computation. By combining EfficientTTS with different feed-forward network structures, we develop a family of TTS models, including both text-to-melspectrogram and text-to-waveform networks. We experimentally show that the proposed models significantly outperform counterpart models such as Tacotron 2 and Glow-TTS in terms of speech quality, training efficiency, and synthesis speed, while still producing speech with strong robustness and great diversity. In addition, we demonstrate that the proposed approach can be easily extended to autoregressive models such as Tacotron 2.
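
To make the notion of a monotonic alignment concrete, here is a minimal Python sketch of one generic way to test whether a soft alignment between output frames and input tokens moves forward through the text. It is an illustration under stated assumptions, not the method of the paper: the abstract does not give the formulation, and the names `expected_positions` and `is_monotonic`, the tolerance `tol`, and the toy matrices are all hypothetical.

```python
# Illustrative sketch only: a generic monotonicity check on a soft alignment.
# This is NOT the paper's implementation; every name below is hypothetical.


def expected_positions(alignment):
    """Return, for each output frame (row), the attention-weighted mean
    index of the input tokens (columns). Rows are assumed to sum to 1."""
    return [sum(j * w for j, w in enumerate(row)) for row in alignment]


def is_monotonic(alignment, tol=1e-9):
    """Return True if the expected input position never decreases
    from one output frame to the next (up to a small tolerance)."""
    pos = expected_positions(alignment)
    return all(b - a >= -tol for a, b in zip(pos, pos[1:]))


# Toy alignments: 4 output frames attending over 3 input tokens.
monotonic = [
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.1, 0.9],
]
non_monotonic = [
    [0.0, 0.1, 0.9],  # starts at the end of the text...
    [0.9, 0.1, 0.0],  # ...then jumps back to the beginning
    [0.0, 0.8, 0.2],
    [0.5, 0.5, 0.0],
]

print(is_monotonic(monotonic))      # True
print(is_monotonic(non_monotonic))  # False
```

A check like this only diagnoses an alignment after the fact. A model that enforces monotonicity during training, as the abstract describes, would need the constraint built into how the alignment is computed.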


research
05/22/2020

Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet hav...
research
07/07/2021

VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis

This paper describes a variational auto-encoder based non-autoregressive...
research
02/08/2023

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

Recent Text-to-Speech (TTS) systems trained on reading or acted corpora ...
research
08/30/2021

Neural HMMs are all you need (for high-quality attention-free TTS)

Neural sequence-to-sequence TTS has achieved significantly better output...
research
08/23/2021

One TTS Alignment To Rule Them All

Speech-to-text alignment is a critical component of neural text-to-speech...
research
04/28/2022

Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss

Recent deep learning Text-to-Speech (TTS) systems have achieved impressi...
research
05/21/2019

Parallel Neural Text-to-Speech

In this work, we propose a non-autoregressive seq2seq model that convert...
