Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework

11/07/2019
by Mingbo Ma, et al.

Text-to-speech synthesis (TTS) has witnessed rapid progress in recent years, with neural methods now capable of producing audio with near human-level naturalness. However, these systems still suffer from two types of latency: (a) computational latency (synthesis time), which grows linearly with sentence length even with parallel approaches, and (b) input latency in scenarios where the input text is generated incrementally (such as simultaneous translation, dialog generation, and assistive technologies). To reduce these latencies, we devise the first neural incremental TTS approach, based on the recently proposed prefix-to-prefix framework. We synthesize speech in an online fashion, playing one segment of audio while generating the next, which results in O(1) rather than O(n) latency. Experiments on English TTS show that our approach achieves speech naturalness similar to full-sentence methods while using only a fraction of the time and incurring a constant latency of 1-2 words.
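
The scheduling idea in the abstract (emit audio for a word once a small, fixed number of following words has arrived) can be illustrated with a minimal Python sketch. It is not the authors' model: `incremental_synthesize`, `fake_synthesize`, and the `lookahead` parameter are hypothetical names introduced here, and the stub synthesizer returns strings instead of audio. The sketch only shows why the wait before the first segment depends on the lookahead rather than on the sentence length.

```python
from typing import Callable, Iterable, Iterator, List


def incremental_synthesize(
    words: Iterable[str],
    synthesize_word: Callable[[List[str], int], str],
    lookahead: int = 1,
) -> Iterator[str]:
    """Yield one segment per word, waiting for at most `lookahead` future words."""
    prefix: List[str] = []
    emitted = 0
    for word in words:
        prefix.append(word)
        # Word number `emitted` is ready once `lookahead` later words are visible.
        while len(prefix) - emitted > lookahead:
            yield synthesize_word(prefix, emitted)
            emitted += 1
    # The input has ended: flush the remaining words with the context available.
    while emitted < len(prefix):
        yield synthesize_word(prefix, emitted)
        emitted += 1


def fake_synthesize(prefix: List[str], index: int) -> str:
    # Hypothetical stand-in for an acoustic model plus vocoder.
    # It records how many input words were visible when the segment was produced.
    return f"audio({prefix[index]}|seen={len(prefix)})"


segments = list(
    incremental_synthesize("the quick brown fox".split(), fake_synthesize, lookahead=1)
)
for segment in segments:
    print(segment)
```

With `lookahead=1`, the first segment is produced after only two input words have been seen, whatever the sentence length. A full-sentence system would have to wait for all n words before producing any audio.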


