Median-Based Generation of Synthetic Speech Durations using a Non-Parametric Approach

08/22/2016
by Srikanth Ronanki, et al.

This paper proposes a new approach to duration modelling for statistical parametric speech synthesis in which a recurrent statistical model is trained to output a phone transition probability at each timestep (acoustic frame). Unlike conventional approaches to duration modelling -- which assume that duration distributions have a particular form (e.g., a Gaussian) and use the mean of that distribution for synthesis -- our approach can in principle model any distribution supported on the non-negative integers. Generation from this model can be performed in many ways; here we consider output generation based on the median predicted duration. The median is more typical (more probable) than the conventional mean duration, is robust to training-data irregularities, and enables incremental generation. Furthermore, a frame-level approach to duration prediction is consistent with a longer-term goal of modelling durations and acoustic features together. Results indicate that the proposed method is competitive with baseline approaches in approximating the median duration of held-out natural speech.
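As an illustration of median-based generation, the sketch below assumes the model emits, at each frame t, a probability p_t of leaving the current phone given that it has not yet been left. Under that assumption, the probability of still being in the phone after t frames is the running product of (1 - p_t), and the median duration is the first frame at which that product falls to 0.5 or below. The function name `median_duration` and the `max_frames` safety cap are hypothetical, chosen for this example; this is not the authors' implementation.

```python
from typing import Iterable


def median_duration(transition_probs: Iterable[float], max_frames: int = 1000) -> int:
    """Return the median duration (in frames) implied by per-frame
    transition probabilities.

    Assumption: transition_probs[t] is the probability of leaving the
    phone at frame t + 1, given that it has not been left earlier.
    max_frames is a hypothetical safety cap, not taken from the paper.
    """
    survival = 1.0  # probability that the phone is still ongoing
    frames = 0
    for p in transition_probs:
        frames += 1
        survival *= 1.0 - p
        # The median is the smallest duration whose cumulative
        # probability reaches 0.5, i.e. survival drops to 0.5 or below.
        if survival <= 0.5 or frames >= max_frames:
            return frames
    # Predictions ran out before the median was reached.
    return frames


if __name__ == "__main__":
    # Survival after each frame: 0.9, 0.72, 0.36, so the median is 3 frames.
    print(median_duration([0.1, 0.2, 0.5, 0.9]))
```

Because the loop stops as soon as the survival probability crosses 0.5, the input can be a generator that yields one prediction per frame. This matches the incremental generation described in the abstract: no prediction beyond the median frame is needed.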


research
06/28/2022

Expressive, Variable, and Controllable Duration Modelling in TTS

Duration modelling has become an important research problem once more wi...
research
04/16/2021

TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction

We propose TalkNet, a non-autoregressive convolutional neural model for ...
research
08/30/2023

The DeepZen Speech Synthesis System for Blizzard Challenge 2023

This paper describes the DeepZen text to speech (TTS) system for Blizzar...
research
03/21/2022

Differentiable Duration Modeling for End-to-End Text-to-Speech

Parallel text-to-speech (TTS) models have recently enabled fast and high...
research
10/08/2020

Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-...
research
11/17/2017

Modelling dark current and hot pixels in imaging sensors

A gaussian mixture model was fitted to experimental data recorded under ...
research
04/19/2020

Consonant gemination in Italian: the affricate and fricative case

Consonant gemination in Italian affricates and fricatives was investigat...
