Restoring degraded speech via a modified diffusion model

04/22/2021
by Jianwei Zhang et al.

There are many deterministic mathematical operations (e.g. compression, clipping, downsampling) that can degrade speech quality considerably. In this paper we introduce a neural network architecture, based on a modification of the DiffWave model, that aims to restore the original speech signal. DiffWave, a recently published diffusion-based vocoder, has demonstrated state-of-the-art synthesized speech quality and relatively short waveform generation times, with only a small number of parameters. We replace the mel-spectrum upsampler in DiffWave with a deep CNN upsampler, which is trained to alter the degraded speech mel-spectrum to match that of the original speech. The model is trained on the original speech waveform, but conditioned on the degraded speech mel-spectrum. After training, only the degraded mel-spectrum is used as input, and the model generates an estimate of the original speech. Our model improves speech quality (with the original DiffWave model as baseline) in several experiments, including restoring speech degraded by LPC-10 compression, AMR-NB compression, and signal clipping. Compared to the original DiffWave architecture, our scheme achieves better performance on several objective perceptual metrics and in subjective comparisons. Improvements over the baseline are further amplified in an out-of-corpus evaluation setting.
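To make the architectural change concrete, the sketch below illustrates the kind of deep CNN upsampler the abstract describes: a small stack of 1-D convolutions operating on a degraded mel-spectrogram at frame rate, followed by time-upsampling to the waveform hop rate, producing the conditioning signal fed to the diffusion vocoder. This is a minimal, hypothetical illustration in plain numpy, not the authors' implementation: the layer count, channel widths, kernel size, hop factor, and the random (untrained) weights are all assumptions made for shape clarity. In the paper's setup these weights would be learned so that the output matches the clean-speech conditioner.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_same(x, w):
    """'Same'-padded 1-D convolution along time.
    x: (C_in, T), w: (C_out, C_in, K) -> (C_out, T)."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    T = x.shape[1]
    y = np.zeros((c_out, T))
    for o in range(c_out):
        for t in range(T):
            y[o, t] = np.sum(xp[:, t:t + k] * w[o])
    return y

def cnn_upsampler(mel_degraded, hop=16):
    """Hypothetical deep CNN upsampler sketch (not the paper's exact network):
    maps a degraded mel-spectrogram (n_mels, frames) to a waveform-rate
    conditioning signal (n_mels, frames * hop). Weights here are random
    stand-ins for learned parameters."""
    n_mels, _ = mel_degraded.shape
    # two 'same'-padded conv layers with a ReLU in between, at frame rate
    h = np.maximum(conv1d_same(mel_degraded,
                               rng.standard_normal((32, n_mels, 3)) * 0.1), 0.0)
    h = conv1d_same(h, rng.standard_normal((n_mels, 32, 3)) * 0.1)
    # nearest-neighbour upsampling in time to the waveform hop rate
    return np.repeat(h, hop, axis=1)

mel = rng.standard_normal((8, 4))   # toy degraded mel-spectrogram: 8 bands, 4 frames
cond = cnn_upsampler(mel)           # conditioner with shape (8, 64)
print(cond.shape)
```

The key point of the modification is visible in the shapes: the upsampler consumes the *degraded* mel-spectrum but, once trained against clean speech, emits a conditioner that steers DiffWave toward the original signal.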
