WHALETRANS: E2E WHisper to nAturaL spEech conversion using modified TRANSformer network

04/20/2020
by Abhishek Niranjan, et al.

In this article, we investigate a whispered-to-natural speech conversion method using a sequence-to-sequence generation approach, proposing a modified transformer architecture. We investigate different kinds of features, such as mel frequency cepstral coefficients (MFCCs) and smoothed spectral features. The network is trained end-to-end (E2E) in a supervised manner. We investigate the effectiveness of an embedded auxiliary decoder placed after N encoder sub-layers and trained with a frame-level objective function for identifying source phoneme labels. We predict target audio features and generate audio from them for testing. We evaluate on the standard wTIMIT and CHAINS datasets. We report results as word-error-rate (WER) computed with an automatic speech recognition (ASR) system, as well as BLEU scores. In addition, we measure the spectral shape of the output speech signal by comparing its frame-level formant distributions with those of the reference speech signal. In this respect, we find that the formant probability distribution of the whispered-to-natural converted speech is closer to the ground-truth distribution. To the authors' best knowledge, this is the first time a transformer with an auxiliary decoder has been applied to whispered-to-natural speech conversion. [This pdf is TASLP submission draft version 1.0, 14th April 2020.]
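The training setup described above combines a main objective on the predicted target audio features with a frame-level phoneme-classification objective from the auxiliary decoder. The sketch below illustrates one plausible form of such a combined loss; the shapes, the MSE choice for the spectral term, and the weighting factor `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T frames, D spectral feature dims, P phoneme classes.
T, D, P = 50, 80, 40

# Main decoder output vs. reference target spectral features (e.g. MFCCs).
pred_spec = rng.standard_normal((T, D))
ref_spec = rng.standard_normal((T, D))

# Auxiliary-decoder logits over source phoneme labels, one per frame.
aux_logits = rng.standard_normal((T, P))
phone_labels = rng.integers(0, P, size=T)

def mse(a, b):
    # Mean squared error over all frames and feature dimensions.
    return float(np.mean((a - b) ** 2))

def frame_cross_entropy(logits, labels):
    # Softmax cross-entropy averaged over frames (numerically stabilized).
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

lam = 0.5  # assumed weighting between the two terms
loss = mse(pred_spec, ref_spec) + lam * frame_cross_entropy(aux_logits, phone_labels)
```

In this kind of multi-task setup, the auxiliary frame-level phoneme loss acts as a regularizer on the encoder representations, while the spectral term drives the actual feature prediction used for waveform generation.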

