DeepFry: Identifying Vocal Fry Using Deep Neural Networks

03/31/2022
by   Bronya R. Chernyak, et al.
0

Vocal fry or creaky voice refers to a voice quality characterized by irregular glottal opening and low pitch. It occurs in diverse languages and is prevalent in American English, where it is used not only to mark phrase finality, but also sociolinguistic factors and affect. Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems, particularly for languages where creak is frequently used. This paper proposes a deep learning model to detect creaky voice in fluent speech. The model is composed of an encoder and a classifier trained together. The encoder takes the raw waveform and learns a representation using a convolutional neural network. The classifier is implemented as a multi-headed fully-connected network trained to detect creaky voice, voicing, and pitch, where the last two are used to refine creak prediction. The model is trained and tested on speech of American English speakers, annotated for creak by trained phoneticians. We evaluated the performance of our system using two encoders: one is tailored for the task, and the other is based on a state-of-the-art unsupervised representation. Results suggest our best-performing system has improved recall and F1 scores compared to previous methods on unseen data.

READ FULL TEXT
research
04/23/2021

Deep Learning Based Assessment of Synthetic Speech Naturalness

In this paper, we present a new objective prediction model for synthetic...
research
04/18/2019

TTS Skins: Speaker Conversion via ASR

We present a fully convolutional wav-to-wav network for converting betwe...
research
09/28/2022

MeWEHV: Mel and Wave Embeddings for Human Voice Tasks

A recent trend in speech processing is the use of embeddings created thr...
research
05/07/2020

Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data

We propose Cotatron, a transcription-guided speech encoder for speaker-i...
research
11/26/2020

Towards Interpretable Multilingual Detection of Hate Speech against Immigrants and Women in Twitter at SemEval-2019 Task 5

his paper describes our techniques to detect hate speech against women a...
research
09/06/2021

Complementing Handcrafted Features with Raw Waveform Using a Light-weight Auxiliary Model

An emerging trend in audio processing is capturing low-level speech repr...
research
04/07/2022

Correcting Misproducted Speech using Spectrogram Inpainting

Learning a new language involves constantly comparing speech productions...

Please sign up or login with your details

Forgot password? Click here to reset