SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces

07/23/2023
by   Iván Vallés-Pérez, et al.

Numerous examples in the literature have shown that deep learning models can work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions, with outstanding zero- or few-shot results in downstream tasks. In this paper we explore the same idea proposed by CLIP, applied to the speech domain, where the phonetic and acoustic spaces usually coexist. We train a CLIP-based model with the aim of learning shared representations of phonetic and acoustic spaces. The results show that the proposed model is sensitive to phonetic changes, with a 91% score drop when 20% of the phonemes are replaced at random, while providing substantial robustness against different kinds of noise, with a 10% performance drop when the audio is mixed with 75% Gaussian noise. We also provide empirical evidence showing that the resulting embeddings are useful for a variety of downstream applications, such as intelligibility evaluation and the ability to leverage rich pre-trained phonetic embeddings in speech generation tasks. Finally, we discuss potential applications with interesting implications for the speech generation and recognition fields.
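To make the CLIP-style objective more concrete, the following is a minimal sketch of a symmetric contrastive loss over paired acoustic and phonetic embeddings. It is not taken from the paper: the function names, the toy two-dimensional vectors, and the temperature value of 0.07 are all assumptions chosen for illustration, and the encoders that would produce the embeddings are omitted.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def clip_style_loss(acoustic, phonetic, temperature=0.07):
    """Symmetric cross-entropy over the pairwise similarity matrix.

    acoustic[i] and phonetic[i] are assumed to describe the same utterance.
    The temperature is a hypothetical fixed value, not one from the paper.
    """
    a = [l2_normalize(v) for v in acoustic]
    p = [l2_normalize(v) for v in phonetic]
    n = len(a)
    # logits[i][j]: scaled cosine similarity between acoustic i and phonetic j
    logits = [
        [sum(x * y for x, y in zip(a[i], p[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Acoustic -> phonetic direction: row i should select column i.
    loss_a2p = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Phonetic -> acoustic direction: column j should select row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_p2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_a2p + loss_p2a)


# Toy batch of two utterances (made-up embeddings, for illustration only).
toy_acoustic = [[1.0, 0.0], [0.0, 1.0]]
toy_phonetic = [[2.0, 0.1], [0.1, 3.0]]
swapped_phonetic = [toy_phonetic[1], toy_phonetic[0]]

loss_matched = clip_style_loss(toy_acoustic, toy_phonetic)
loss_mismatched = clip_style_loss(toy_acoustic, swapped_phonetic)
```

In this toy batch the correctly paired embeddings give a lower loss than the swapped ones. This is the property such an objective relies on: matching acoustic and phonetic inputs are pulled together in the shared space and mismatched ones are pushed apart.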


research
05/23/2023

On the Transferability of Whisper-based Representations for "In-the-Wild" Cross-Task Downstream Speech Applications

Large self-supervised pre-trained speech models have achieved remarkable...
research
09/01/2023

Learning Speech Representation From Contrastive Token-Acoustic Pretraining

For fine-grained generation and recognition tasks such as minimally-supe...
research
03/03/2023

Pre-trained Model Representations and their Robustness against Noise for Speech Emotion Analysis

Pre-trained model representations have demonstrated state-of-the-art per...
research
04/03/2020

Analyzing autoencoder-based acoustic word embeddings

Recent studies have introduced methods for learning acoustic word embedd...
research
02/22/2023

Contrastive Representation Learning for Acoustic Parameter Estimation

A study is presented in which a contrastive learning approach is used to...
research
10/25/2022

Improving Speech Representation Learning via Speech-level and Phoneme-level Masking Approach

Recovering the masked speech frames is widely applied in speech represen...
