The Ability of Self-Supervised Speech Models for Audio Representations

by   Tung-yu Wu, et al.

Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning, but some questions regarding their representation ability remain unanswered. This paper addresses two of them: (1) Can SSL speech models deal with non-speech audio?; (2) Would different SSL speech models have insights into diverse aspects of audio features? To answer the two questions, we conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of currently state-of-the-art SSL speech models, which are wav2vec 2.0 and HuBERT in this paper. These experiments are carried out during NeurIPS 2021 HEAR Challenge as a standard evaluation pipeline provided by competition officials. Results show that (1) SSL speech models could extract meaningful features of a wide range of non-speech audio, while they may also fail on certain types of datasets; (2) different SSL speech models have insights into different aspects of audio features. The two conclusions provide a foundation for the ensemble of representation models. We further propose an ensemble framework to fuse speech representation models' embeddings. Our framework outperforms state-of-the-art SSL speech/audio models and has generally superior performance on abundant datasets compared with other teams in HEAR Challenge. Our code is available at – NTU-GURA.


page 1

page 2

page 3

page 4


Audio Barlow Twins: Self-Supervised Audio Representation Learning

The Barlow Twins self-supervised learning objective requires neither neg...

Self-paced ensemble learning for speech and audio classification

Combining multiple machine learning models into an ensemble is known to ...

BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping

Methods for extracting audio and speech features have been studied since...

SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning

We present the SUPERB challenge at SLT 2022, which aims at learning self...

Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

Self-supervised learning general-purpose audio representations have demo...

A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS

Recent work has explored using self-supervised learning (SSL) speech rep...

What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure

In recent times, BERT based transformer models have become an inseparabl...

Please sign up or login with your details

Forgot password? Click here to reset