Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation

06/15/2023
by Ziyang Ma, et al.

Self-supervised learning (SSL) speech foundation models have drawn significant attention for their excellent generalization ability. HuBERT is a successful example that uses offline clustering to convert speech features into discrete units for a masked language modeling pretext task. However, simply using k-means cluster assignments as targets does not fully exploit the model's potential. In this work, we present an unsupervised method for improving SSL targets. We propose two models, MonoBERT and PolyBERT, which leverage context-independent and context-dependent phoneme-based units, respectively, for pre-training. On the LibriSpeech benchmark, our models significantly outperform other SSL models without requiring iterative re-clustering and re-training. Moreover, our models with context-dependent units even outperform target-improvement models that use labeled data during pre-training. Experiments demonstrate how we progressively improve the unit discovery process.
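The abstract's starting point, HuBERT-style offline clustering, can be illustrated with a minimal sketch: frame-level features are clustered with k-means, and each frame's cluster ID becomes a discrete target for masked prediction. This is not the paper's code; the function name, toy data, and initialization scheme below are all hypothetical, with random vectors standing in for MFCC or intermediate SSL features.

```python
import numpy as np

def kmeans_units(features, k, iters=20):
    """Cluster frame features (T x D) into k units; return per-frame unit IDs."""
    # Deterministic init (an assumption for illustration): spread the
    # initial centroids over the frame sequence.
    idx = np.linspace(0, len(features) - 1, k).astype(int)
    centroids = features[idx].copy()
    for _ in range(iters):
        # Assign each frame to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        units = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned frames.
        for c in range(k):
            if np.any(units == c):
                centroids[c] = features[units == c].mean(axis=0)
    return units

# Toy "features": 200 frames of 13-dim vectors from two well-separated modes,
# standing in for real acoustic features.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0.0, 1.0, (100, 13)),
                        rng.normal(8.0, 1.0, (100, 13))])
units = kmeans_units(feats, k=2)  # each frame now has a discrete target unit
```

The paper's argument is that such cluster IDs are a crude target; MonoBERT and PolyBERT replace them with phoneme-based units discovered without labels.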


Related research

01/02/2023 · Analysing Discrete Self Supervised Speech Representation for Spoken Language Modeling
This work profoundly analyzes discrete self-supervised speech representa...

11/14/2022 · MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets
In this paper, we provide a new perspective on self-supervised speech mo...

09/30/2022 · SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data
How to boost speech pre-training with textual data is an unsolved proble...

06/14/2021 · HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Self-supervised approaches for speech representation learning are challe...

10/12/2019 · vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
We propose vq-wav2vec to learn discrete representations of audio segment...

09/07/2021 · Text-Free Prosody-Aware Generative Spoken Language Modeling
Speech pre-training has primarily demonstrated efficacy on classificatio...

10/19/2022 · Self-supervised Heterogeneous Graph Pre-training Based on Structural Clustering
Recent self-supervised pre-training methods on Heterogeneous Information...
