Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications

10/01/2022
by Bastiaan Tamm, et al.

Speech quality in online conferencing applications is typically assessed through human judgements in the form of the mean opinion score (MOS) metric. Since such a labor-intensive approach is not feasible for large-scale speech quality assessments in most settings, the focus has shifted towards automated MOS prediction through end-to-end training of deep neural networks (DNNs). Instead of training a network from scratch, we propose to leverage the speech representations from the pre-trained wav2vec-based XLS-R model. However, the number of parameters of such a model exceeds that of task-specific DNNs by several orders of magnitude, which poses a challenge for fine-tuning on smaller datasets. Therefore, we opt to use the pre-trained speech representations from XLS-R in a feature extraction rather than a fine-tuning setting, thereby significantly reducing the number of trainable model parameters. We compare our proposed XLS-R-based feature extractor to a Mel-frequency cepstral coefficient (MFCC)-based one, and experiment with various combinations of bidirectional long short-term memory (Bi-LSTM) and attention pooling feedforward (AttPoolFF) networks trained on the output of the feature extractors. We demonstrate the increased performance of pre-trained XLS-R embeddings in terms of a reduced root mean squared error (RMSE) on the ConferencingSpeech 2022 MOS prediction task.
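The pipeline described above, a frozen XLS-R front end feeding a trainable Bi-LSTM with an attention-pooling regression head, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the checkpoint name (facebook/wav2vec2-xls-r-300m), the use of the final hidden states, and all head dimensions are illustrative choices made for this example.

```python
# Minimal sketch (not the authors' exact configuration): frozen XLS-R
# embeddings feed a small Bi-LSTM + attention-pooling regressor that
# predicts a single MOS value per utterance.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model


class XLSRFrontEnd(nn.Module):
    """Frozen XLS-R front end: raw 16 kHz waveform -> sequence of embeddings."""

    def __init__(self, model_name: str = "facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        self.processor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
        self.encoder = Wav2Vec2Model.from_pretrained(model_name)
        self.encoder.eval()
        for p in self.encoder.parameters():  # feature-extraction setting:
            p.requires_grad = False          # XLS-R weights stay frozen

    @torch.no_grad()
    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        inputs = self.processor(
            [w.numpy() for w in waveform],
            sampling_rate=16_000,
            return_tensors="pt",
            padding=True,
        )
        # (batch, frames, 1024) for the 300M XLS-R checkpoint
        return self.encoder(inputs.input_values).last_hidden_state


class MOSHead(nn.Module):
    """Trainable Bi-LSTM + attention-pooling feedforward MOS regressor."""

    def __init__(self, feat_dim: int = 1024, hidden: int = 128):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)  # per-frame attention scores
        self.out = nn.Sequential(nn.Linear(2 * hidden, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        h, _ = self.bilstm(feats)              # (batch, frames, 2*hidden)
        w = torch.softmax(self.att(h), dim=1)  # attention pooling over time
        pooled = (w * h).sum(dim=1)            # (batch, 2*hidden)
        return self.out(pooled).squeeze(-1)    # predicted MOS per utterance


if __name__ == "__main__":
    frontend, head = XLSRFrontEnd(), MOSHead()
    wav = torch.randn(2, 16_000 * 3)           # two 3-second dummy clips
    mos_pred = head(frontend(wav))
    loss = nn.MSELoss()(mos_pred, torch.tensor([3.2, 4.1]))  # RMSE = sqrt(MSE)
    print(mos_pred.shape, loss.item())
```

For an MFCC baseline along the lines of the comparison above, the frozen front end could be swapped for a standard MFCC transform (e.g. torchaudio.transforms.MFCC), leaving the downstream Bi-LSTM/AttPoolFF head unchanged.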


