Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features

by Ryandhimas E. Zezario, et al.

In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. More specifically, MOSA-Net is designed to estimate the speech quality, intelligibility, and distortion assessment scores of an input test speech signal. It comprises a convolutional neural network and bidirectional long short-term memory (CNN-BLSTM) architecture for representation extraction, followed by a multiplicative attention layer and a fully-connected layer for each assessment metric. In addition, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learned models are used as inputs, combining rich acoustic information from different speech representations to obtain more accurate assessments. Experimental results show that MOSA-Net can precisely predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI) scores when tested on noisy and enhanced speech utterances under both seen and unseen test conditions. Moreover, MOSA-Net, although originally trained to assess objective scores, can serve as a pre-trained model that is effectively adapted, with a limited amount of training data, into an assessment model for predicting subjective quality and intelligibility scores. In light of this confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of both objective evaluation metrics and a qualitative evaluation test.
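
The multi-objective structure described in the abstract (one shared representation feeding a separate multiplicative attention layer and fully-connected layer per metric) can be illustrated with a minimal sketch. This is not the authors' implementation: the random frames stand in for CNN-BLSTM outputs, the weights are untrained, the Luong-style score `h_i^T W h_j` and the averaging of frame-level scores into one utterance-level score are assumptions, and all names (`make_head`, `head_forward`, `assess`) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multiplicative_attention(frames, W):
    """Self-attention with multiplicative scores: score(i, j) = h_i^T W h_j.

    Assumption: the abstract names the layer type only; this scoring
    form is one common choice, not confirmed by the source.
    """
    projected = [matvec(W, h) for h in frames]  # W h_j for every frame j
    dim = len(frames[0])
    context = []
    for h_i in frames:
        scores = [sum(a * b for a, b in zip(h_i, p)) for p in projected]
        weights = softmax(scores)
        # Weighted sum of all frames gives the context vector for frame i.
        context.append(
            [sum(w * h[d] for w, h in zip(weights, frames)) for d in range(dim)]
        )
    return context


def make_head(dim, rng):
    """Hypothetical per-metric head: attention matrix plus one linear layer."""
    return {
        "W_att": [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)],
        "w_fc": [rng.uniform(-0.1, 0.1) for _ in range(dim)],
        "b_fc": 0.0,
    }


def head_forward(frames, head):
    """Return (utterance_score, frame_scores) for one metric."""
    attended = multiplicative_attention(frames, head["W_att"])
    frame_scores = [
        sum(w * c for w, c in zip(head["w_fc"], ctx)) + head["b_fc"]
        for ctx in attended
    ]
    # Assumption: the utterance-level score is the mean of frame-level scores.
    return sum(frame_scores) / len(frame_scores), frame_scores


def assess(frames, heads):
    """Run every metric head on the same shared frame representations."""
    return {name: head_forward(frames, head)[0] for name, head in heads.items()}


rng = random.Random(0)
DIM = 4
# Stand-in for the shared CNN-BLSTM output: 6 frames of DIM features each.
shared_frames = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(6)]
heads = {name: make_head(DIM, rng) for name in ("PESQ", "STOI", "SDI")}
scores = assess(shared_frames, heads)
print(scores)
```

Because every head reads the same `shared_frames`, a single forward pass through the shared layers yields all three estimates, which is the sense in which the model is multi-objective. With untrained weights the printed values are meaningless; they only show the data flow.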




STOI-Net: A Deep Learning based Non-Intrusive Speech Intelligibility Assessment Model

The calculation of most objective speech intelligibility assessment metr...

HASA-net: A non-intrusive hearing-aid speech assessment network

Without the need of a clean reference, non-intrusive speech assessment m...

Investigating Cross-Domain Losses for Speech Enhancement

Recent years have seen a surge in the number of available frameworks for...

MTI-Net: A Multi-Target Speech Intelligibility Prediction Model

Recently, deep learning (DL)-based non-intrusive speech assessment model...

Improving Perceptual Quality by Phone-Fortified Perceptual Loss for Speech Enhancement

Speech enhancement (SE) aims to improve speech quality and intelligibili...

Speech Enhancement with Zero-Shot Model Selection

Recent research on speech enhancement (SE) has seen the emergence of dee...

MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids

Improving the user's hearing ability to understand speech in noisy envir...
