NAViDAd: A No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder

by Helard Martinez, et al.

The development of models for quality prediction of both audio and video signals is a fairly mature field. However, although several multimodal models have been proposed, audio-visual quality prediction is still an emerging area. In fact, despite the reasonable performance obtained by combination and parametric metrics, there is currently no reliable pixel-based audio-visual quality metric. The approach presented in this work is based on the assumption that autoencoders, fed with descriptive audio and video features, might produce a set of features that is able to describe the complex interactions between audio and video. Based on this hypothesis, we propose a No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd). The model's visual features are natural scene statistics (NSS) and spatial-temporal measures of the video component, while the audio features are obtained by computing the spectrogram representation of the audio component. The model is formed by a 2-layer framework that includes a deep autoencoder layer and a classification layer. These two layers are stacked and trained to build the deep neural network model. The model is trained and tested using a large set of stimuli containing representative audio and video artifacts. The model performed well when tested against the UnB-AV and the LiveNetflix-II databases, which suggests that this type of approach produces quality scores that are highly correlated with subjective quality scores.
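As a rough illustration of the pipeline described above (audio spectrogram plus visual features, fed to an autoencoder layer stacked with a classification layer), the following is a minimal NumPy sketch of the forward pass only. It is not the authors' implementation. The frame length, hop size, hidden size, number of quality classes, the random stand-in for the NSS and spatial-temporal features, and the mapping from class probabilities to a score are all assumptions made for illustration, and the weights are untrained.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed short-time FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len // 2 + 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class StackedModel:
    """Hypothetical stack: autoencoder (encoder/decoder) plus a softmax layer."""

    def __init__(self, n_in, n_hidden, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly initialised, untrained weights (illustration only).
        self.W_enc = rng.normal(0, 0.1, (n_in, n_hidden))
        self.b_enc = np.zeros(n_hidden)
        self.W_dec = rng.normal(0, 0.1, (n_hidden, n_in))
        self.b_dec = np.zeros(n_in)
        self.W_cls = rng.normal(0, 0.1, (n_hidden, n_classes))
        self.b_cls = np.zeros(n_classes)

    def encode(self, x):
        return sigmoid(x @ self.W_enc + self.b_enc)

    def reconstruct(self, x):
        # Decoder output, which autoencoder training would compare with x.
        return self.encode(x) @ self.W_dec + self.b_dec

    def predict_proba(self, x):
        # Classification layer stacked on the encoded features.
        return softmax(self.encode(x) @ self.W_cls + self.b_cls)

rng = np.random.default_rng(1)
audio = rng.normal(size=2048)                       # placeholder audio samples
spec = spectrogram(audio)                           # audio features per frame
video_feats = rng.normal(size=(spec.shape[0], 20))  # stand-in for NSS features
x = np.hstack([video_feats, np.log1p(spec)])        # joint audio-visual input

model = StackedModel(n_in=x.shape[1], n_hidden=32, n_classes=5)
probs = model.predict_proba(x)
# Assumption: classes are quality levels 1..5 and the score is their expectation.
score = probs @ np.arange(1, 6)
```

In a trained version of such a sketch, the autoencoder weights would first be fitted to minimise the reconstruction error, and the stacked network would then be trained against subjective quality scores. Both steps are omitted here.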




