AViNet: Diving Deep into Audio-Visual Saliency Prediction

12/11/2020
by Samyak Jain, et al.

We propose the AViNet architecture for audiovisual saliency prediction. AViNet is a fully convolutional encoder-decoder architecture. The encoder combines visual features learned for action recognition with audio embeddings learned via an aural network designed to classify objects and scenes. The decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining hierarchical features. The overall architecture is conceptually simple, causal, and runs in real time (60 fps). AViNet outperforms the state of the art on ten datasets (seven audiovisual and three visual-only) and surpasses human performance on the CC, SIM, and AUC metrics for the AVE dataset. Visual features account for most of the saliency on existing datasets, with audio contributing only minor gains except in specific contexts such as social events. Our work therefore motivates the need to curate saliency datasets that reflect real life, where the visual and aural modalities drive saliency in a complementary way. Our code and pre-trained models are available at https://github.com/samyak0210/VideoSaliency
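For readers unfamiliar with the CC and SIM metrics named in the abstract, the following is a minimal sketch of their commonly used definitions: CC as the Pearson correlation between a predicted and a ground-truth saliency map, and SIM as the histogram intersection of the two maps after each is normalised to sum to 1. It is not taken from the authors' repository. The function names `cc` and `sim` and the nested-list map representation are assumptions made for illustration, and some evaluation toolkits apply additional preprocessing (for example, min-max scaling) that is omitted here.

```python
import math


def _flatten(sal_map):
    # Assumption: a saliency map is a 2D nested list of non-negative numbers.
    return [float(v) for row in sal_map for v in row]


def cc(pred, gt):
    """Pearson linear correlation coefficient between two saliency maps."""
    p, g = _flatten(pred), _flatten(gt)
    mean_p, mean_g = sum(p) / len(p), sum(g) / len(g)
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_g = sum((b - mean_g) ** 2 for b in g)
    denom = math.sqrt(var_p * var_g)
    # A constant map has zero variance; return 0.0 rather than divide by zero.
    return cov / denom if denom > 0 else 0.0


def sim(pred, gt):
    """Histogram intersection of two maps, each normalised to sum to 1."""
    p, g = _flatten(pred), _flatten(gt)
    sum_p, sum_g = sum(p), sum(g)
    if sum_p <= 0 or sum_g <= 0:
        return 0.0
    return sum(min(a / sum_p, b / sum_g) for a, b in zip(p, g))


if __name__ == "__main__":
    ground_truth = [[0, 1], [2, 3]]   # toy 2x2 maps, purely illustrative
    prediction = [[3, 2], [1, 0]]
    print("CC :", cc(prediction, ground_truth))
    print("SIM:", sim(prediction, ground_truth))
```

Under these definitions, two identical maps score approximately 1 on both metrics, while CC can go as low as -1 for anti-correlated maps and SIM is bounded between 0 and 1.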


research
03/10/2020

Tidying Deep Saliency Prediction Architectures

Learning computational models for visual attention (saliency estimation)...
research
05/25/2019

DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction

This paper presents a conceptually simple and effective Deep Audio-Visua...
research
05/10/2021

Temporal-Spatial Feature Pyramid for Video Saliency Detection

In this paper, we propose a 3D fully convolutional encoder-decoder archi...
research
08/25/2020

FastSal: a Computationally Efficient Network for Visual Saliency Prediction

This paper focuses on the problem of visual saliency prediction, predict...
research
10/27/2022

Predicting Visual Attention and Distraction During Visual Search Using Convolutional Neural Networks

Most studies in computational modeling of visual attention encompass tas...
research
02/18/2019

Contextual Encoder-Decoder Network for Visual Saliency Prediction

Predicting salient regions in natural images requires the detection of o...
research
03/11/2023

CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective

Incorporating the audio stream enables Video Saliency Prediction (VSP) t...
