V-SlowFast Network for Efficient Visual Sound Separation

09/18/2021
by   Lingyu Zhu, et al.

The objective of this paper is to perform visual sound separation: i) we study visual sound separation on spectrograms of different temporal resolutions; ii) we propose a new lightweight yet efficient three-stream framework, V-SlowFast, that operates on a Visual frame, a Slow spectrogram, and a Fast spectrogram. The Slow spectrogram captures coarse temporal resolution, while the Fast spectrogram captures fine-grained temporal resolution; iii) we introduce two contrastive objectives to encourage the network to learn discriminative visual features for separating sounds; iv) we propose an audio-visual global attention module for audio and visual feature fusion; v) the introduced V-SlowFast model outperforms the previous state of the art in single-frame based visual sound separation on small- and large-scale datasets: MUSIC-21, AVE, and VGG-Sound. We also propose a small V-SlowFast architecture variant, which achieves 74.2 [...] 81.4 [...]. Project page: https://ly-zhu.github.io/V-SlowFast
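To make the Slow/Fast distinction concrete, the following is a minimal sketch of how two spectrograms with different temporal resolutions could be derived from the same waveform. It is not the authors' pipeline: the abstract does not state how the two spectrograms are computed, so the helper `stft_magnitude`, the 16 kHz sample rate, the 1024-sample window, and the hop lengths of 128 and 512 are all assumptions chosen for illustration.

```python
import numpy as np

def stft_magnitude(signal, n_fft=1024, hop=256):
    """Hypothetical helper: magnitude spectrogram of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice overlapping windowed frames, one row per time step.
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Real FFT along the sample axis, then transpose to (frequency, time).
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Assumed test signal: one second of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)

# Assumed hop lengths: a small hop gives many time frames (fine resolution),
# a large hop gives few time frames (coarse resolution).
fast = stft_magnitude(audio, hop=128)
slow = stft_magnitude(audio, hop=512)
print("fast:", fast.shape, "slow:", slow.shape)
```

Both arrays share the same frequency axis and differ only in the number of time frames, which is the property the abstract attributes to the Fast and Slow streams.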
