AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation

05/03/2023
by Shentong Mo, et al.

The Segment Anything Model (SAM) has recently proven highly effective in visual segmentation tasks. However, little work has explored how SAM performs on audio-visual tasks such as visual sound localization and segmentation. In this work, we propose AV-SAM, a simple yet effective audio-visual localization and segmentation framework built on SAM that generates masks of the sounding objects corresponding to the audio. Specifically, AV-SAM applies pixel-wise audio-visual fusion between audio features and the visual features produced by SAM's pre-trained image encoder in order to aggregate cross-modal representations. The aggregated cross-modal features are then fed into the prompt encoder and mask decoder to generate the final audio-visual segmentation masks. We conduct extensive experiments on the Flickr-SoundNet and AVSBench datasets. The results demonstrate that AV-SAM achieves competitive performance on sounding object localization and segmentation.
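The pixel-wise fusion step described above can be illustrated with a minimal sketch. The abstract does not state which fusion operator AV-SAM uses, so the element-wise product below is an assumption, as are the function names `pixelwise_av_fusion` and `similarity_map`, the H x W x C list layout, and the toy feature values. It is not the authors' implementation and does not involve SAM's actual encoders or decoder.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # hypothetical layout: H x W x C


def pixelwise_av_fusion(visual: FeatureMap, audio: Vector) -> FeatureMap:
    """Fuse one audio embedding with every pixel of a visual feature map.

    Assumption: fusion is an element-wise product along the channel axis,
    with the same audio vector applied at each spatial location.
    """
    channels = len(audio)
    fused: FeatureMap = []
    for row in visual:
        fused_row = []
        for pixel in row:
            if len(pixel) != channels:
                raise ValueError("audio and visual channel counts differ")
            fused_row.append([v * a for v, a in zip(pixel, audio)])
        fused.append(fused_row)
    return fused


def similarity_map(fused: FeatureMap) -> List[List[float]]:
    """Sum fused channels per pixel, giving an audio-visual dot-product map."""
    return [[sum(pixel) for pixel in row] for row in fused]


# Toy 2 x 2 feature map with 3 channels (made-up values).
visual = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
]
audio = [2.0, 0.0, 1.0]

fused = pixelwise_av_fusion(visual, audio)
scores = similarity_map(fused)
```

In this toy setup, pixels whose visual features align with the audio embedding receive higher scores. In a pipeline like the one the abstract outlines, fused features of this kind would be passed on to the prompt encoder and mask decoder rather than thresholded directly.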

