Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

07/25/2023
by Jinxiang Liu, et al.

The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in video frames using audio cues. However, the performance of current fusion-based methods is limited by the small receptive field of convolution and by inadequate fusion of audio-visual features. To overcome these issues, we propose a novel Audio-aware query-enhanced TRansformer (AuTR). Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features. Furthermore, we devise an audio-aware query-enhanced transformer decoder that explicitly helps the model focus on segmenting the sounding objects pinpointed by the audio signal, while disregarding silent yet salient objects. Experimental results show that our method outperforms previous methods and generalizes better in multi-sound and open-set scenarios.
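
To make the idea of audio-conditioned queries more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that decoder queries are conditioned on an audio embedding by simple addition, that a single attention head without learned projections is used, and that mask logits are dot products between query outputs and per-pixel visual features. All names (`audio_aware_queries`, `cross_attend`, `mask_logits`) and all sizes are hypothetical, and the inputs are random placeholders rather than real features.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def audio_aware_queries(queries, audio_feat):
    # Assumption: queries are made "audio-aware" by adding the audio embedding.
    return [[q + a for q, a in zip(query, audio_feat)] for query in queries]


def cross_attend(queries, visual_tokens):
    # Single-head scaled dot-product attention; no learned projections (toy).
    scale = 1.0 / math.sqrt(len(visual_tokens[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, v) * scale for v in visual_tokens])
        out = [
            sum(w * v[d] for w, v in zip(weights, visual_tokens))
            for d in range(len(q))
        ]
        outputs.append(out)
    return outputs


def mask_logits(query_outputs, visual_tokens):
    # One logit per (query, pixel) pair: similarity of query output to pixel feature.
    return [[dot(q, v) for v in visual_tokens] for q in query_outputs]


random.seed(0)
dim, num_pixels, num_queries = 8, 16, 2  # hypothetical sizes
visual_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_pixels)]
audio_feat = [random.gauss(0, 1) for _ in range(dim)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

conditioned = audio_aware_queries(queries, audio_feat)
attended = cross_attend(conditioned, visual_tokens)
logits = mask_logits(attended, visual_tokens)
# Threshold the logits at zero to obtain one binary mask per query.
masks = [[1 if x > 0 else 0 for x in row] for row in logits]
```

In a trained model the addition, the attention, and the mask head would each involve learned parameters; the sketch only shows how an audio embedding could steer which visual tokens the queries attend to.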

research
07/03/2023

AVSegFormer: Audio-Visual Segmentation with Transformer

The combination of audio and vision has long been a topic of interest in...
research
06/01/2023

Speech inpainting: Context-based speech synthesis guided by video

Audio and visual modalities are inherently connected in speech signals: ...
research
09/13/2023

Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer

Never having seen an object and heard its sound simultaneously, can the ...
research
05/13/2021

Multi-target DoA Estimation with an Audio-visual Fusion Mechanism

Most of the prior studies in the spatial DoA domain focus on a single mo...
research
09/18/2023

CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

Audio-visual video segmentation (AVVS) aims to generate pixel-level maps...
research
07/02/2021

Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions

There have been many studies in robotics to improve the communication sk...
research
07/25/2021

Improving Robot Localisation by Ignoring Visual Distraction

Attention is an important component of modern deep learning. However, le...
