AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization

by Tanvir Mahmud, et al.

An audio-visual event (AVE) is denoted by the correspondence of the visual and auditory signals in a video segment. Precise localization of AVEs is very challenging since it demands effective multi-modal feature correspondence to ground both short- and long-range temporal interactions. Existing approaches struggle to capture these different scales of multi-modal interaction due to ineffective multi-modal training strategies. To overcome this limitation, we introduce AVE-CLIP, a novel framework that integrates AudioCLIP, pre-trained on large-scale audio-visual data, with a multi-window temporal transformer to effectively operate on different temporal scales of video frames. Our contributions are three-fold: (1) We introduce a multi-stage training framework to incorporate AudioCLIP, pre-trained with audio-image pairs, into the AVE localization task on video frames through contrastive fine-tuning, effective mean video feature extraction, and multi-scale training phases. (2) We propose a multi-domain attention mechanism that operates on both temporal and feature domains over varying timescales to fuse local and global feature variations. (3) We introduce a temporal refining scheme with event-guided attention, followed by a simple-yet-effective post-processing step, to handle significant variations of the background over diverse events. Our method achieves state-of-the-art performance on the publicly available AVE dataset with a 5.9% mean accuracy improvement over existing approaches.
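The multi-window idea can be pictured with a short PyTorch sketch: self-attention is run within local temporal windows of several sizes and once over the full sequence, and the outputs are fused with a residual connection. This is a minimal illustration under assumed names and dimensions (MultiWindowTemporalAttention, window_sizes, a 256-d fused feature size), not the paper's implementation; the contrastive fine-tuning, feature-domain attention, and event-guided refinement stages are omitted.

```python
# Illustrative sketch only: multi-window temporal attention over fused
# audio-visual segment features. All names and hyperparameters here are
# hypothetical assumptions, not taken from the AVE-CLIP code.
import torch
import torch.nn as nn


class MultiWindowTemporalAttention(nn.Module):
    """Self-attention within local temporal windows of several sizes,
    fused with a global (full-sequence) attention pass."""

    def __init__(self, dim: int, num_heads: int = 4, window_sizes=(2, 4, 10)):
        super().__init__()
        self.window_sizes = window_sizes
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in window_sizes
        )
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(dim * (len(window_sizes) + 1), dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) fused audio-visual features, one vector per segment
        B, T, D = x.shape
        outs = []
        for w, attn in zip(self.window_sizes, self.attns):
            pad = (-T) % w                               # pad so T divides by w
            xp = nn.functional.pad(x, (0, 0, 0, pad))
            xw = xp.reshape(B * (T + pad) // w, w, D)    # split into local windows
            yw, _ = attn(xw, xw, xw)                     # attention inside each window
            outs.append(yw.reshape(B, T + pad, D)[:, :T])
        yg, _ = self.global_attn(x, x, x)                # global temporal context
        outs.append(yg)
        y = self.fuse(torch.cat(outs, dim=-1))           # fuse all temporal scales
        return self.norm(x + y)                          # residual + layer norm


# Usage: 10 one-second segments (AVE videos are 10 s long), 256-d features.
block = MultiWindowTemporalAttention(dim=256)
feats = torch.randn(8, 10, 256)
out = block(feats)                                       # (8, 10, 256)
```

The varying window sizes let the same block attend to short, localized audio-visual cues and to longer event context before the per-segment event classification.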

