Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event Localization

07/12/2023
by Kalyan Ramakrishnan, et al.

Audio-Visual Event Localization (AVEL) is the task of temporally localizing and classifying audio-visual events, i.e., events that are simultaneously visible and audible in a video. In this paper, we address AVEL in a weakly-supervised setting, where only video-level event labels (their presence or absence, but not their locations in time) are available as supervision for training. Our idea is to use a base model to estimate labels on the training data at a finer temporal resolution than the video level, and then to re-train the model with these labels. Specifically, we determine the subset of labels for each slice of frames in a training video by (i) replacing the frames outside the slice with those from a second video that has no overlap in video-level labels, and (ii) feeding this synthetic video into the base model to extract labels for just the slice in question. To handle the out-of-distribution nature of our synthetic videos, we propose an auxiliary objective for the base model that induces more reliable predictions of the localized event labels. Our three-stage pipeline outperforms several existing AVEL methods with no architectural changes and also improves performance on a related weakly-supervised task.
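
The slice-wise label estimation described in (i) and (ii) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`make_synthetic`, `refine_slice_labels`), the `threshold` parameter, and the toy base model are all hypothetical, and the base model is assumed to map a sequence of segments to video-level label scores. Because the donor video shares no labels with the original, any original label the model still predicts on the synthetic video is attributed to the retained slice.

```python
from typing import Callable, Dict, List, Sequence, Set

Segment = object  # placeholder for per-segment audio-visual features
BaseModel = Callable[[Sequence[Segment]], Dict[str, float]]


def make_synthetic(video: Sequence[Segment], donor: Sequence[Segment],
                   start: int, end: int) -> List[Segment]:
    """Keep segments [start, end) of `video`; take all others from `donor`."""
    if len(video) != len(donor):
        raise ValueError("video and donor must have the same number of segments")
    return [video[t] if start <= t < end else donor[t] for t in range(len(video))]


def refine_slice_labels(base_model: BaseModel,
                        video: Sequence[Segment], video_labels: Set[str],
                        donor: Sequence[Segment], donor_labels: Set[str],
                        start: int, end: int,
                        threshold: float = 0.5) -> Set[str]:
    """Estimate which video-level labels are present in the slice [start, end).

    Assumption: `base_model` returns one score per label for the whole input,
    and a fixed threshold decides presence.
    """
    if video_labels & donor_labels:
        raise ValueError("donor video must share no labels with the original")
    synthetic = make_synthetic(video, donor, start, end)
    scores = base_model(synthetic)
    # Only labels of the original video can be explained by the retained slice.
    return {lab for lab in video_labels if scores.get(lab, 0.0) >= threshold}


def toy_model(segments: Sequence[Set[str]]) -> Dict[str, float]:
    """Toy stand-in for a trained base model: each segment is the set of
    events it contains, and a label scores 1.0 if it occurs anywhere."""
    scores: Dict[str, float] = {}
    for seg in segments:
        for label in seg:
            scores[label] = 1.0
    return scores


# Toy data: a 4-segment video labelled {"dog", "guitar"} at the video level.
video = [{"dog"}, {"dog"}, set(), {"guitar"}]
donor = [{"car"}, {"car"}, {"car"}, {"car"}]

refined = [
    refine_slice_labels(toy_model, video, {"dog", "guitar"},
                        donor, {"car"}, t, t + 1)
    for t in range(len(video))
]
```

In this toy setting, `refined` holds one label set per segment, which would then serve as the finer-grained supervision for re-training. A real base model would see out-of-distribution inputs here, which is the issue the auxiliary objective in the abstract is meant to address; the sketch does not model that objective.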


