OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context

02/10/2022
by Merey Ramazanova, et al.

Temporal action localization (TAL) is an important task that has been extensively explored and improved for third-person videos in recent years. More recently, efforts have been made to perform fine-grained temporal localization on first-person videos. However, current TAL methods use only visual signals, neglecting the audio modality, which is present in most videos and carries meaningful action information in egocentric videos. In this work, we take a deep look into the effectiveness of audio in detecting actions in egocentric videos and introduce a simple-yet-effective approach via Observing, Watching, and Listening (OWL) to leverage audio-visual information and context for egocentric TAL. To this end, we: 1) compare and study different strategies for where and how to fuse the two modalities; and 2) propose a transformer-based model to incorporate temporal audio-visual context. Our experiments show that our approach achieves state-of-the-art performance on EPIC-KITCHENS-100.
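To make the question of "where to fuse" concrete, the following minimal sketch contrasts two generic options: fusing at the feature level (concatenating per-snippet visual and audio features before a localization model) and fusing at the score level (averaging per-snippet action scores produced separately for each modality). It is an illustration under stated assumptions, not the architecture proposed in the paper; the function names `fuse_features` and `fuse_scores` and the `audio_weight` parameter are hypothetical, and both modalities are assumed to be already aligned to the same temporal snippets.

```python
from typing import List

Vector = List[float]


def fuse_features(visual: List[Vector], audio: List[Vector]) -> List[Vector]:
    """Feature-level fusion (hypothetical helper).

    Concatenates the visual and audio feature vectors of each temporal
    snippet, so a downstream model sees one joint vector per snippet.
    """
    if len(visual) != len(audio):
        raise ValueError("modalities must cover the same number of snippets")
    return [v + a for v, a in zip(visual, audio)]


def fuse_scores(
    visual_scores: Vector, audio_scores: Vector, audio_weight: float = 0.5
) -> Vector:
    """Score-level fusion (hypothetical helper).

    Takes a weighted average of per-snippet action scores that were
    computed independently for each modality.
    """
    if len(visual_scores) != len(audio_scores):
        raise ValueError("modalities must cover the same number of snippets")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return [
        (1.0 - audio_weight) * v + audio_weight * a
        for v, a in zip(visual_scores, audio_scores)
    ]


if __name__ == "__main__":
    # Two snippets: 2-dim visual features, 1-dim audio features (toy values).
    joint = fuse_features([[0.1, 0.2], [0.3, 0.4]], [[0.5], [0.6]])
    print(len(joint), len(joint[0]))

    # Per-snippet scores for a single action class from each modality.
    print(fuse_scores([1.0, 0.0], [0.0, 1.0], audio_weight=0.5))
```

Feature-level fusion lets a single model learn cross-modal interactions, while score-level fusion keeps the two streams independent until the end; which of these (or an intermediate option) works better for egocentric TAL is the kind of question the comparison in the abstract addresses.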


