Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos

by Juncheng Li, et al.
HUAWEI Technologies Co., Ltd.
National University of Singapore
University of Technology Sydney
Zhejiang University
Université de Montréal

Understanding human emotions is a crucial ability for intelligent robots to provide better human-robot interaction. Existing works are limited to trimmed video-level emotion classification and fail to locate the temporal window corresponding to the emotion. In this paper, we introduce a new task, named Temporal Emotion Localization in videos (TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos with aligned subtitles. TEL presents three unique challenges compared to temporal action localization: 1) emotions have extremely varied temporal dynamics; 2) emotion cues are embedded in both appearances and complex plots; 3) fine-grained temporal annotations are complicated and labor-intensive. To address the first two challenges, we propose a novel dilated context integrated network with a coarse-fine two-stream architecture. The coarse stream captures varied temporal dynamics by modeling multi-granularity temporal contexts. The fine stream achieves understanding of complex plots by reasoning about the dependencies among the multi-granularity temporal contexts from the coarse stream and adaptively integrating them into fine-grained video segment features. To address the third challenge, we introduce a cross-modal consensus learning paradigm, which leverages the inherent semantic consensus between the aligned video and subtitle to achieve weakly-supervised learning. We contribute a new testing set with 3,000 manually-annotated temporal boundaries so that future research on the TEL problem can be quantitatively evaluated. Extensive experiments show the effectiveness of our approach on temporal emotion localization. The repository of this work is at https://github.com/YYJMJC/Temporal-Emotion-Localization-in-Videos.
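The coarse-fine two-stream idea described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the dilation rates, the averaging-based context operator, and the softmax fusion over granularities are all assumptions chosen for demonstration.

```python
import numpy as np

def dilated_context(x, dilation):
    """Coarse stream (sketch): average each segment with neighbors at a
    given dilation, producing one temporal-context granularity."""
    T, _ = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        idx = [max(0, t - dilation), t, min(T - 1, t + dilation)]
        out[t] = x[idx].mean(axis=0)
    return out

def coarse_fine(x, dilations=(1, 2, 4)):
    """Fine stream (sketch): adaptively integrate the multi-granularity
    contexts into each segment feature via softmax attention over
    context-segment similarity (an assumed fusion rule)."""
    contexts = np.stack([dilated_context(x, d) for d in dilations])  # (K, T, D)
    scores = np.einsum('ktd,td->kt', contexts, x)                    # (K, T) similarities
    scores -= scores.max(axis=0, keepdims=True)                      # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    fused = np.einsum('kt,ktd->td', weights, contexts)               # (T, D)
    return x + fused  # residual fine-grained segment features

# Example: 16 video segments with 8-dim features
x = np.random.randn(16, 8)
y = coarse_fine(x)  # same shape as x, context-enriched
```

Each granularity in `dilations` corresponds to one temporal context scale in the coarse stream; the attention weights let each segment draw more heavily on the scale that matches its emotion's temporal dynamics.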


