Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

by Yu Zhao, et al.

Video Semantic Role Labeling (VidSRL) aims to detect the salient events in given videos by recognizing the predicate-argument event structures and the interrelationships between events. While recent endeavors have put forth methods for VidSRL, they are mostly subject to two key drawbacks: the lack of fine-grained spatial scene perception and the insufficient modeling of video temporality. To this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation based on the existing dynamic scene graph structures, which models both the fine-grained spatial semantics and the temporal dynamics of videos for VidSRL. Built upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event (termed ICE) graph structure. We further perform iterative structure refinement to optimize the ICE graph, such that the overall structure representation best coincides with the end-task demand. Finally, the three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework improves significantly over the current best-performing model. Further analyses are provided for a better understanding of the advances of our methods.
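
To make the notion of a dynamic scene graph more concrete, the following is a minimal sketch of one possible data structure: per-frame (subject, predicate, object) triples, plus temporal edges that link an object appearing in two consecutive frames. It is an illustrative assumption only and not the HostSG construction from the paper; the class name `SpatioTemporalSceneGraph`, its methods, and the example labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SpatioTemporalSceneGraph:
    """Toy container (hypothetical, not the paper's HostSG)."""

    # frame index -> list of (subject, predicate, object) triples
    frames: dict[int, list[tuple[str, str, str]]] = field(default_factory=dict)

    def add_relation(self, frame: int, subj: str, pred: str, obj: str) -> None:
        """Record one spatial relation observed in the given frame."""
        self.frames.setdefault(frame, []).append((subj, pred, obj))

    def objects_in(self, frame: int) -> set[str]:
        """All object labels mentioned in a frame's triples."""
        return {name for s, _, o in self.frames.get(frame, []) for name in (s, o)}

    def temporal_edges(self) -> list[tuple[int, int, str]]:
        """Link each object that appears in two consecutive recorded frames."""
        edges = []
        ordered = sorted(self.frames)
        for prev, curr in zip(ordered, ordered[1:]):
            shared = self.objects_in(prev) & self.objects_in(curr)
            for name in sorted(shared):
                edges.append((prev, curr, name))
        return edges


# Usage with made-up labels for three frames.
graph = SpatioTemporalSceneGraph()
graph.add_relation(0, "person", "holds", "cup")
graph.add_relation(1, "person", "drinks_from", "cup")
graph.add_relation(2, "person", "sits_on", "chair")
edges = graph.temporal_edges()
```

Here the spatial structure lives in the per-frame triples and the temporal structure in the cross-frame edges; a real system would match objects by tracked identity rather than by label, which this sketch deliberately omits.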



Spatio-Temporal Scene Graphs for Video Dialog

The Audio-Visual Scene-aware Dialog (AVSD) task requires an agent to ind...

(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering

Spatio-temporal scene-graph approaches to video-based reasoning tasks su...

ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos

Building benchmarks to systemically analyze different capabilities of vi...

LABRAD-OR: Lightweight Memory Scene Graphs for Accurate Bimodal Reasoning in Dynamic Operating Rooms

Modern surgeries are performed in complex and dynamic settings, includin...

Unified Graph Structured Models for Video Understanding

Accurate video understanding involves reasoning about the relationships ...

Large-Scale Automatic Labeling of Video Events with Verbs Based on Event-Participant Interaction

We present an approach to labeling short video clips with English verbs ...

Fine-Grained Temporal Relation Extraction

We present a novel semantic framework for modeling temporal relations an...
