ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos

by Zhou Yu, et al.

Building benchmarks to systematically analyze different capabilities of video question answering (VideoQA) models is challenging yet crucial. Existing benchmarks often use simple, non-compositional questions and suffer from language biases, making it difficult to diagnose model weaknesses incisively. A recent benchmark, AGQA, offers a promising paradigm: it generates QA pairs automatically from pre-annotated scene graphs, which lets it measure diverse reasoning abilities with granular control. However, its questions are limited in reasoning about the fine-grained semantics in videos, because such information is absent from its scene graphs. To this end, we present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over the challenging untrimmed videos from ActivityNet. As in AGQA, the QA pairs in ANetQA are automatically generated from annotated video scene graphs. The fine-grained properties of ANetQA are reflected in the following: (i) untrimmed videos with fine-grained semantics; (ii) spatio-temporal scene graphs with fine-grained taxonomies; and (iii) diverse questions generated from fine-grained templates. ANetQA attains 1.4 billion unbalanced and 13.4 million balanced QA pairs, which is an order of magnitude larger than AGQA with a similar number of videos. Comprehensive experiments are performed for state-of-the-art methods. The best model achieves 44.5% accuracy while human performance tops out at 84.5%.
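To make the idea of generating QA pairs from annotated scene graphs with templates more concrete, here is a minimal sketch. It is not the ANetQA pipeline: the `SceneObject` and `Relation` classes, the two templates, and the toy annotations are all hypothetical and exist only to illustrate how a template can be filled from scene-graph entries.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """Hypothetical scene-graph node: an object with named attributes."""
    name: str
    attributes: dict = field(default_factory=dict)  # e.g. {"color": "red"}


@dataclass
class Relation:
    """Hypothetical scene-graph edge, valid over a time span in seconds."""
    subject: str
    predicate: str
    obj: str
    start: float
    end: float


def attribute_questions(objects, template="What {attr} is the {name}?"):
    """Fill an attribute template once per (object, attribute) annotation."""
    pairs = []
    for o in objects:
        for attr, value in sorted(o.attributes.items()):
            pairs.append((template.format(attr=attr, name=o.name), value))
    return pairs


def relation_questions(relations, template="What is the {subject} {predicate}?"):
    """Fill a relation template; the annotated object is the answer."""
    return [
        (template.format(subject=r.subject, predicate=r.predicate), r.obj)
        for r in relations
    ]


# Toy annotations (invented for illustration, not taken from the dataset).
objects = [
    SceneObject("shirt", {"color": "red"}),
    SceneObject("ball", {"color": "white", "material": "leather"}),
]
relations = [Relation("person", "holding", "ball", start=2.0, end=5.5)]

qa_pairs = attribute_questions(objects) + relation_questions(relations)
for question, answer in qa_pairs:
    print(f"Q: {question}  A: {answer}")
```

In a sketch like this, the number of generated questions grows with both the number of annotations and the number of templates, which is why richer taxonomies and finer-grained templates can yield far more QA pairs from a similar number of videos.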




AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning

Visual events are a composition of temporal actions involving actors spa...

AGQA 2.0: An Updated Benchmark for Compositional Spatio-Temporal Reasoning

Prior benchmarks have analyzed models' answers to questions about videos...

KQA Pro: A Large Diagnostic Dataset for Complex Question Answering over Knowledge Base

Complex question answering over knowledge base (Complex KBQA) is challen...

BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

Video-grounded dialogues are very challenging due to (i) the complexity ...

Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

Video Semantic Role Labeling (VidSRL) aims to detect the salient events ...

RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought

Large language Models (LLMs) have achieved promising performance on arit...

Scaling Fine-grained Modularity Clustering for Massive Graphs

Modularity clustering is an essential tool to understand complicated gra...
