Redundancy-aware Transformer for Video Question Answering

08/07/2023
by   Yicong Li, et al.

This paper identifies two kinds of redundancy in the current VideoQA paradigm. First, current video encoders tend to embed all video clues holistically at different granularities in a hierarchical manner, which inevitably introduces neighboring-frame redundancy that can overwhelm detailed visual clues at the object level. Second, prevailing vision-language fusion designs introduce cross-modal redundancy by exhaustively fusing all visual elements with question tokens without explicitly differentiating their pairwise vision-language interactions, which harms answering. To this end, we propose a novel transformer-based architecture that models VideoQA in a redundancy-aware manner. To address neighboring-frame redundancy, we introduce a video encoder structure that emphasizes object-level change across neighboring frames, while adopting an out-of-neighboring message-passing scheme that imposes attention only on distant frames. To address cross-modal redundancy, we equip our fusion module with a novel adaptive sampling mechanism that explicitly differentiates vision-language interactions by identifying a small subset of visual elements that exclusively support the answer. With these designs, our Redundancy-aware transformer (RaFormer) achieves state-of-the-art results on multiple VideoQA benchmarks.
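The sketch below is a minimal, illustrative rendering of the two mechanisms the abstract describes, not the authors' implementation: an attention mask that allows message passing only between distant (out-of-neighborhood) frames, and a question-conditioned top-k selection standing in for adaptive sampling. All names, tensor shapes, the neighborhood radius, and the scoring rule are assumptions made for this example.

```python
# Hypothetical sketch of the two redundancy-aware ideas (assumed details).
import torch
import torch.nn.functional as F


def out_of_neighborhood_mask(num_frames: int, radius: int = 2) -> torch.Tensor:
    """Boolean mask that blocks attention between frames closer than `radius`,
    so each frame exchanges messages only with distant frames (plus itself)."""
    idx = torch.arange(num_frames)
    dist = (idx[:, None] - idx[None, :]).abs()
    mask = dist >= radius                              # allow distant frames
    mask |= torch.eye(num_frames, dtype=torch.bool)    # keep self-attention
    return mask


def adaptive_sample(visual: torch.Tensor, question: torch.Tensor, k: int = 8):
    """Keep only the k visual tokens most relevant to the question.

    visual:   (N, d) visual element features
    question: (d,)   pooled question feature
    """
    scores = visual @ question                         # (N,) relevance scores
    topk = torch.topk(scores, k=min(k, visual.size(0))).indices
    return visual[topk], topk


if __name__ == "__main__":
    T, N, d = 16, 64, 256
    frames = torch.randn(T, d)                         # frame-level features
    mask = out_of_neighborhood_mask(T, radius=2)

    # Masked self-attention over frames: near neighbors are excluded.
    attn = (frames @ frames.T) / d ** 0.5
    attn = attn.masked_fill(~mask, float("-inf"))
    frames_ctx = F.softmax(attn, dim=-1) @ frames

    # Cross-modal step: keep a small question-conditioned subset of objects.
    objects = torch.randn(N, d)                        # object-level features
    q = torch.randn(d)
    selected, idx = adaptive_sample(objects, q, k=8)
    print(frames_ctx.shape, selected.shape)
```

In this toy setup the hard top-k selection is only a stand-in; the paper's adaptive sampling is described as learned, so a practical variant would make the selection differentiable.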
