Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering

09/10/2021
by Min Peng, et al.

Video question answering (VideoQA) is challenging because it combines visual understanding with natural language understanding. Existing approaches seldom leverage the appearance-motion information in the video at multiple temporal scales, and they frequently ignore the interaction between the question and the visual information when extracting textual semantics. To address these issues, this paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA. The TPT model comprises two modules: Question-specific Transformer (QT) and Visual Inference (VI). Given the temporal pyramid constructed from a video, QT builds the question semantics from the coarse-to-fine multimodal co-occurrence between each word and the visual content. Guided by these question-specific semantics, VI infers the visual clues from the local-to-global multi-level interactions between the question and the video. Within each module, we introduce a multimodal attention mechanism to aid the extraction of question-video interactions, and we adopt residual connections to pass information across levels. Through extensive experiments on three VideoQA datasets, we demonstrate that the proposed method outperforms state-of-the-art approaches.
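
The abstract does not give the details of the temporal pyramid or the attention mechanism, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that level l of the pyramid holds 2^l segments obtained by mean-pooling frame features, that the question is a single vector with the same dimension as the frame features, and that attention is plain scaled dot-product attention. The function names `build_temporal_pyramid`, `attend`, and `coarse_to_fine` are hypothetical.

```python
import math


def build_temporal_pyramid(frames, num_levels):
    """Assumed construction: level l has 2**l segments, each the mean of its frames."""
    n, dim = len(frames), len(frames[0])
    pyramid = []
    for level in range(num_levels):
        num_segments = 2 ** level
        if num_segments > n:
            raise ValueError("not enough frames for this many levels")
        segments = []
        for s in range(num_segments):
            # Integer arithmetic splits the frames into near-equal contiguous chunks.
            chunk = frames[s * n // num_segments:(s + 1) * n // num_segments]
            segments.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
        pyramid.append(segments)
    return pyramid


def attend(query, keys):
    """Scaled dot-product attention of one query over keys (keys double as values)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(len(keys[0]))]


def coarse_to_fine(question, pyramid):
    """Refine a question vector level by level, coarsest level first."""
    state = list(question)
    for segments in pyramid:
        attended = attend(state, segments)
        # Residual connection: keep the previous state and add the new evidence.
        state = [s + a for s, a in zip(state, attended)]
    return state


# Toy input: 8 frames with 2-dimensional features (illustrative values only).
frames = [[float(t), 1.0] for t in range(8)]
pyramid = build_temporal_pyramid(frames, num_levels=3)
refined = coarse_to_fine([0.0, 0.0], pyramid)
```

Processing the pyramid from the single whole-video segment down to finer segments mirrors the coarse-to-fine order described for QT; reversing the loop would give a local-to-global pass of the kind described for VI.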

