Video Question Generation via Cross-Modal Self-Attention Networks Learning

07/05/2019
by Yu-Siang Wang, et al.

Video Question Answering (Video QA) is a critical and challenging task in multimedia comprehension. While deep-learning-based models are highly capable of representing and understanding videos, they rely heavily on massive amounts of data, which is expensive to label. In this paper, we introduce a novel task: automatically generating questions from a sequence of video frames and the corresponding subtitles of a video clip, in order to reduce the huge annotation cost. Learning to ask a question about a video requires the model to comprehend the rich semantics of the scene and the interplay between vision and language. To address this, we propose a novel cross-modal self-attention (CMSA) network that aggregates the diverse features from video frames and subtitles. We demonstrate that the proposed model improves a strong baseline from 0.0738 to 0.1374 in BLEU4 score, an absolute improvement of more than 0.063 (a relative gain of roughly 85%). Above all, we arguably pave a new path toward solving the challenging Video QA task and provide a detailed analysis that opens avenues for future investigation.
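The reported gain can be checked with simple arithmetic. The sketch below is only an illustration of that calculation, not part of the paper: the helper name `bleu_gain` is hypothetical, and the two scores are the BLEU4 values quoted in the abstract. The exact ratio comes out slightly above the rounded 85% figure given there.

```python
from typing import Tuple


def bleu_gain(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain) of `improved` over `baseline`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    absolute = improved - baseline
    return absolute, absolute / baseline


# BLEU4 scores quoted in the abstract: baseline 0.0738, proposed model 0.1374.
absolute, relative = bleu_gain(0.0738, 0.1374)
print(f"absolute gain: {absolute:.4f}")  # absolute gain: 0.0636
print(f"relative gain: {relative:.1%}")  # relative gain: 86.2%
```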

research
08/01/2022

Video Question Answering with Iterative Video-Text Co-Tokenization

Video question answering is a challenging task that requires understandi...
research
05/10/2022

Learning to Answer Visual Questions from Web Videos

Recent methods for visual question answering rely on large-scale annotat...
research
01/27/2023

Semi-Parametric Video-Grounded Text Generation

Efficient video-language modeling should consider the computational cost...
research
05/01/2021

Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering

Due to the severe lack of labeled data, existing methods of medical visu...
research
03/28/2022

X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

In text-video retrieval, the objective is to learn a cross-modal similar...
research
06/12/2018

Attentive cross-modal paratope prediction

Antibodies are a critical part of the immune system, having the function...
research
08/07/2023

Redundancy-aware Transformer for Video Question Answering

This paper identifies two kinds of redundancy in the current VideoQA par...
