Two-stream Spatiotemporal Feature for Video QA Task

07/11/2019
by   Chiwan Song, et al.

Understanding video content is a core capability for many helpful real-world applications, such as recognizing human actions for surveillance systems or analyzing customer behavior in an autonomous shop. However, understanding the content or story of a video remains a challenging problem due to the sheer amount of data and its temporal structure. In this paper, we propose a multi-channel neural network that adopts a two-stream architecture, which has shown high performance in human action recognition, and uses it as a spatiotemporal video feature extractor for the video question answering task. We also add a squeeze-and-excitation structure to the two-stream network to obtain channel-wise attended spatiotemporal features. To jointly model the spatiotemporal features from the video and the textual features from the question, we design a context matching module with a level adjusting layer that closes the information gap between visual and textual features by applying an attention mechanism during joint modeling. Finally, we adopt a scoring mechanism and a smoothed ranking loss objective to select the correct answer from the answer candidates. We evaluate our model on the TVQA dataset; our approach improves results in the text-only setting, while the results with visual features show both the limitations and the potential of our approach.
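To make the channel-wise attention concrete, below is a minimal sketch of a squeeze-and-excitation block applied to two-stream (appearance and motion) spatiotemporal features, in the spirit of the abstract. The use of PyTorch, the tensor shapes, and the concatenation-based fusion are assumptions for illustration; the paper's actual feature extractor and fusion scheme may differ.

```python
# Minimal PyTorch sketch (framework and shapes are assumptions, not the paper's code):
# squeeze-and-excitation channel attention over two-stream spatiotemporal features.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-excitation: pool each channel globally, then re-weight channels."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width) spatiotemporal feature map
        b, c = x.shape[:2]
        squeezed = x.mean(dim=(2, 3, 4))                 # squeeze: (b, c)
        weights = self.fc(squeezed).view(b, c, 1, 1, 1)  # per-channel gates in [0, 1]
        return x * weights                               # excitation: channel re-weighting


# Hypothetical usage: channel-attend each stream, then fuse by concatenation.
rgb_feat = torch.randn(2, 256, 8, 7, 7)   # appearance-stream output (dummy shapes)
flow_feat = torch.randn(2, 256, 8, 7, 7)  # motion-stream output (dummy shapes)
se_rgb, se_flow = SEBlock(256), SEBlock(256)
fused = torch.cat([se_rgb(rgb_feat), se_flow(flow_feat)], dim=1)  # (2, 512, 8, 7, 7)
```

The fused tensor would then feed the context matching module, which aligns it with the question features before scoring the answer candidates.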

Related research

08/07/2019
STM: SpatioTemporal and Motion Encoding for Action Recognition
Spatiotemporal and motion features are two complementary and crucial inf...

06/02/2022
Structured Two-stream Attention Network for Video Question Answering
To date, visual question answering (VQA) (i.e., image QA and video QA) i...

03/04/2019
Spatiotemporal Pyramid Network for Video Action Recognition
Two-stream convolutional networks have shown strong performance in video...

11/07/2016
Spatiotemporal Residual Networks for Video Action Recognition
Two-stream Convolutional Networks (ConvNets) have shown strong performan...

03/13/2022
Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video
The temporal answering grounding in the video (TAGV) is a new task natur...

09/04/2023
Understanding Video Scenes through Text: Insights from Text-based Video Question Answering
Researchers have extensively studied the field of vision and language, d...

07/04/2022
Automated Classification of General Movements in Infants Using a Two-stream Spatiotemporal Fusion Network
The assessment of general movements (GMs) in infants is a useful tool in...
