Learning to Discretely Compose Reasoning Module Networks for Video Captioning

07/17/2020
by   Ganchao Tan, et al.

Generating natural language descriptions for videos, i.e., video captioning, essentially requires step-by-step reasoning along the generation process. For example, to generate the sentence "a man is shooting a basketball", we must first locate and describe the subject "man", then reason out the action "shooting", and finally describe the object "basketball" of that action. However, existing visual reasoning methods designed for visual question answering are not well suited to video captioning, which requires more complex visual reasoning over both space and time, as well as dynamic module composition along the generation process. In this paper, we propose a novel visual reasoning approach for video captioning, named Reasoning Module Networks (RMN), to equip the existing encoder-decoder framework with this reasoning capacity. Specifically, our RMN employs 1) three sophisticated spatio-temporal reasoning modules, and 2) a dynamic and discrete module selector trained by a linguistic loss with a Gumbel approximation. Extensive experiments on the MSVD and MSR-VTT datasets demonstrate that the proposed RMN outperforms state-of-the-art methods while providing an explicit and explainable generation process. Our code is available at https://github.com/tgc1997/RMN.
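To make the "dynamic and discrete module selector" concrete, below is a minimal PyTorch sketch of discrete module selection via the straight-through Gumbel-Softmax estimator: the forward pass picks exactly one module (one-hot), while gradients flow through the soft relaxation. The class name, shapes, and random module outputs are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModuleSelector(nn.Module):
    """Picks one of several reasoning modules per decoding step.

    Uses the straight-through Gumbel-Softmax estimator: the forward
    pass returns a one-hot (discrete) choice, while the backward pass
    uses the soft distribution, keeping the selector trainable.
    """

    def __init__(self, hidden_size: int, num_modules: int = 3, tau: float = 1.0):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, num_modules)  # logits per module
        self.tau = tau

    def forward(self, decoder_state: torch.Tensor) -> torch.Tensor:
        logits = self.scorer(decoder_state)  # (batch, num_modules)
        # hard=True gives one-hot samples in the forward pass but
        # differentiates through the soft relaxation in the backward pass.
        return F.gumbel_softmax(logits, tau=self.tau, hard=True)

# Illustrative usage: select among three hypothetical reasoning modules.
batch, hidden = 2, 512
state = torch.randn(batch, hidden)                 # stand-in decoder state
module_outputs = torch.stack(
    [torch.randn(batch, hidden) for _ in range(3)], dim=1
)                                                  # (batch, 3, hidden)
weights = ModuleSelector(hidden)(state)            # one-hot, (batch, 3)
selected = (weights.unsqueeze(-1) * module_outputs).sum(dim=1)  # (batch, hidden)
```

Because the selection is one-hot at inference time, the chosen module at each word is directly observable, which is what makes the generation process explicit and explainable.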


Related research

10/04/2022 · Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning
Humans tend to decompose a sentence into different parts like sth do sth...

11/17/2022 · Visual Commonsense-aware Representation Network for Video Captioning
Generating consecutive descriptions for videos, i.e., Video Captioning, ...

05/22/2022 · GL-RG: Global-Local Representation Granularity for Video Captioning
Video captioning is a challenging task as it needs to accurately transfo...

05/03/2023 · Visual Transformation Telling
In this paper, we propose a new visual reasoning task, called Visual Tra...

04/18/2019 · Learning to Collocate Neural Modules for Image Captioning
We do not speak word by word from scratch; our brain quickly structures ...

08/08/2021 · Discriminative Latent Semantic Graph for Video Captioning
Video captioning aims to automatically generate natural language sentenc...

12/08/2018 · Explainability by Parsing: Neural Module Tree Networks for Natural Language Visual Grounding
Grounding natural language in images essentially requires composite visu...
