VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

11/28/2022
by Kashu Yamazaki, et al.

Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video containing several temporal event locations, told as a coherent story. Following the human perception process, in which a scene is effectively understood by decomposing it into visual (e.g., human, animal) and non-visual components (e.g., action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to ensure that the learned embedding features match the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior state-of-the-art methods in terms of accuracy and diversity.
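
The abstract only sketches the architecture at a high level. As a rough illustration of the Transformer-in-Transformer layout it describes, here is a minimal PyTorch sketch with an inner transformer over the tokens of each event and an autoregressive outer transformer over event-level summaries. The module choices, dimensions, mean pooling, and the class name TinTSketch are assumptions made for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a Transformer-in-Transformer (TinT) layout, assuming:
# inner transformer = intra-event coherence, outer transformer = inter-event
# coherence over pooled event summaries. Illustrative only.
import torch
import torch.nn as nn

class TinTSketch(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        inner_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        outer_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # Inner transformer: models contents within a single event.
        self.inner = nn.TransformerEncoder(inner_layer, num_layers)
        # Outer transformer: models coherence across the sequence of events.
        self.outer = nn.TransformerEncoder(outer_layer, num_layers)

    def forward(self, events):
        # events: (batch, num_events, tokens_per_event, d_model) VL features.
        b, e, t, d = events.shape
        intra = self.inner(events.reshape(b * e, t, d))    # per-event encoding
        summaries = intra.mean(dim=1).reshape(b, e, d)     # pool each event
        # Causal mask keeps the outer pass autoregressive over events.
        mask = torch.triu(torch.full((e, e), float("-inf")), diagonal=1)
        return self.outer(summaries, mask=mask)            # (b, e, d)

x = torch.randn(2, 4, 16, 512)   # 2 videos, 4 events, 16 tokens per event
print(TinTSketch()(x).shape)     # torch.Size([2, 4, 512])
```

The causal mask on the outer pass mirrors the autoregressive inter-event modeling the abstract mentions; the full model also builds the VL feature from the three modalities and decodes captions with a language head, both of which this sketch omits.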

Related research

06/26/2022  VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning
In this paper, we leverage the human perceiving process, that involves v...

05/11/2020  MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning
Generating multi-sentence descriptions for videos is one of the most cha...

11/19/2021  DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
Dense video captioning (DVC) aims to generate multi-sentence description...

04/02/2019  Context and Attribute Grounded Dense Captioning
Dense captioning aims at simultaneously localizing semantic regions and ...

05/12/2022  What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics
While there have been significant gains in the field of automated video ...

05/22/2022  GL-RG: Global-Local Representation Granularity for Video Captioning
Video captioning is a challenging task as it needs to accurately transfo...

06/26/2022  RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval
Seas of videos are uploaded daily with the popularity of social channels...
