Video Captioning with Boundary-aware Hierarchical Language Decoding and Joint Video Prediction

by   Xiangxi Shi, et al.

The explosion of video data on the internet requires effective and efficient technology to generate captions automatically for people who are not able to watch the videos. Despite the great progress of video captioning research, particularly on video feature encoding, the language decoder is still largely based on the prevailing RNN decoder such as LSTM, which tends to prefer the frequent word that aligns with the video. In this paper, we propose a boundary-aware hierarchical language decoder for video captioning, which consists of a high-level GRU based language decoder, working as a global (caption-level) language model, and a low-level GRU based language decoder, working as a local (phrase-level) language model. Most importantly, we introduce a binary gate into the low-level GRU language decoder to detect the language boundaries. Together with other advanced components including joint video prediction, shared soft attention, and boundary-aware video encoding, our integrated video captioning framework can discover hierarchical language information and distinguish the subject and the object in a sentence, which are usually confusing during the language generation. Extensive experiments on two widely-used video captioning datasets, MSR-Video-to-Text (MSR-VTT) xu2016msr and YouTube-to-Text (MSVD) chen2011collecting show that our method is highly competitive, compared with the state-of-the-art methods.


page 7

page 8


Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning

Recent progress has been made in using attention based encoder-decoder f...

LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning

Our winning entry for the CVPR 2023 Generic Event Boundary Captioning (G...

CLIP4Caption: CLIP for Video Caption

Video captioning is a challenging task since it requires generating sent...

Hierarchical Boundary-Aware Neural Encoder for Video Captioning

The use of Recurrent Neural Networks for video captioning has recently g...

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

In this work, we introduce Vid2Seq, a multi-modal single-stage dense eve...

Hierarchical Memory Decoding for Video Captioning

Recent advances of video captioning often employ a recurrent neural netw...

VideoBERT: A Joint Model for Video and Language Representation Learning

Self-supervised learning has become increasingly important to leverage t...

Please sign up or login with your details

Forgot password? Click here to reset