Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning

06/05/2017
by Jingkuan Song, et al.

Recent progress has been made in using attention-based encoder-decoder frameworks for video captioning. However, most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the" and "a"). These non-visual words can be easily predicted by a natural language model without considering visual signals or attention, and imposing an attention mechanism on them can mislead the decoder and degrade the overall performance of video captioning. To address this issue, we propose a hierarchical LSTM with adjusted temporal attention (hLSTMat) approach for video captioning. Specifically, the proposed framework utilizes temporal attention to select specific frames for predicting the related words, while the adjusted temporal attention decides whether to rely on the visual information or on the language context information. In addition, hierarchical LSTMs are designed to simultaneously consider both low-level visual information and high-level language context information to support video caption generation. To demonstrate the effectiveness of our proposed framework, we evaluate our method on two prevalent datasets, MSVD and MSR-VTT; experimental results show that our approach outperforms state-of-the-art methods on both datasets.
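To make the mechanism concrete, here is a minimal PyTorch sketch (not the authors' implementation) of the two ideas described above: soft temporal attention over frame features, and a learned sigmoid gate that mixes the attended visual context with the language context produced by a second, higher-level LSTM. All module names and dimensions (FEAT_DIM, HID_DIM, VOCAB) are illustrative assumptions.

```python
# Hedged sketch of temporal attention plus an "adjusted" visual/language gate.
# Not the authors' code; the names and sizes below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, HID_DIM, VOCAB = 2048, 512, 10000  # assumed feature/hidden/vocab sizes

class TemporalAttention(nn.Module):
    """Soft attention over T frame features, conditioned on the decoder state."""
    def __init__(self):
        super().__init__()
        self.w_feat = nn.Linear(FEAT_DIM, HID_DIM)
        self.w_hid = nn.Linear(HID_DIM, HID_DIM)
        self.v = nn.Linear(HID_DIM, 1)

    def forward(self, feats, h):
        # feats: (B, T, FEAT_DIM); h: (B, HID_DIM)
        scores = self.v(torch.tanh(self.w_feat(feats) + self.w_hid(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)       # attention weights over frames
        return (alpha * feats).sum(dim=1)      # attended visual context (B, FEAT_DIM)

class AdjustedAttentionDecoder(nn.Module):
    """Two stacked LSTM cells: the bottom cell tracks low-level (word/visual)
    state, the top cell tracks high-level language context. A sigmoid gate
    beta mixes the attended visual context with the language context before
    the word classifier, so non-visual words need not attend to the video."""
    def __init__(self):
        super().__init__()
        self.bottom = nn.LSTMCell(HID_DIM, HID_DIM)
        self.top = nn.LSTMCell(HID_DIM, HID_DIM)
        self.embed = nn.Embedding(VOCAB, HID_DIM)
        self.attend = TemporalAttention()
        self.vis_proj = nn.Linear(FEAT_DIM, HID_DIM)
        self.gate = nn.Linear(HID_DIM, 1)      # produces beta in (0, 1)
        self.classifier = nn.Linear(HID_DIM, VOCAB)

    def step(self, word, feats, state):
        (h1, c1), (h2, c2) = state
        h1, c1 = self.bottom(self.embed(word), (h1, c1))  # low-level LSTM
        h2, c2 = self.top(h1, (h2, c2))                   # language LSTM
        ctx = self.vis_proj(self.attend(feats, h1))       # visual context
        beta = torch.sigmoid(self.gate(h1))               # adjusted-attention gate
        mixed = beta * ctx + (1.0 - beta) * h2            # visual vs. language mix
        logits = self.classifier(mixed + h2)              # next-word scores
        return logits, ((h1, c1), (h2, c2))
```

At each decoding step, a gate value near 1 pushes the prediction toward visual evidence (useful for words like "gun"), while a value near 0 lets function words such as "the" be generated from the language LSTM alone, which is the behavior the abstract argues for.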

Related research

12/26/2018 · Hierarchical LSTMs with Adaptive Attention for Visual Captioning
Recent progress has been made in using attention based encoder-decoder f...

11/05/2019 · Video Captioning with Text-based Dynamic Attention and Step-by-Step Learning
Automatically describing video content with natural language has been at...

01/01/2019 · Not All Words are Equal: Video-specific Information Loss for Video Captioning
An ideal description for a given video should fix its gaze on salient an...

07/08/2018 · Video Captioning with Boundary-aware Hierarchical Language Decoding and Joint Video Prediction
The explosion of video data on the internet requires effective and effic...

08/08/2017 · From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning
Video captioning in essence is a complex natural process, which is aff...

10/16/2021 · Visual-aware Attention Dual-stream Decoder for Video Captioning
Video captioning is a challenging task that captures different visual pa...

05/10/2019 · Memory-Attended Recurrent Network for Video Captioning
Typical techniques for video captioning follow the encoder-decoder frame...
