METEOR Guided Divergence for Video Captioning

Automatic video captioning aims for a holistic visual scene understanding. It requires a mechanism for capturing temporal context in video frames and the ability to comprehend the actions and associations of objects in a given timeframe. Such a system should additionally learn to abstract video sequences into sensible representations as well as to generate natural written language. While the majority of captioning models focus solely on the visual inputs, little attention has been paid to the audiovisual modality. To tackle this issue, we propose a novel two-fold approach. First, we implement a reward-guided KL Divergence to train a video captioning model which is resilient towards token permutations. Second, we utilise a Bi-Modal Hierarchical Reinforcement Learning (BMHRL) Transformer architecture to capture long-term temporal dependencies of the input data as a foundation for our hierarchical captioning module. Using our BMHRL, we show the suitability of the HRL agent in the generation of content-complete and grammatically sound sentences by achieving 4.91, 2.23, and 10.80 in BLEU3, BLEU4, and METEOR scores, respectively on the ActivityNet Captions dataset. Finally, we make our BMHRL framework and trained models publicly available for users and developers at


Meaning guided video captioning

Current video captioning approaches often suffer from problems of missin...

Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language

Neuro-symbolic representations have proved effective in learning structu...

A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

Dense video captioning aims to localize and describe important events in...

Global2Local: A Joint-Hierarchical Attention for Video Captioning

Recently, automatic video captioning has attracted increasing attention,...

Aligning Source Visual and Target Language Domains for Unpaired Video Captioning

Training supervised video captioning model requires coupled video-captio...

Dense Video Captioning Using Unsupervised Semantic Information

We introduce a method to learn unsupervised semantic visual information ...

Automatic Generation of Descriptive Titles for Video Clips Using Deep Learning

Over the last decade, the use of Deep Learning in many applications prod...

Please sign up or login with your details

Forgot password? Click here to reset