Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning

by Junchao Zhang, et al.

Video captioning aims to automatically generate natural language descriptions of video content, a task that has drawn considerable attention in recent years. Generating accurate, fine-grained captions requires not only understanding the global content of a video but also capturing detailed object information. Meanwhile, video representations have a great impact on the quality of the generated captions. Thus, it is important for video captioning to capture salient objects with their detailed temporal dynamics and to represent them using discriminative spatio-temporal representations. In this paper, we propose a new video captioning approach based on object-aware aggregation with bidirectional temporal graph (OA-BTG), which captures detailed temporal dynamics for salient objects in video and learns discriminative spatio-temporal representations by performing object-aware local feature aggregation on detected object regions. The main novelties and advantages are: (1) Bidirectional temporal graph: A bidirectional temporal graph is constructed both along and in reverse of the temporal order, which provides complementary ways to capture the temporal trajectory of each salient object. (2) Object-aware aggregation: Learnable VLAD (Vector of Locally Aggregated Descriptors) models are constructed on the object temporal trajectories and the global frame sequence, and perform object-aware aggregation to learn discriminative representations. A hierarchical attention mechanism is also developed to distinguish the different contributions of multiple objects. Experiments on two widely used datasets demonstrate that OA-BTG achieves state-of-the-art performance in terms of the BLEU@4, METEOR and CIDEr metrics.
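To make the aggregation step more concrete, the following is a minimal sketch of VLAD-style aggregation over a set of local descriptors, such as the region features along one object trajectory. It is not the authors' implementation. The abstract does not specify how descriptors are assigned to cluster centers, so the soft assignment used here (a softmax over negative squared distances, in the style of NetVLAD) is an assumption, as are the function names `soft_assign` and `vlad_aggregate`, the sharpness parameter `alpha`, and the normalization steps. In a learnable VLAD model the centers would be trained parameters; here they are simply passed in as an array.

```python
import numpy as np


def soft_assign(x, centers, alpha=1.0):
    """Softly assign each descriptor to each center (assumed NetVLAD-style).

    x:       (N, D) local descriptors, e.g. features along one trajectory
    centers: (K, D) cluster centers (learned parameters in a trainable model)
    returns: (N, K) weights, each row summing to 1
    """
    # Squared Euclidean distance between every descriptor and every center.
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    logits = -alpha * d2
    # Subtract the row maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)


def vlad_aggregate(x, centers, alpha=1.0, eps=1e-12):
    """Aggregate N descriptors into one fixed-length (K * D) vector."""
    a = soft_assign(x, centers, alpha)                 # (N, K)
    residuals = x[:, None, :] - centers[None, :, :]    # (N, K, D)
    # Weighted sum of residuals per center.
    v = (a[:, :, None] * residuals).sum(axis=0)        # (K, D)
    # Per-center L2 normalization, then global L2 normalization
    # (a common convention, assumed here).
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)
    v = v.ravel()
    return v / (np.linalg.norm(v) + eps)
```

Under these assumptions, a trajectory of any length maps to a vector of fixed size `K * D`, so the same routine could be applied to the forward trajectories, the backward trajectories, and the global frame sequence before any attention or decoding stage.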




Discriminative Latent Semantic Graph for Video Captioning

Video captioning aims to automatically generate natural language sentenc...

Diverse Video Captioning by Adaptive Spatio-temporal Attention

To generate proper captions for videos, the inference needs to identify ...

Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion

The application of video captioning models aims at translating the conte...

Not All Words are Equal: Video-specific Information Loss for Video Captioning

An ideal description for a given video should fix its gaze on salient an...

Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering

The main challenge in video question answering (VideoQA) is to capture a...

Text with Knowledge Graph Augmented Transformer for Video Captioning

Video captioning aims to describe the content of videos using natural la...

Relational Graph Learning for Grounded Video Description Generation

Grounded video description (GVD) encourages captioning models to attend ...
