Extended Self-Critical Pipeline for Transforming Videos to Text (TRECVID-VTT Task 2021) – Team: MMCUniAugsburg

12/28/2021
by Philipp Harzig, et al.

The Multimedia and Computer Vision Lab of the University of Augsburg participated in the VTT task only. We use the VATEX and TRECVID-VTT datasets for training our VTT models. Both of our submitted runs are based on the Transformer approach. For our second model, we adapt the X-Linear Attention Networks for Image Captioning, but this does not yield the desired improvement in scores. For both models, we pretrain on the complete VATEX dataset and 90% of the TRECVID-VTT dataset, using the remaining 10% for validation. We finetune both models with self-critical sequence training, which boosts the validation performance significantly. Overall, we find that training a Video-to-Text system on traditional Image Captioning pipelines delivers very poor performance. When switching to a Transformer-based architecture, our results greatly improve and the generated captions match the corresponding videos better.
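
As a rough illustration of the self-critical sequence training step mentioned above, the following is a minimal sketch of the per-caption loss, not the team's actual implementation. It assumes the standard formulation in which the reward of a greedily decoded caption serves as the baseline for a sampled caption. The function name `scst_loss` and its parameters are hypothetical, and the reward metric (for example CIDEr) is an assumption, since the abstract does not name one.

```python
import math


def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical loss for a single sampled caption (illustrative sketch).

    sample_logprobs: per-token log-probabilities of the sampled caption
    sample_reward:   score of the sampled caption (assumed metric, e.g. CIDEr)
    greedy_reward:   score of the greedy-decoded caption, used as the baseline
    """
    # Positive advantage: the sampled caption beat the greedy baseline.
    advantage = sample_reward - greedy_reward
    # REINFORCE-style objective: minimizing this raises the likelihood of
    # captions that score above the baseline and lowers it for those below.
    return -advantage * sum(sample_logprobs)


# Hypothetical two-token caption, each token sampled with probability 0.5.
logprobs = [math.log(0.5), math.log(0.5)]
loss = scst_loss(logprobs, sample_reward=1.0, greedy_reward=0.4)
```

In an actual training loop the log-probabilities would come from the captioning model, and the loss would be averaged over a batch before backpropagation.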

