Test of Time: Instilling Video-Language Models with a Sense of Time

01/05/2023
by Piyush Bagad, et al.

Modeling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that six existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require varying degrees of time awareness. We observe encouraging performance gains, especially when the task needs higher time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data- and compute-intensive training from scratch.
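
To make the notion of time-order consistency concrete, the following is a minimal sketch of how a before/after probe might be set up. It is not the authors' implementation: the function names, the toy "video" representation (an ordered list of event strings), and both scoring functions are assumptions made purely for illustration. A real probe would replace the toy scorers with a video-language model's video-text similarity.

```python
from typing import Callable, List, Tuple

# Hypothetical toy setup: a "video" is an ordered list of event descriptions.
Video = List[str]
Scorer = Callable[[Video, str], float]


def order_prompts(first: str, second: str) -> Tuple[str, str]:
    """Build a time-order-consistent caption and its order-swapped counterpart."""
    consistent = f"{first} before {second}"
    swapped = f"{second} before {first}"
    return consistent, swapped


def time_order_accuracy(samples: List[Tuple[Video, str, str]], score: Scorer) -> float:
    """Fraction of samples where the consistent caption scores strictly higher.

    Ties count as failures in this sketch.
    """
    correct = 0
    for video, first, second in samples:
        consistent, swapped = order_prompts(first, second)
        if score(video, consistent) > score(video, swapped):
            correct += 1
    return correct / len(samples)


def order_blind_score(video: Video, caption: str) -> float:
    """Toy scorer that only counts which events are mentioned, ignoring order."""
    return float(sum(event in caption for event in video))


def order_aware_score(video: Video, caption: str) -> float:
    """Toy scorer that checks the stated order against the video's event order."""
    first, second = caption.split(" before ")
    return 1.0 if video.index(first) < video.index(second) else 0.0


# Illustrative samples (invented for this sketch, not taken from any dataset).
samples = [
    (["a baby eats", "a dog barks"], "a baby eats", "a dog barks"),
    (["a man pours tea", "a man drinks tea"], "a man pours tea", "a man drinks tea"),
]

blind_acc = time_order_accuracy(samples, order_blind_score)
aware_acc = time_order_accuracy(samples, order_aware_score)
print("order-blind scorer accuracy:", blind_acc)
print("order-aware scorer accuracy:", aware_acc)
```

The order-blind scorer assigns the same score to both captions because they mention the same events, so it never prefers the consistent one. This mirrors, in a simplified way, the failure mode the abstract describes for models lacking a sense of time.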

research
04/13/2023

Verbs in Action: Improving verb understanding in video-language models

Understanding verbs is crucial to modelling how people and objects inter...
research
09/15/2022

Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models

Pre-trained vision-language models (e.g., CLIP) have shown promising zer...
research
12/30/2022

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

Video-language pre-training has advanced the performance of various down...
research
06/29/2021

Time-Aware Language Models as Temporal Knowledge Bases

Many facts come with an expiration date, from the name of the President ...
research
04/05/2023

VicTR: Video-conditioned Text Representations for Activity Recognition

Vision-Language models have shown strong performance in the image-domain...
research
10/03/2021

Probing Language Models for Understanding of Temporal Expressions

We present three Natural Language Inference (NLI) challenge sets that ca...
research
08/26/2023

Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models

Text-to-video (T2V) synthesis has gained increasing attention in the com...
