Zero-Shot Video Captioning with Evolving Pseudo-Tokens

07/22/2022
by Yoad Tewel, et al.

We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model and the CLIP image-text matching model. The CLIP matching score is used to steer the language model toward generating a sentence with a high average matching score across a subset of the video frames. Unlike zero-shot image captioning methods, our approach considers the entire sentence at once. This is achieved by optimizing part of the prompt from scratch during generation, modifying the representations of all other tokens in the prompt, and repeating the process iteratively to gradually improve the specificity and comprehensiveness of the generated sentence. Our experiments show that the generated captions are coherent and display a broad range of real-world knowledge. Our code is available at: https://github.com/YoadTew/zero-shot-video-to-text
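To make the steering mechanism concrete, below is a minimal sketch of CLIP-guided decoding: at each step, the top-k GPT-2 continuations are re-ranked by their CLIP similarity to the sampled frames. This greedy re-ranking is a deliberate simplification; the method described above instead back-propagates a CLIP-based loss into learnable pseudo-tokens in the prompt and revisits the whole sentence iteratively. The checkpoints, prompt string, and the steps/k/alpha values here are illustrative assumptions, not the paper's settings.

```python
import torch
from PIL import Image
from transformers import GPT2LMHeadModel, GPT2Tokenizer, CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Both networks stay frozen; only the decoding procedure uses them.
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_frames(frames):
    """Average the normalized CLIP embeddings of a subset of video frames."""
    inputs = clip_proc(images=frames, return_tensors="pt").to(device)
    feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0, keepdim=True)  # (1, d)

@torch.no_grad()
def caption(frames, prompt="A video of", steps=12, k=20, alpha=0.7):
    """Greedy decoding in which CLIP re-ranks the top-k GPT-2 continuations."""
    frame_emb = embed_frames(frames)
    ids = gpt2_tok(prompt, return_tensors="pt").input_ids.to(device)
    for _ in range(steps):
        logits = gpt2(ids).logits[0, -1]  # next-token logits
        topk = logits.topk(k)
        # Score each candidate partial sentence against the frame embedding.
        texts = [gpt2_tok.decode(torch.cat([ids[0], t.view(1)]))
                 for t in topk.indices]
        t_in = clip_proc(text=texts, return_tensors="pt",
                         padding=True, truncation=True).to(device)
        t_emb = clip.get_text_features(**t_in)
        t_emb = t_emb / t_emb.norm(dim=-1, keepdim=True)
        clip_scores = (t_emb @ frame_emb.T).squeeze(-1)  # (k,)
        # Blend visual match with language-model probability (alpha is an
        # illustrative assumption, not a value from the paper).
        score = alpha * (100 * clip_scores).softmax(-1) \
              + (1 - alpha) * topk.values.softmax(-1)
        best = topk.indices[score.argmax()].view(1, 1)
        ids = torch.cat([ids, best], dim=1)
    return gpt2_tok.decode(ids[0])
```

Calling, for example, caption([Image.open(p) for p in frame_paths]) on a handful of sampled frames returns a short caption. The gradient-based pseudo-token updates of the actual method operate on the whole sentence rather than one token at a time, which is what yields the more specific and comprehensive captions reported above.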

Related research

- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic (11/29/2021): Recent text-to-image matching models apply contrastive learning to large...
- Zero-Shot Audio Captioning via Audibility Guidance (09/07/2023): The task of audio captioning is similar in essence to tasks such as imag...
- Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment (07/05/2023): Dense video captioning, a task of localizing meaningful moments and gene...
- ZYN: Zero-Shot Reward Models with Yes-No Questions (08/11/2023): In this work, we address the problem of directing the text generations o...
- Zero-shot Visual Question Answering with Language Model Feedback (05/26/2023): In this paper, we propose a novel language model guided captioning appro...
- Soundify: Matching Sound Effects to Video (12/17/2021): In the art of video editing, sound is really half the story. A skilled v...
- An Inverse Scaling Law for CLIP Training (05/11/2023): CLIP, the first foundation model that connects images and text, has enab...
