MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

04/26/2022
by Yuying Ge, et al.

Dominant pre-training work for video-text retrieval mainly adopts the "dual-encoder" architecture to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations, but detailed local semantics are ignored. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, suggests a possible solution to this limitation. In this work, we investigate masked visual modeling in video-text pre-training with the "dual-encoder" architecture for the first time. We perform Masked visual modeling with Injected LanguagE Semantics (MILES) by employing an extra snapshot video encoder as an evolving "tokenizer" that produces reconstruction targets for masked video patch prediction. Given a corrupted video, the video encoder is trained to recover the text-aligned features of the masked patches by reasoning over the visible regions along the spatial and temporal dimensions, which enhances both the discriminative power of local visual features and fine-grained cross-modality alignment. Our method outperforms state-of-the-art methods on text-to-video retrieval across four datasets under both zero-shot and fine-tuning evaluation protocols. It also surpasses baseline models significantly on zero-shot action recognition, which can be cast as video-to-text retrieval.
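To make the masking-and-reconstruction mechanism concrete, below is a minimal PyTorch sketch of the idea described above: a frozen, momentum-updated snapshot of the video encoder supplies per-patch reconstruction targets, while the online encoder must recover the features of masked patches from the visible context. The `VideoEncoder` interface (a transformer over flattened spatio-temporal patch embeddings), the 40% mask ratio, the momentum coefficient, and the cosine loss are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of MILES-style masked visual modeling with a snapshot
# encoder as an evolving "tokenizer". All hyperparameters are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedVisualModeling(nn.Module):
    def __init__(self, video_encoder: nn.Module, dim: int,
                 mask_ratio: float = 0.4, ema: float = 0.99):
        super().__init__()
        self.encoder = video_encoder                  # trained online
        self.snapshot = copy.deepcopy(video_encoder)  # evolving "tokenizer"
        for p in self.snapshot.parameters():
            p.requires_grad = False
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.mask_ratio = mask_ratio
        self.ema = ema

    @torch.no_grad()
    def update_snapshot(self):
        # Momentum update: the snapshot slowly tracks the online encoder,
        # so the reconstruction targets improve as training progresses.
        for ps, po in zip(self.snapshot.parameters(), self.encoder.parameters()):
            ps.mul_(self.ema).add_(po, alpha=1.0 - self.ema)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) flattened spatio-temporal patch embeddings,
        # where N spans both the spatial and temporal dimensions.
        B, N, D = patches.shape
        num_masked = int(N * self.mask_ratio)
        idx = torch.rand(B, N, device=patches.device).argsort(dim=1)
        mask = torch.zeros(B, N, dtype=torch.bool, device=patches.device)
        mask.scatter_(1, idx[:, :num_masked], True)

        # Corrupt the video: replace masked patches with a learnable token.
        corrupted = torch.where(mask.unsqueeze(-1),
                                self.mask_token.expand(B, N, D), patches)

        # Snapshot encoder sees the full video and yields text-aligned targets.
        with torch.no_grad():
            targets = self.snapshot(patches)

        # Online encoder must reason from the visible regions alone.
        preds = self.encoder(corrupted)

        # Align predictions with snapshot features only at masked positions.
        return 1.0 - F.cosine_similarity(preds[mask], targets[mask], dim=-1).mean()
```

In a full training loop, one would presumably call `update_snapshot()` after each optimizer step and add this reconstruction loss to the dual-encoder contrastive objective, so that the snapshot's targets remain aligned with the text representations as pre-training progresses.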
