Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

03/25/2023
by Peng Jin, et al.

Contrastive learning-based video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance by pursuing semantic interaction on pre-defined video-text pairs. To move beyond this coarse-grained global interaction, we must tackle the challenging "shell-breaking" interactions required for fine-grained cross-modal learning. In this paper, we model video and text as game players using multivariate cooperative game theory, in order to handle the uncertainty of fine-grained semantic interaction, which exhibits diverse granularity, flexible combination, and vague intensity. Concretely, we propose Hierarchical Banzhaf Interaction (HBI) to value possible correspondence between video frames and text words for sensitive and explainable cross-modal contrast. To efficiently realize the cooperative game of multiple video frames and multiple text words, the proposed method clusters the original video frames (text words) and computes the Banzhaf Interaction between the merged tokens. By stacking token merge modules, we achieve cooperative games at different semantic levels. Extensive experiments on commonly used text-video retrieval and video-question answering benchmarks show superior performance and justify the efficacy of HBI. It can also serve as a visualization tool to promote the understanding of cross-modal interaction, which may have a far-reaching impact on the community. Project page is available at https://jpthu17.github.io/HBI/.
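
To make the central quantity concrete, here is a minimal sketch of the standard pairwise Banzhaf interaction index from cooperative game theory, computed by brute force on a toy game. It is not the authors' implementation: the player names, the similarity table `sim`, and the characteristic function `value` are all hypothetical, chosen only to show how the index rewards a frame-word pair that contributes more together than apart.

```python
from itertools import combinations


def banzhaf_interaction(players, i, j, value):
    """Pairwise Banzhaf interaction index of players i and j.

    Averages, over every coalition S of the remaining players, the
    synergy term v(S+{i,j}) - v(S+{i}) - v(S+{j}) + v(S).
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += (
                value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            )
    # There are 2^(n-2) coalitions S that exclude both i and j.
    return total / (2 ** len(others))


# Hypothetical toy game: two video frames and two text words.
players = ["frame_0", "frame_1", "word_0", "word_1"]

# Assumed frame-word similarities (illustrative numbers only).
sim = {
    ("frame_0", "word_0"): 0.9,
    ("frame_0", "word_1"): 0.1,
    ("frame_1", "word_0"): 0.2,
    ("frame_1", "word_1"): 0.8,
}


def value(coalition):
    """Assumed characteristic function: total similarity of all
    frame-word pairs that are both inside the coalition."""
    return sum(s for (f, w), s in sim.items() if f in coalition and w in coalition)


if __name__ == "__main__":
    print(banzhaf_interaction(players, "frame_0", "word_0", value))
    print(banzhaf_interaction(players, "frame_0", "frame_1", value))
```

Because this toy value function is a plain sum over pairs, the index for a frame-word pair reduces to that pair's similarity, and two frames have no interaction. The enumeration is exponential in the number of players, which is one way to read the abstract's motivation for clustering frames and words into merged tokens before computing interactions.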

research
10/10/2022

Contrastive Video-Language Learning with Fine-grained Frame Sampling

Despite recent progress in video and language representation learning, t...
research
03/14/2022

Disentangled Representation Learning for Text-Video Retrieval

Cross-modality interaction is a critical component in Text-Video Retriev...
research
04/21/2023

Rethinking Benchmarks for Cross-modal Image-text Retrieval

Image-text retrieval, as a fundamental and important branch of informati...
research
11/01/2020

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Many real-world video-text tasks involve different levels of granularity...
research
01/10/2019

Reverse-Engineering Satire, or "Paper on Computational Humor Accepted Despite Making Serious Advances"

Humor is an essential human trait. Efforts to understand humor have call...
research
03/23/2023

Plug-and-Play Regulators for Image-Text Matching

Exploiting fine-grained correspondence and visual-semantic alignments ha...
research
02/27/2023

Contrastive Video Question Answering via Video Graph Transformer

We propose to perform video question answering (VideoQA) in a Contrastiv...