Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning

06/21/2022
by   Shuaicheng Li, et al.

Video highlight detection is a crucial yet challenging problem that aims to identify the interesting moments in untrimmed videos. The key to this task lies in effective video representations that jointly pursue two goals, i.e., cross-modal representation learning and fine-grained feature discrimination. In this paper, we tackle these two challenges by not only enriching intra-modality and cross-modality relations for representation modeling but also shaping the features in a discriminative manner. Our proposed method mainly leverages intra-modality encoding and cross-modality co-occurrence encoding for full representation modeling. Specifically, intra-modality encoding augments the modality-wise features and dampens the irrelevant modality via within-modality relation learning on both audio and visual signals. Meanwhile, cross-modality co-occurrence encoding focuses on co-occurring inter-modality relations and selectively captures effective information across modalities. The multi-modal representation is further enhanced by global information abstracted from the local context. In addition, we strengthen the discriminative power of the feature embedding with a hard-pairs guided contrastive learning (HPCL) scheme, in which a hard-pairs sampling strategy mines hard samples to improve feature discrimination. Extensive experiments on two benchmarks demonstrate the effectiveness of the proposed method and its superiority over other state-of-the-art methods.
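
The abstract does not give the HPCL formulation, so the following is only a minimal sketch of the general idea of contrastive learning restricted to hard pairs, not the authors' method. It assumes an InfoNCE-style loss over cosine similarities, takes the hardest positive to be the positive least similar to the anchor, and takes the hard negatives to be the k negatives most similar to the anchor. The function names, the parameter k, and the temperature value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_pair_contrastive_loss(anchor, positives, negatives, k=2, temperature=0.1):
    """InfoNCE-style loss computed on mined hard pairs only (illustrative).

    Assumption: "hard" is defined purely by cosine similarity to the anchor.
    """
    pos_sims = [cosine(anchor, p) for p in positives]
    neg_sims = sorted((cosine(anchor, n) for n in negatives), reverse=True)

    hard_pos = min(pos_sims)   # positive that sits farthest from the anchor
    hard_negs = neg_sims[:k]   # negatives that sit closest to the anchor

    # Logit 0 is the hard positive; the rest are the hard negatives.
    logits = [hard_pos / temperature] + [s / temperature for s in hard_negs]

    # Numerically stable log-sum-exp for the softmax denominator.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


anchor = [1.0, 0.0]
positives = [[0.9, 0.1], [0.7, 0.3]]
easy_loss = hard_pair_contrastive_loss(anchor, positives, [[-1.0, 0.0]], k=1)
hard_loss = hard_pair_contrastive_loss(anchor, positives, [[0.8, 0.6]], k=1)
```

In this toy setting the negative that lies close to the anchor yields a larger loss than the one pointing away from it, which is the reason for mining hard pairs: easy pairs contribute almost no training signal.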

