Shot Contrastive Self-Supervised Learning for Scene Boundary Detection

by Shixing Chen, et al.

Scenes play a crucial role in breaking the storyline of movies and TV episodes into semantically cohesive parts. However, given their complex temporal structure, finding scene boundaries can be a challenging task requiring large amounts of labeled training data. To address this challenge, we present a self-supervised shot contrastive learning approach (ShotCoL) to learn a shot representation that maximizes the similarity between nearby shots compared to randomly selected shots. We show how to apply our learned shot representation to the task of scene boundary detection, offering state-of-the-art performance on the MovieNet dataset while requiring only about 25% of the training labels, using 9x fewer model parameters, and offering 7x faster runtime. To assess the effectiveness of ShotCoL on novel applications of scene boundary detection, we take on the problem of finding timestamps in movies and TV episodes where video ads can be inserted while offering a minimally disruptive viewing experience. To this end, we collected a new dataset called AdCuepoints with 3,975 movies and TV episodes, 2.2 million shots, and 19,119 minimally disruptive ad cue-point labels. We present a thorough empirical analysis on this dataset demonstrating the effectiveness of ShotCoL for ad cue-point detection.
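
The core idea in the abstract, pulling a shot's representation toward nearby shots and away from randomly selected ones, can be illustrated with a small contrastive-loss sketch. This is not the authors' implementation. It assumes an InfoNCE-style loss, assumes the positive is the most similar shot inside a fixed neighborhood window, and draws negatives at random from outside that window. The function name `shot_contrastive_loss` and the parameters `window`, `num_negatives`, and `temperature` are hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def shot_contrastive_loss(embeddings, query_idx, window=2, num_negatives=8,
                          temperature=0.1, rng=None):
    """InfoNCE-style loss for one query shot (illustrative assumption).

    embeddings: (num_shots, dim) array of shot features in temporal order.
    Returns (loss, index of the neighbor chosen as the positive).
    """
    rng = np.random.default_rng() if rng is None else rng
    z = l2_normalize(np.asarray(embeddings, dtype=np.float64))
    n = z.shape[0]

    # Nearby shots: everything within `window` positions of the query.
    lo, hi = max(0, query_idx - window), min(n, query_idx + window + 1)
    neighbors = [i for i in range(lo, hi) if i != query_idx]
    outside = [i for i in range(n) if i < lo or i >= hi]
    if not neighbors or not outside:
        raise ValueError("need at least one neighbor and one outside shot")

    q = z[query_idx]
    # Assumption: the positive is the neighbor most similar to the query.
    pos_idx = max(neighbors, key=lambda i: float(z[i] @ q))
    # Negatives: shots sampled at random from outside the neighborhood.
    k = min(num_negatives, len(outside))
    neg_idx = rng.choice(outside, size=k, replace=False)

    # Softmax cross-entropy with the positive in slot 0.
    logits = np.concatenate(([z[pos_idx] @ q], z[neg_idx] @ q)) / temperature
    logits = logits - logits.max()  # numerical stability
    loss = -(logits[0] - np.log(np.exp(logits).sum()))
    return float(loss), pos_idx
```

In this sketch the loss is small when the query resembles one of its temporal neighbors more than the randomly drawn shots, and large otherwise. In a real training setup the embeddings would come from a trainable encoder and the loss would be minimized by gradient descent; that part is omitted here.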



Scene Consistency Representation Learning for Video Scene Segmentation

A long-term video, such as a movie or TV show, is composed of various sc...

Boundary-aware Self-supervised Learning for Video Scene Segmentation

Self-supervised learning has drawn attention through its effectiveness i...

Watching Too Much Television is Good: Self-Supervised Audio-Visual Representation Learning from Movies and TV Shows

The abundance and ease of utilizing sound, along with the fact that audi...

Movies2Scenes: Learning Scene Representations Using Movie Similarities

Automatic understanding of movie-scenes is an important problem with mul...

Few-Max: Few-Shot Domain Adaptation for Unsupervised Contrastive Representation Learning

Contrastive self-supervised learning methods learn to map data points su...

FROB: Few-shot ROBust Model for Classification and Out-of-Distribution Detection

Nowadays, classification and Out-of-Distribution (OoD) detection in the ...

Serial Speakers: a Dataset of TV Series

For over a decade, TV series have been drawing increasing interest, both...
