Temporally Coherent Bayesian Models for Entity Discovery in Videos by Tracklet Clustering

by Adway Mitra, et al.

A video can be represented as a sequence of tracklets, each spanning 10-20 frames and associated with one entity (e.g. a person). The task of Entity Discovery in videos can therefore be posed naturally as tracklet clustering. We approach this task by leveraging Temporal Coherence (TC): the fundamental property of videos that each tracklet is likely to be associated with the same entity as its temporal neighbors. Our major contributions are the first Bayesian nonparametric models for TC at the tracklet level. We extend the Chinese Restaurant Process (CRP) to propose TC-CRP, and extend it further to the Temporally Coherent Chinese Restaurant Franchise (TC-CRF) to jointly model short temporal segments. On the task of discovering persons in TV series videos without metadata such as scripts, these methods show considerable improvement in cluster purity and person coverage compared to state-of-the-art approaches to tracklet clustering. We represent entities with mixture components and tracklets with vectors of very generic features, so the approach can work for any type of entity (not necessarily a person). Unlike existing approaches, the proposed methods can perform online tracklet clustering on streaming videos with little deterioration in performance, and can automatically reject tracklets resulting from false detections. Finally, we discuss entity-driven video summarization, where some temporal segments of the video are selected automatically based on the discovered entities.
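To make the idea of a temporally coherent clustering prior concrete, here is a minimal sketch. It is not the paper's TC-CRP, whose exact formulation is not given in this abstract. It draws cluster labels from a standard CRP and adds one illustrative assumption: with a hypothetical probability `stay_prob`, a tracklet simply copies the label of its temporal predecessor. The function name and all parameters are invented for this example.

```python
import random


def sample_sticky_crp(n_tracklets, alpha=1.0, stay_prob=0.8, seed=0):
    """Draw cluster labels for a tracklet sequence from a toy prior.

    Assumption (not from the paper): with probability `stay_prob` a tracklet
    reuses its predecessor's label; otherwise its label is drawn from a
    standard CRP with concentration `alpha` (must be > 0).
    """
    rng = random.Random(seed)
    assignments = []  # cluster label of each tracklet, in temporal order
    counts = []       # number of tracklets assigned to each cluster so far

    for i in range(n_tracklets):
        if i > 0 and rng.random() < stay_prob:
            # Temporal coherence: copy the previous tracklet's label.
            k = assignments[-1]
        else:
            # Standard CRP draw: an existing cluster is chosen in proportion
            # to its size, a new cluster in proportion to alpha.
            weights = counts + [alpha]
            k = rng.choices(range(len(weights)), weights=weights)[0]
            if k == len(counts):
                counts.append(0)  # open a new cluster
        counts[k] += 1
        assignments.append(k)
    return assignments


if __name__ == "__main__":
    print(sample_sticky_crp(20))
```

Setting `stay_prob=0.0` recovers the plain CRP prior, while larger values produce longer runs of consecutive tracklets sharing a label, which is the qualitative behaviour that temporal coherence is meant to encourage.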




