Recognizing and Presenting the Storytelling Video Structure with Deep Multimodal Networks

by   Lorenzo Baraldi, et al.

This paper presents a novel approach for temporal and semantic segmentation of edited videos into meaningful segments, from the point of view of the storytelling structure. The objective is to decompose a long video into more manageable sequences, which can in turn be used to retrieve the most significant parts of it given a textual query and to provide an effective summarization. Previous video decomposition methods mainly employed perceptual cues, tackling the problem either as a story change detection, or as a similarity grouping task, and the lack of semantics limited their ability to identify story boundaries. Our proposal connects together perceptual, audio and semantic cues in a specialized deep network architecture designed with a combination of CNNs which generate an appropriate embedding, and clusters shots into connected sequences of semantic scenes, i.e. stories. A retrieval presentation strategy is also proposed, by selecting the semantically and aesthetically "most valuable" thumbnails to present, considering the query in order to improve the storytelling presentation. Finally, the subjective nature of the task is considered, by conducting experiments with different annotators and by proposing an algorithm to maximize the agreement between automatic results and human annotators.


page 3

page 7

page 11

page 12


Scene-driven Retrieval in Edited Videos using Aesthetic and Semantic Deep Features

This paper presents a novel retrieval pipeline for video collections, wh...

Condensed Movies: Story Based Retrieval with Contextual Embeddings

Our objective in this work is the long range understanding of the narrat...

Semantic Video Trailers

Query-based video summarization is the task of creating a brief visual t...

Story Understanding in Video Advertisements

In order to resonate with the viewers, many video advertisements explore...

A Memory Network Approach for Story-based Temporal Summarization of 360° Videos

We address the problem of story-based temporal summarization of long 360...

Unsupervised Semantic Parsing of Video Collections

Human communication typically has an underlying structure. This is refle...

A Restricted Visual Turing Test for Deep Scene and Event Understanding

This paper presents a restricted visual Turing test (VTT) for story-line...

Please sign up or login with your details

Forgot password? Click here to reset