How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

by Mantas Mazeika, et al.

In recent years, deep neural networks have demonstrated increasingly strong abilities to recognize objects and activities in videos. However, as video understanding becomes widely used in real-world applications, a key consideration is developing human-centric systems that understand not only the content of the video but also how it would affect the wellbeing and emotional state of viewers. To facilitate research in this setting, we introduce two large-scale datasets with over 60,000 videos manually annotated for emotional response and subjective wellbeing. The Video Cognitive Empathy (VCE) dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states. The Video to Valence (V2V) dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing. In experiments, we show how video models that are primarily trained to recognize actions and find contours of objects can be repurposed to understand human preferences and the emotional content of videos. Although there is room for improvement, predicting wellbeing and emotional response is on the horizon for state-of-the-art models. We hope our datasets can help foster further advances at the intersection of commonsense video understanding and human preference learning.


