Deep Learning Based Multi-modal Addressee Recognition in Visual Scenes with Utterances

09/12/2018
by   Thao Minh Le, et al.

With the widespread use of intelligent systems such as smart speakers, addressee recognition has become a key concern in human-computer interaction: people increasingly expect such systems to understand complicated social scenes, including those outdoors, in cafeterias, and in hospitals. Because previous studies typically focused on pre-specified tasks in limited conversational situations, such as controlling smart homes, we created a mock dataset called Addressee Recognition in Visual Scenes with Utterances (ARVSU), which contains a wide variety of visual scenes, each annotated with an utterance and the corresponding addressee. We also propose a multi-modal deep-learning-based model that takes different human cues into account, specifically eye gaze and utterance transcripts, to predict the conversational addressee from a specific speaker's view in various real-life conversational scenarios. To the best of our knowledge, ours is the first end-to-end deep learning model that combines vision and utterance transcripts for addressee recognition. Our study suggests that future addressee-recognition systems can learn to understand human intention in many previously unexplored social situations, and our multi-modal dataset is a first step toward promoting research in this field.
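The abstract does not detail the model architecture, but the core idea of fusing eye-gaze cues with utterance transcripts can be sketched as a simple late-fusion classifier. The sketch below is a minimal illustration, not the authors' actual model: the feature dimensions, the mean-pooled word embeddings, and the random weights are all hypothetical stand-ins for learned encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_gaze(gaze_vec, W):
    # Project raw gaze features (e.g. speaker head/eye direction) into a shared space.
    return np.tanh(W @ gaze_vec)

def encode_utterance(token_ids, emb):
    # Mean-pool word embeddings of the transcript (a stand-in for a learned text encoder).
    return emb[token_ids].mean(axis=0)

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical dimensions: 8-d gaze features, vocabulary of 100 words,
# 16-d embeddings, 3 addressee classes (e.g. person A, person B, the system).
W_gaze = rng.normal(size=(16, 8))
emb = rng.normal(size=(100, 16))
W_out = rng.normal(size=(3, 32))

gaze = rng.normal(size=8)          # gaze features from one frame
tokens = np.array([4, 17, 42])     # token ids of the utterance transcript

# Late fusion: concatenate the two modality encodings, then classify.
fused = np.concatenate([encode_gaze(gaze, W_gaze), encode_utterance(tokens, emb)])
probs = softmax(W_out @ fused)
pred = int(np.argmax(probs))       # predicted addressee class
```

In an end-to-end trained model the encoders and the fusion layer would be learned jointly from the annotated (scene, utterance, addressee) triples; the point of the sketch is only the shape of the pipeline, not its parameters.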

