Gated Multimodal Fusion with Contrastive Learning for Turn-taking Prediction in Human-robot Dialogue

by Jiudong Yang et al.

Turn-taking, which aims to decide when the next speaker can start talking, is an essential component in building human-robot spoken dialogue systems. Previous studies indicate that multimodal cues can facilitate this challenging task. However, due to the paucity of public multimodal datasets, current methods are mostly limited to either unimodal features or simplistic multimodal ensemble models. Moreover, the inherent class imbalance in real scenarios, e.g., a sentence ending with a short pause will mostly be regarded as the end of the turn, also poses a great challenge to the turn-taking decision. In this paper, we first collect a large-scale annotated corpus for turn-taking with over 5,000 real human-robot dialogues in speech and text modalities. Then, a novel gated multimodal fusion mechanism is devised to utilize various information seamlessly for turn-taking prediction. More importantly, to tackle the data imbalance issue, we design a simple yet effective data augmentation method to construct negative instances without supervision and apply contrastive learning to obtain better feature representations. Extensive experiments are conducted, and the results demonstrate the superiority and competitiveness of our model over several state-of-the-art baselines.
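The gating idea in the abstract can be sketched compactly: each modality is projected into a shared space, and a learned sigmoid gate decides, per dimension, how much of the audio versus text representation to keep. The snippet below is a minimal illustration of this pattern in NumPy, not the authors' implementation; the weight matrices are random stand-ins for parameters that would be trained end-to-end, and all names (`W_a`, `W_t`, `W_g`, `gated_fusion`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d_audio, d_text, d_h = 16, 32, 8  # illustrative feature sizes

# Hypothetical projection and gate weights; in the paper these
# would be learned jointly with the turn-taking classifier.
W_a = rng.normal(size=(d_h, d_audio))
W_t = rng.normal(size=(d_h, d_text))
W_g = rng.normal(size=(d_h, d_audio + d_text))

def gated_fusion(a, t):
    """Fuse audio and text features with a per-dimension gate.

    h_a, h_t: modality-specific projections (tanh-squashed)
    g:        sigmoid gate in (0, 1) weighting audio vs. text
    """
    h_a = np.tanh(W_a @ a)
    h_t = np.tanh(W_t @ t)
    g = sigmoid(W_g @ np.concatenate([a, t]))
    return g * h_a + (1.0 - g) * h_t

a = rng.normal(size=d_audio)  # e.g. acoustic embedding near a pause
t = rng.normal(size=d_text)   # e.g. textual embedding of the utterance
fused = gated_fusion(a, t)
print(fused.shape)  # (8,)
```

Because the output is a convex combination of two tanh-squashed vectors, the fused representation stays bounded, which keeps the downstream classifier's input well-scaled regardless of which modality dominates the gate.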



Spiking Neural Networks for Early Prediction in Human Robot Collaboration

This paper introduces the Turn-Taking Spiking Neural Network (TTSNet), w...

Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs

In human conversational interactions, turn-taking exchanges can be coord...

Applying the Wizard-of-Oz Technique to Multimodal Human-Robot Dialogue

Our overall program objective is to provide more natural ways for soldie...

Duplex Conversation: Towards Human-like Interaction in Spoken Dialogue Systems

In this paper, we present Duplex Conversation, a multi-turn, multimodal ...

CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection

Compared with unimodal data, multimodal data can provide more features t...

Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition

It has been a hot research topic to enable machines to understand human ...

Towards Objective Evaluation of Socially-Situated Conversational Robots: Assessing Human-Likeness through Multimodal User Behaviors

This paper tackles the challenging task of evaluating socially situated ...
