Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning

by Uttaran Bhattacharya, et al.

We present a generative adversarial network to synthesize 3D pose sequences of co-speech upper-body gestures with appropriate affective expressions. Our network consists of two components: a generator to synthesize gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator to distinguish between the synthesized pose sequences and real 3D pose sequences. We leverage the Mel-frequency cepstral coefficients and the text transcript computed from the input speech in separate encoders in our generator to learn the desired sentiments and the associated affective cues. We design an affective encoder using multi-scale spatial-temporal graph convolutions to transform 3D pose sequences into latent, pose-based affective features. We use our affective encoder in both our generator, where it learns affective features from the seed poses to guide the gesture synthesis, and our discriminator, where it enforces the synthesized gestures to contain the appropriate affective expressions. We perform extensive evaluations on two benchmark datasets for gesture synthesis from speech, the TED Gesture Dataset and the GENEA Challenge 2020 Dataset. Compared to the best baselines, we improve the mean absolute joint error by 10–33%, the mean acceleration difference by 8–58%, and the Fréchet Gesture Distance by 21–34%. We also conduct a user study and observe that, compared to the current baselines, around 15.28% of participants found our synthesized gestures more plausible, and around 16.32% felt our gestures had more appropriate affective expressions aligned with the speech.
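The evaluation metrics named above can be made concrete. Below is a minimal sketch, not the authors' code, of how the mean absolute joint error and the mean acceleration difference between a synthesized and a ground-truth 3D pose sequence are typically computed; the array shapes and the use of second-order finite differences for acceleration are assumptions for illustration.

```python
import numpy as np

def mean_absolute_joint_error(pred, gt):
    """Mean absolute positional error over all frames and joints.

    pred, gt: arrays of shape (T, J, 3) -- T frames, J joints, 3D coordinates.
    (Shapes are illustrative assumptions, not the paper's exact layout.)
    """
    return np.mean(np.abs(pred - gt))

def mean_acceleration_difference(pred, gt):
    """Mean absolute difference between per-joint accelerations,
    where acceleration is approximated by second-order finite
    differences along the time axis."""
    acc_pred = np.diff(pred, n=2, axis=0)
    acc_gt = np.diff(gt, n=2, axis=0)
    return np.mean(np.abs(acc_pred - acc_gt))

# Toy usage: a ground-truth sequence and a noisy "synthesized" copy.
rng = np.random.default_rng(0)
gt = rng.standard_normal((30, 10, 3))          # 30 frames, 10 joints
pred = gt + 0.05 * rng.standard_normal(gt.shape)

maje = mean_absolute_joint_error(pred, gt)
mad = mean_acceleration_difference(pred, gt)
```

Lower values indicate synthesized motion that tracks the ground truth more closely, both in joint positions and in how smoothly those positions change over time.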




