Audio-Driven Talking Face Generation with Diverse yet Realistic Facial Animations

by Rongliang Wu, et al.

Audio-driven talking face generation aims to synthesize talking faces with realistic facial animations, including accurate lip movements, vivid facial expression details and natural head poses, that correspond to the driving audio, and it has achieved rapid progress in recent years. However, most existing work focuses on generating lip movements only, without handling the closely correlated facial expressions, which greatly degrades the realism of the generated faces. This paper presents DIRFA, a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio. To accommodate the fair variation of plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network that models the distribution of facial animations conditioned on the input audio and autoregressively converts the audio signal into a facial animation sequence. In addition, we introduce a temporally-biased mask into the mapping network, which allows it to model the temporal dependency of facial animations and produce temporally smooth facial animation sequences. Given the generated facial animation sequence and a source image, photo-realistic talking faces can be synthesized with a generic generation network. Extensive experiments show that DIRFA can effectively generate talking faces with realistic facial animations.
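To make the idea of a temporally-biased mask concrete, here is a minimal sketch in NumPy. It combines a causal (autoregressive) mask with an additive penalty that grows with temporal distance, so each animation frame attends mostly to recent past frames. Note that the exact bias formulation used in DIRFA is not given in the abstract; the linear distance penalty and the `bias_scale` parameter below are illustrative assumptions.

```python
import numpy as np

def temporally_biased_mask(seq_len, bias_scale=0.1):
    """Causal attention mask with a linear temporal bias.

    Position i may only attend to positions j <= i (autoregressive),
    and more distant past frames receive a larger additive penalty,
    which encourages temporally smooth animation sequences.
    The linear penalty form is an assumption for illustration.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    # Future positions are fully masked; past positions are biased by distance.
    return np.where(j > i, -np.inf, -bias_scale * (i - j).astype(float))

def attention_weights(scores, mask):
    """Softmax over masked, temporally-biased attention scores."""
    z = scores + mask
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

# Example: uniform raw scores over a 4-frame animation sequence.
scores = np.zeros((4, 4))
w = attention_weights(scores, temporally_biased_mask(4))
```

With uniform scores, the bias alone shapes the weights: the first frame attends only to itself, and later frames weight recent frames more heavily than distant ones.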




Related papers:

- Facial Keypoint Sequence Generation from Audio
- Speech-Driven 3D Face Animation with Composite and Regional Facial Movements
- Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention
- Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors
- DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks
- PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering
- MusicFace: Music-driven Expressive Singing Face Synthesis
