Identity-Preserving Talking Face Generation with Landmark and Appearance Priors

by Weizhi Zhong et al.

Generating talking face videos from audio has attracted considerable research interest. A few person-specific methods can generate vivid videos but require the target speaker's videos for training or fine-tuning. Existing person-generic methods struggle to generate realistic, lip-synced videos while preserving identity information. To tackle this problem, we propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures. First, we devise a novel Transformer-based landmark generator to infer lip and jaw landmarks from the audio. Prior landmark characteristics of the speaker's face are employed to make the generated landmarks coincide with the facial outline of the speaker. Then, a video rendering model is built to translate the generated landmarks into face images. During this stage, prior appearance information is extracted from the lower-half-occluded target face and static reference images, which helps generate realistic and identity-preserving visual content. To effectively exploit the prior information in the static reference images, we align them with the target face's pose and expression based on motion fields. Moreover, auditory features are reused to guarantee that the generated face images are well synchronized with the audio. Extensive experiments demonstrate that our method produces more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
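As a rough illustration of the two-stage interface described in the abstract (this is not the paper's implementation; all shapes, function names, and the stand-in computations are hypothetical), the pipeline can be sketched as:

```python
import numpy as np

AUDIO_DIM, N_LIP_JAW = 80, 40  # hypothetical: mel-feature dim, lip+jaw landmark count

def generate_landmarks(audio_feats, prior_landmarks, rng=np.random.default_rng(0)):
    """Stage 1 stand-in: map per-frame audio features to lip/jaw landmark
    offsets, then add them to the speaker's prior landmarks so the output
    keeps the speaker's facial outline. A random linear map stands in for
    the Transformer-based landmark generator."""
    W = rng.standard_normal((AUDIO_DIM, N_LIP_JAW * 2)) * 0.01
    offsets = (audio_feats @ W).reshape(-1, N_LIP_JAW, 2)
    return prior_landmarks[None] + offsets  # (T, N_LIP_JAW, 2)

def render_frames(landmarks, occluded_target, reference_imgs):
    """Stage 2 stand-in: combine appearance priors from the lower-half-
    occluded target face and static reference images to produce frames.
    A per-pixel average stands in for the motion-field alignment and the
    rendering network conditioned on landmarks and reused audio features."""
    T = landmarks.shape[0]
    appearance = (occluded_target + reference_imgs.mean(axis=0)) / 2.0
    return np.repeat(appearance[None], T, axis=0)  # (T, H, W, 3)

# Toy driving signal: 5 frames of audio features, neutral prior landmarks.
T, H, W = 5, 64, 64
audio = np.zeros((T, AUDIO_DIM))
prior = np.zeros((N_LIP_JAW, 2))
lms = generate_landmarks(audio, prior)
frames = render_frames(lms, np.zeros((H, W, 3)), np.zeros((3, H, W, 3)))
print(lms.shape, frames.shape)  # (5, 40, 2) (5, 64, 64, 3)
```

The sketch only mirrors the data flow: audio plus landmark priors yield landmark sequences, which together with appearance priors yield video frames.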

