A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild

08/23/2020
by   K R Prajwal, et al.

In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Current works excel at producing accurate lip movements on a static image or on videos of specific people seen during training. However, they fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking face videos, leaving significant parts of the video out-of-sync with the new audio. We identify the key reasons behind this failure and resolve them by learning from a powerful lip-sync discriminator. Next, we propose new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos. Extensive quantitative evaluations on our challenging benchmarks show that the lip-sync accuracy of the videos generated by our Wav2Lip model is almost as good as that of real synced videos. We provide a demo video clearly showing the substantial impact of our Wav2Lip model and evaluation benchmarks on our website: <cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild>. The code and models are released at this GitHub repository: <github.com/Rudrabha/Wav2Lip>. You can also try out the interactive demo at this link: <bhaasha.iiit.ac.in/lipsync>.
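The key technical move is to supervise the generator with a strong, pre-trained lip-sync discriminator (a SyncNet-style "expert") that stays frozen while the generator trains. Below is a minimal PyTorch sketch of how such an expert sync loss could be wired up; the encoder module names, tensor shapes, and the clamped cosine-to-probability mapping are illustrative assumptions for this sketch, not the released implementation (see the GitHub repository above for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertSyncLoss(nn.Module):
    """Sync penalty from a frozen, pre-trained SyncNet-style expert.

    `audio_encoder` and `face_encoder` are assumed to map a mel window and
    a stack of generated face crops to embeddings of the same dimension;
    the names and shapes are illustrative, not the released Wav2Lip code.
    """

    def __init__(self, audio_encoder: nn.Module, face_encoder: nn.Module):
        super().__init__()
        self.audio_encoder = audio_encoder
        self.face_encoder = face_encoder
        # The expert stays frozen: gradients flow only into the generator.
        for p in self.parameters():
            p.requires_grad = False
        self.eval()

    def forward(self, mel: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        # mel:    (B, 1, n_mels, T)        audio window as a mel-spectrogram
        # frames: (B, 3 * n_frames, H, W)  generated face crops, channel-stacked
        a = F.normalize(self.audio_encoder(mel), p=2, dim=1)
        v = F.normalize(self.face_encoder(frames), p=2, dim=1)
        # Cosine similarity treated as a "probability of being in sync";
        # clamping keeps the log well-defined when similarity is <= 0.
        p_sync = (a * v).sum(dim=1).clamp(min=1e-7, max=1.0)
        # Binary cross-entropy against the "in sync" label collapses
        # to -log(p_sync): the generator is pushed to produce frames
        # the frozen expert judges to be synchronized with the audio.
        return -torch.log(p_sync).mean()
```

In practice this term would be weighted alongside a pixel-level reconstruction loss (and, in the paper, an additional visual-quality discriminator), since the sync expert alone says nothing about image fidelity.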

Related research

12/20/2020 | Visual Speech Enhancement Without A Real Visual Stream
In this work, we re-think the task of speech enhancement in unconstraine...

09/01/2022 | Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild
In this work, we address the problem of generating speech from silent li...

03/29/2023 | Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert
Talking face generation, also known as speech-to-lip generation, reconst...

04/22/2021 | Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
While accurate lip synchronization has been achieved for arbitrary-subje...

08/18/2023 | Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization
The task of lip synchronization (lip-sync) seeks to match the lips of hu...

02/12/2019 | Puppet Dubbing
Dubbing puppet videos to make the characters (e.g. Kermit the Frog) conv...

11/19/2021 | More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. ...
