Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis

08/31/2023
by Linsen Song, et al.

Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production, which requires massive training data and long training time to learn a person-specific audio-to-video mapping. In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production. There are two unique challenges in designing a method for UGC: 1) the appearances of speakers are diverse and arbitrary, as the method needs to generalize across users; 2) the available video data of any one speaker are very limited. To tackle these challenges, we first introduce a new Style Translation Network that integrates the speaking style of the target and the speaking content of the source via a cross-modal AdaIN module, enabling our model to quickly adapt to a new speaker. We then develop a semi-parametric video renderer that takes full advantage of the unseen speaker's limited training data via a video-level retrieve-warp-refine pipeline. Finally, we propose a temporal regularization for the semi-parametric renderer, yielding more temporally continuous videos. Extensive experiments show that our method generates videos that accurately preserve diverse speaking styles while requiring considerably less training data and training time than existing methods. In addition, our method achieves faster inference than most recent methods.
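Since the abstract only names the cross-modal AdaIN module, the sketch below illustrates the general technique that term refers to: instance-normalizing audio-derived content features, then re-modulating them with scale and shift parameters predicted from a target speaker's style code. All names, dimensions, and the style-encoder interface here are assumptions for illustration, not the paper's released code.

```python
# Hedged sketch of a cross-modal AdaIN layer (PyTorch assumed).
# CrossModalAdaIN, content_dim, and style_dim are hypothetical names;
# the paper's actual architecture may differ.
import torch
import torch.nn as nn

class CrossModalAdaIN(nn.Module):
    """Modulate audio content features with a target speaker's style code."""
    def __init__(self, content_dim: int = 256, style_dim: int = 128):
        super().__init__()
        # Instance norm strips the source speaker's feature statistics...
        self.norm = nn.InstanceNorm1d(content_dim, affine=False)
        # ...and linear heads predict new scale/shift from the style code.
        self.to_scale = nn.Linear(style_dim, content_dim)
        self.to_shift = nn.Linear(style_dim, content_dim)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content: (batch, content_dim, time) features from the source audio
        # style:   (batch, style_dim) embedding of the target speaker
        normalized = self.norm(content)
        scale = self.to_scale(style).unsqueeze(-1)  # broadcast over time
        shift = self.to_shift(style).unsqueeze(-1)
        return (1 + scale) * normalized + shift

# Usage: restyle 100 frames of content features with a target style code.
features = CrossModalAdaIN()(torch.randn(2, 256, 100), torch.randn(2, 128))
```

The temporal regularization is likewise unspecified in the abstract; a common choice, shown below purely as an assumption, is to penalize frame-to-frame differences in the rendered video so the semi-parametric output stays continuous.

```python
def temporal_smoothness_loss(frames: torch.Tensor) -> torch.Tensor:
    # frames: (batch, time, channels, height, width) rendered video clip.
    # Penalizing adjacent-frame differences discourages flicker; the
    # paper's exact regularizer is not given in the abstract.
    return (frames[:, 1:] - frames[:, :-1]).abs().mean()
```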
