Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis

Linsen Song, Wayne Wu, Chaoyou Fu, Chen Change Loy, Ran He*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)

Abstract

Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production, which requires massive training data and training time to learn a person-specific audio-video mapping. In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production. There are two unique challenges in designing a method for UGC: 1) speaker appearances are diverse and arbitrary, so the method must generalize across users; 2) the available video data for any single speaker are very limited. To tackle these challenges, we first introduce a new Style Translation Network that integrates the speaking style of the target and the speaking content of the source via a cross-modal AdaIN module, enabling our model to quickly adapt to a new speaker. We then develop a semi-parametric video renderer that takes full advantage of the limited training data of the unseen speaker via a video-level retrieve-warp-refine pipeline. Finally, we propose a temporal regularization for the semi-parametric renderer, which yields more continuous videos. Extensive experiments show that our method generates videos that accurately preserve various speaking styles while requiring considerably less training data and training time than existing methods. In addition, our method achieves a faster testing speed than most recent methods.
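To illustrate the cross-modal AdaIN idea described in the abstract, below is a minimal PyTorch sketch (not the authors' code): a style embedding from one modality (e.g., the target speaker's reference video) predicts per-channel scale and bias that modulate instance-normalized content features from another modality (e.g., the source audio). The class and variable names, feature dimensions, and the (1 + scale) formulation are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a cross-modal AdaIN layer, assuming PyTorch.
import torch
import torch.nn as nn


class CrossModalAdaIN(nn.Module):
    def __init__(self, content_channels: int, style_dim: int):
        super().__init__()
        # Normalize content features without learned affine parameters;
        # the affine transform is predicted from the style embedding instead.
        self.norm = nn.InstanceNorm1d(content_channels, affine=False)
        # Map the style embedding to per-channel (scale, bias).
        self.to_affine = nn.Linear(style_dim, 2 * content_channels)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content: (batch, channels, time) features from the source audio
        # style:   (batch, style_dim) embedding of the target speaker's style
        scale, bias = self.to_affine(style).chunk(2, dim=1)
        normalized = self.norm(content)
        return (1 + scale.unsqueeze(-1)) * normalized + bias.unsqueeze(-1)


if __name__ == "__main__":
    layer = CrossModalAdaIN(content_channels=256, style_dim=128)
    audio_content = torch.randn(4, 256, 32)  # hypothetical audio content features
    speaker_style = torch.randn(4, 128)      # hypothetical speaker style embedding
    out = layer(audio_content, speaker_style)
    print(out.shape)  # torch.Size([4, 256, 32])
```

Because the style-dependent affine parameters are produced by a small linear head rather than learned per speaker, adapting to a new speaker in such a design only requires a new style embedding, which is consistent with the fast-adaptation claim in the abstract.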

Original language: English
Pages (from-to): 1247-1261
Number of pages: 15
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 33
Issue number: 3
DOIs
Publication status: Published - Mar 1 2023
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 1991-2012 IEEE.

ASJC Scopus Subject Areas

  • Media Technology
  • Electrical and Electronic Engineering

Keywords

  • GAN
  • Talking face generation
  • thin-plate spline
  • video generation
