Everybody's Talkin': Let Me Talk as You Want

Linsen Song; Wayne Wu; Chen Qian; Ran He; Chen Change Loy

doi:10.1109/TIFS.2022.3146783

Linsen Song, Wayne Wu, Chen Qian, Ran He, Chen Change Loy^*

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

78 Citations (Scopus)

Abstract

We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic. It does not assume a person-specific rendering network, yet it is capable of translating one source audio into one randomly chosen video output within a set of speech videos. Instead of learning a highly heterogeneous and nonlinear mapping from audio to the video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, thereby preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.
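
The parameter recombination described in the abstract (expression taken from the audio, geometry and pose kept from the target frame) can be illustrated with a minimal sketch. This is not the authors' implementation: the `FaceParams` container, the `swap_expression` helper, and the use of plain tuples of floats are all hypothetical names and simplifications chosen for illustration, and the 3D reconstruction, recurrent network, and rendering network are assumed to exist elsewhere.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical per-frame parameters from monocular 3D face reconstruction."""
    expression: Tuple[float, ...]
    geometry: Tuple[float, ...]
    pose: Tuple[float, ...]


def swap_expression(
    target_frames: Sequence[FaceParams],
    audio_expressions: Sequence[Sequence[float]],
) -> List[FaceParams]:
    """Replace each frame's expression with an audio-derived one.

    Geometry and pose are copied unchanged from the target frame, which is
    what keeps the context of the original footage in this sketch.
    """
    if len(target_frames) != len(audio_expressions):
        raise ValueError("need exactly one expression vector per target frame")
    return [
        replace(frame, expression=tuple(expr))
        for frame, expr in zip(target_frames, audio_expressions)
    ]
```

In such a setup, the returned parameters would be handed to a renderer one frame at a time; how the frames are made temporally coherent is outside the scope of this sketch.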

Original language	English
Pages (from-to)	585-598
Number of pages	14
Journal	IEEE Transactions on Information Forensics and Security
Volume	17
DOIs	https://doi.org/10.1109/TIFS.2022.3146783
Publication status	Published - 2022
Externally published	Yes

Bibliographical note

Publisher Copyright:
© 2005-2012 IEEE.

ASJC Scopus Subject Areas

Safety, Risk, Reliability and Quality
Computer Networks and Communications

Keywords

audio dubbing
GAN
Talking face generation
video generation

Access to Document

10.1109/TIFS.2022.3146783

Cite this

@article{b7204d4e5d58492ea97ec44c0bfea9e7,

title = "Everybody's Talkin': Let Me Talk as You Want",

abstract = "We present a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic. It does not assume a person-specific rendering network yet capable of translating one source audio into one random chosen video output within a set of speech videos. Instead of learning a highly heterogeneous and nonlinear mapping from audio to the video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, therefore preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.",

keywords = "audio dubbing, GAN, Talking face generation, video generation",

author = "Linsen Song and Wayne Wu and Chen Qian and Ran He and Loy, {Chen Change}",

note = "Publisher Copyright: {\textcopyright} 2005-2012 IEEE.",

year = "2022",

doi = "10.1109/TIFS.2022.3146783",

language = "English",

volume = "17",

pages = "585--598",

journal = "IEEE Transactions on Information Forensics and Security",

issn = "1556-6013",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Everybody's Talkin': Let Me Talk as You Want

AU - Song, Linsen

AU - Wu, Wayne

AU - Qian, Chen

AU - He, Ran

AU - Loy, Chen Change

PY - 2022

Y1 - 2022

N2 - We present a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic. It does not assume a person-specific rendering network yet capable of translating one source audio into one random chosen video output within a set of speech videos. Instead of learning a highly heterogeneous and nonlinear mapping from audio to the video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, therefore preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.

AB - We present a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic. It does not assume a person-specific rendering network yet capable of translating one source audio into one random chosen video output within a set of speech videos. Instead of learning a highly heterogeneous and nonlinear mapping from audio to the video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, therefore preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.

KW - audio dubbing

KW - GAN

KW - Talking face generation

KW - video generation

UR - http://www.scopus.com/inward/record.url?scp=85123783224&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85123783224&partnerID=8YFLogxK

U2 - 10.1109/TIFS.2022.3146783

DO - 10.1109/TIFS.2022.3146783

M3 - Article

AN - SCOPUS:85123783224

SN - 1556-6013

VL - 17

SP - 585

EP - 598

JO - IEEE Transactions on Information Forensics and Security

JF - IEEE Transactions on Information Forensics and Security

ER -
