MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation

Kaisiyuan Wang; Qianyi Wu; Linsen Song; Zhuoqian Yang; Wayne Wu; Chen Qian; Ran He; Yu Qiao; Chen Change Loy

doi:10.1007/978-3-030-58589-1_42

Corresponding author: Wayne Wu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The synthesis of natural emotional reactions is an essential criterion in vivid talking-face video generation. This criterion is nevertheless seldom taken into consideration in previous works due to the absence of a large-scale, high-quality emotional audio-visual dataset. To address this issue, we build the Multi-view Emotional Audio-visual Dataset (MEAD), a talking-face video corpus featuring 60 actors and actresses talking with eight different emotions at three different intensity levels. High-quality audio-visual clips are captured at seven different view angles in a strictly-controlled environment. Together with the dataset, we release an emotional talking-face generation baseline that enables the manipulation of both emotion and its intensity. Our dataset could benefit a number of different research fields including conditional generation, cross-modal understanding and expression recognition. Code, model and data are publicly available on our project page: https://wywu.github.io/projects/MEAD/MEAD.html.

Original language	English
Title of host publication	Computer Vision – ECCV 2020 - 16th European Conference 2020, Proceedings
Editors	Andrea Vedaldi, Horst Bischof, Thomas Brox, Jan-Michael Frahm
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	700-717
Number of pages	18
ISBN (Print)	9783030585884
DOIs	https://doi.org/10.1007/978-3-030-58589-1_42
Publication status	Published - 2020
Externally published	Yes
Event	16th European Conference on Computer Vision, ECCV 2020 - Glasgow, United Kingdom; Duration: Aug 23 2020 → Aug 28 2020

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	12366 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	16th European Conference on Computer Vision, ECCV 2020
Country/Territory	United Kingdom
City	Glasgow
Period	8/23/20 → 8/28/20

Bibliographical note

Publisher Copyright:
© 2020, Springer Nature Switzerland AG.

ASJC Scopus Subject Areas

Theoretical Computer Science
General Computer Science

Keywords

Generative adversarial networks
Representation disentanglement
Video generation

Access to Document

10.1007/978-3-030-58589-1_42

Cite this

Wang, K., Wu, Q., Song, L., Yang, Z., Wu, W., Qian, C., He, R., Qiao, Y., & Loy, C. C. (2020). MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer Vision – ECCV 2020 - 16th European Conference 2020, Proceedings (pp. 700-717). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12366 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-58589-1_42

Wang, Kaisiyuan ; Wu, Qianyi ; Song, Linsen et al. / MEAD : A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation. Computer Vision – ECCV 2020 - 16th European Conference 2020, Proceedings. editor / Andrea Vedaldi ; Horst Bischof ; Thomas Brox ; Jan-Michael Frahm. Springer Science and Business Media Deutschland GmbH, 2020. pp. 700-717 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{8a1a33e01f9c46f582f8e00f9788d613,
title = "MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation",
abstract = "The synthesis of natural emotional reactions is an essential criterion in vivid talking-face video generation. This criterion is nevertheless seldom taken into consideration in previous works due to the absence of a large-scale, high-quality emotional audio-visual dataset. To address this issue, we build the Multi-view Emotional Audio-visual Dataset (MEAD), a talking-face video corpus featuring 60 actors and actresses talking with eight different emotions at three different intensity levels. High-quality audio-visual clips are captured at seven different view angles in a strictly-controlled environment. Together with the dataset, we release an emotional talking-face generation baseline that enables the manipulation of both emotion and its intensity. Our dataset could benefit a number of different research fields including conditional generation, cross-modal understanding and expression recognition. Code, model and data are publicly available on our project page: https://wywu.github.io/projects/MEAD/MEAD.html.",
keywords = "Generative adversarial networks, Representation disentanglement, Video generation",
author = "Kaisiyuan Wang and Qianyi Wu and Linsen Song and Zhuoqian Yang and Wayne Wu and Chen Qian and Ran He and Yu Qiao and Loy, Chen Change",
note = "Publisher Copyright: {\textcopyright} 2020, Springer Nature Switzerland AG.; 16th European Conference on Computer Vision, ECCV 2020 ; Conference date: 23-08-2020 Through 28-08-2020",
year = "2020",
doi = "10.1007/978-3-030-58589-1_42",
language = "English",
isbn = "9783030585884",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Science and Business Media Deutschland GmbH",
pages = "700--717",
editor = "Andrea Vedaldi and Horst Bischof and Thomas Brox and Jan-Michael Frahm",
booktitle = "Computer Vision – ECCV 2020 - 16th European Conference 2020, Proceedings",
address = "Germany",
}

Wang, K, Wu, Q, Song, L, Yang, Z, Wu, W, Qian, C, He, R, Qiao, Y & Loy, CC 2020, MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation. in A Vedaldi, H Bischof, T Brox & J-M Frahm (eds), Computer Vision – ECCV 2020 - 16th European Conference 2020, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12366 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 700-717, 16th European Conference on Computer Vision, ECCV 2020, Glasgow, United Kingdom, 8/23/20. https://doi.org/10.1007/978-3-030-58589-1_42

MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation. / Wang, Kaisiyuan; Wu, Qianyi; Song, Linsen et al.
Computer Vision – ECCV 2020 - 16th European Conference 2020, Proceedings. ed. / Andrea Vedaldi; Horst Bischof; Thomas Brox; Jan-Michael Frahm. Springer Science and Business Media Deutschland GmbH, 2020. p. 700-717 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12366 LNCS).

TY  - GEN
T1  - MEAD
T2  - 16th European Conference on Computer Vision, ECCV 2020
AU  - Wang, Kaisiyuan
AU  - Wu, Qianyi
AU  - Song, Linsen
AU  - Yang, Zhuoqian
AU  - Wu, Wayne
AU  - Qian, Chen
AU  - He, Ran
AU  - Qiao, Yu
AU  - Loy, Chen Change
PY  - 2020
Y1  - 2020
AB  - The synthesis of natural emotional reactions is an essential criterion in vivid talking-face video generation. This criterion is nevertheless seldom taken into consideration in previous works due to the absence of a large-scale, high-quality emotional audio-visual dataset. To address this issue, we build the Multi-view Emotional Audio-visual Dataset (MEAD), a talking-face video corpus featuring 60 actors and actresses talking with eight different emotions at three different intensity levels. High-quality audio-visual clips are captured at seven different view angles in a strictly-controlled environment. Together with the dataset, we release an emotional talking-face generation baseline that enables the manipulation of both emotion and its intensity. Our dataset could benefit a number of different research fields including conditional generation, cross-modal understanding and expression recognition. Code, model and data are publicly available on our project page: https://wywu.github.io/projects/MEAD/MEAD.html.
KW  - Generative adversarial networks
KW  - Representation disentanglement
KW  - Video generation
UR  - http://www.scopus.com/inward/record.url?scp=85097370836&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85097370836&partnerID=8YFLogxK
U2  - 10.1007/978-3-030-58589-1_42
DO  - 10.1007/978-3-030-58589-1_42
M3  - Conference contribution
AN  - SCOPUS:85097370836
SN  - 9783030585884
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 700
EP  - 717
BT  - Computer Vision – ECCV 2020 - 16th European Conference 2020, Proceedings
A2  - Vedaldi, Andrea
A2  - Bischof, Horst
A2  - Brox, Thomas
A2  - Frahm, Jan-Michael
PB  - Springer Science and Business Media Deutschland GmbH
Y2  - 23 August 2020 through 28 August 2020
ER  - 

Wang K, Wu Q, Song L, Yang Z, Wu W, Qian C et al. MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation. In Vedaldi A, Bischof H, Brox T, Frahm JM, editors, Computer Vision – ECCV 2020 - 16th European Conference 2020, Proceedings. Springer Science and Business Media Deutschland GmbH. 2020. p. 700-717. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-030-58589-1_42
