Text2Performer: Text-Driven Human Video Generation

Yuming Jiang, Shuai Yang, Tong Liang Koh, Wayne Wu, Chen Change Loy, Ziwei Liu*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

19 Citations (Scopus)

Abstract

Text-driven content creation has evolved into a transformative technique that revolutionizes creativity. Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer. Compared to general text-driven video generation, human-centric video generation requires maintaining the appearance of the synthesized human while performing complex motions. In this work, we present Text2Performer to generate vivid human videos with articulated motions from texts. Text2Performer has two novel designs: 1) decomposed human representation and 2) diffusion-based motion sampler. First, we decompose the VQVAE latent space into human appearance and pose representations in an unsupervised manner by exploiting the nature of human videos. In this way, the appearance is well maintained across the generated frames. Then, we propose a continuous VQ-diffuser to sample a sequence of pose embeddings. Unlike existing VQ-based methods that operate in the discrete space, the continuous VQ-diffuser directly outputs continuous pose embeddings for better motion modeling. Finally, a motion-aware masking strategy is designed to mask the pose embeddings spatio-temporally to enhance temporal coherence. Moreover, to facilitate the task of text-driven human video generation, we contribute the Fashion-Text2Video dataset with manually annotated action labels and text descriptions. Extensive experiments demonstrate that Text2Performer generates high-quality human videos (up to 512 × 256 resolution) with diverse appearances and flexible motions. Our project page is https://yumingj.github.io/projects/Text2Performer.html
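The sketch below is not the authors' code; it is a minimal, hypothetical PyTorch illustration of the decomposed-representation idea stated in the abstract: a single appearance embedding is shared across all frames while a per-frame pose embedding carries the motion, so the performer's identity stays fixed while poses vary. The class name, dimensions, and frame size (DecomposedVideoDecoder, app_dim, pose_dim) are assumptions for illustration only.

```python
# Hypothetical sketch (not the Text2Performer implementation): one shared
# appearance code per video plus a per-frame pose code, fused to decode frames.
import torch
import torch.nn as nn

class DecomposedVideoDecoder(nn.Module):
    def __init__(self, app_dim=256, pose_dim=128, hidden=512, frame_shape=(3, 32, 64)):
        super().__init__()
        self.frame_shape = frame_shape
        out_dim = frame_shape[0] * frame_shape[1] * frame_shape[2]
        # Fuse the shared appearance embedding with each frame's pose embedding.
        self.fuse = nn.Sequential(
            nn.Linear(app_dim + pose_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),  # toy "frame" output
        )

    def forward(self, appearance, poses):
        # appearance: (B, app_dim)      -- one embedding per video (identity, clothing)
        # poses:      (B, T, pose_dim)  -- one embedding per frame (motion)
        B, T, _ = poses.shape
        app = appearance.unsqueeze(1).expand(B, T, -1)   # broadcast appearance over time
        frames = self.fuse(torch.cat([app, poses], dim=-1))
        return frames.view(B, T, *self.frame_shape)

# Usage: a pose sequence (e.g. sampled by a diffusion-based motion sampler)
# is decoded together with a single appearance code.
decoder = DecomposedVideoDecoder()
appearance = torch.randn(1, 256)            # sampled once per video
pose_sequence = torch.randn(1, 16, 128)     # one embedding per frame
video = decoder(appearance, pose_sequence)  # (1, 16, 3, 32, 64)
```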

Original language: English
Title of host publication: Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 22690-22700
Number of pages: 11
ISBN (Electronic): 9798350307184
DOIs
Publication status: Published - 2023
Externally published: Yes
Event: 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France
Duration: Oct 2, 2023 - Oct 6, 2023

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
ISSN (Print): 1550-5499

Conference

Conference: 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Country/Territory: France
City: Paris
Period: 10/2/23 - 10/6/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

ASJC Scopus Subject Areas

  • Software
  • Computer Vision and Pattern Recognition
