StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation

Yuhan Wang; Liming Jiang; Chen Change Loy

doi:10.1109/ICCV51070.2023.02089

Yuhan Wang*, Liming Jiang, Chen Change Loy

*Corresponding author for this work

Nanyang Technological University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

Unconditional video generation is a challenging task that involves synthesizing high-quality videos that are both coherent and of extended duration. To address this challenge, researchers have used pretrained StyleGAN image generators for high-quality frame synthesis and focused on motion generator design. The motion generator is trained in an autoregressive manner using heavy 3D convolutional discriminators to ensure motion coherence during video generation. In this paper, we introduce a novel motion generator design that uses a learning-based inversion network for GAN. The encoder in our method captures rich and smooth priors from encoding images to latents, and given the latent of an initially generated frame as guidance, our method can generate smooth future latent by modulating the inversion encoder temporally. Our method enjoys the advantage of sparse training and naturally constrains the generation space of our motion generator with the inversion network guided by the initial frame, eliminating the need for heavy discriminators. Moreover, our method supports style transfer with simple fine-tuning when the encoder is paired with a pretrained StyleGAN generator. Extensive experiments conducted on various benchmarks demonstrate the superiority of our method in generating long and high-resolution videos with decent single-frame quality and temporal consistency. Code is available at https://github.com/johannwyh/StyleInV.

Original language	English
Title of host publication	Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	22794-22804
Number of pages	11
ISBN (Electronic)	9798350307184
DOIs	https://doi.org/10.1109/ICCV51070.2023.02089
Publication status	Published - 2023
Externally published	Yes
Event	2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France Duration: Oct 2 2023 → Oct 6 2023

Publication series

Name	Proceedings of the IEEE International Conference on Computer Vision
ISSN (Print)	1550-5499

Conference

Conference	2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Country/Territory	France
City	Paris
Period	10/2/23 → 10/6/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

ASJC Scopus Subject Areas

Software
Computer Vision and Pattern Recognition

Access to Document

10.1109/ICCV51070.2023.02089

Cite this

Wang, Y., Jiang, L., & Loy, C. C. (2023). StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation. In Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 (pp. 22794-22804). (Proceedings of the IEEE International Conference on Computer Vision). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICCV51070.2023.02089

Wang, Yuhan ; Jiang, Liming ; Loy, Chen Change. / StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation. Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023. Institute of Electrical and Electronics Engineers Inc., 2023. pp. 22794-22804 (Proceedings of the IEEE International Conference on Computer Vision).

@inproceedings{3692714d5bc74de096e253d083200575,

title = "StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation",

abstract = "Unconditional video generation is a challenging task that involves synthesizing high-quality videos that are both coherent and of extended duration. To address this challenge, researchers have used pretrained StyleGAN image generators for high-quality frame synthesis and focused on motion generator design. The motion generator is trained in an autoregressive manner using heavy 3D convolutional discriminators to ensure motion coherence during video generation. In this paper, we introduce a novel motion generator design that uses a learning-based inversion network for GAN. The encoder in our method captures rich and smooth priors from encoding images to latents, and given the latent of an initially generated frame as guidance, our method can generate smooth future latent by modulating the inversion encoder temporally. Our method enjoys the advantage of sparse training and naturally constrains the generation space of our motion generator with the inversion network guided by the initial frame, eliminating the need for heavy discriminators. Moreover, our method supports style transfer with simple fine-tuning when the encoder is paired with a pretrained StyleGAN generator. Extensive experiments conducted on various benchmarks demonstrate the superiority of our method in generating long and high-resolution videos with decent single-frame quality and temporal consistency. Code is available at https://github.com/johannwyh/StyleInV.",

author = "Yuhan Wang and Liming Jiang and Loy, {Chen Change}",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 ; Conference date: 02-10-2023 Through 06-10-2023",

year = "2023",

doi = "10.1109/ICCV51070.2023.02089",

language = "English",

series = "Proceedings of the IEEE International Conference on Computer Vision",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "22794--22804",

booktitle = "Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023",

address = "United States",

}

Wang, Y, Jiang, L & Loy, CC 2023, StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation. in Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023. Proceedings of the IEEE International Conference on Computer Vision, Institute of Electrical and Electronics Engineers Inc., pp. 22794-22804, 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, 10/2/23. https://doi.org/10.1109/ICCV51070.2023.02089

StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation. / Wang, Yuhan; Jiang, Liming; Loy, Chen Change.
Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023. Institute of Electrical and Electronics Engineers Inc., 2023. p. 22794-22804 (Proceedings of the IEEE International Conference on Computer Vision).

TY - GEN

T1 - StyleInV

T2 - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023

AU - Wang, Yuhan

AU - Jiang, Liming

AU - Loy, Chen Change

PY - 2023

Y1 - 2023

N2 - Unconditional video generation is a challenging task that involves synthesizing high-quality videos that are both coherent and of extended duration. To address this challenge, researchers have used pretrained StyleGAN image generators for high-quality frame synthesis and focused on motion generator design. The motion generator is trained in an autoregressive manner using heavy 3D convolutional discriminators to ensure motion coherence during video generation. In this paper, we introduce a novel motion generator design that uses a learning-based inversion network for GAN. The encoder in our method captures rich and smooth priors from encoding images to latents, and given the latent of an initially generated frame as guidance, our method can generate smooth future latent by modulating the inversion encoder temporally. Our method enjoys the advantage of sparse training and naturally constrains the generation space of our motion generator with the inversion network guided by the initial frame, eliminating the need for heavy discriminators. Moreover, our method supports style transfer with simple fine-tuning when the encoder is paired with a pretrained StyleGAN generator. Extensive experiments conducted on various benchmarks demonstrate the superiority of our method in generating long and high-resolution videos with decent single-frame quality and temporal consistency. Code is available at https://github.com/johannwyh/StyleInV.

UR - http://www.scopus.com/inward/record.url?scp=85176566828&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85176566828&partnerID=8YFLogxK

U2 - 10.1109/ICCV51070.2023.02089

DO - 10.1109/ICCV51070.2023.02089

M3 - Conference contribution

AN - SCOPUS:85176566828

T3 - Proceedings of the IEEE International Conference on Computer Vision

SP - 22794

EP - 22804

BT - Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 2 October 2023 through 6 October 2023

ER -

Wang Y, Jiang L, Loy CC. StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation. In Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023. Institute of Electrical and Electronics Engineers Inc. 2023. p. 22794-22804. (Proceedings of the IEEE International Conference on Computer Vision). doi: 10.1109/ICCV51070.2023.02089
