LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models

Yaohui Wang; Xinyuan Chen; Xin Ma; Shangchen Zhou; Ziqi Huang; Yi Wang; Ceyuan Yang; Yinan He; Jiashuo Yu; Peiqing Yang; Yuwei Guo; Tianxing Wu; Chenyang Si; Yuming Jiang; Cunjian Chen; Chen Change Loy; Bo Dai; Dahua Lin; Yu Qiao; Ziwei Liu

doi:10.1007/s11263-024-02295-1

Corresponding authors: Yaohui Wang, Dahua Lin, Yu Qiao, Ziwei Liu

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously (a) accomplish the synthesis of visually realistic and temporally coherent videos while (b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: (1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. (2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications. Project page: https://github.com/Vchitect/LaVie/.
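The first insight in the abstract, temporal self-attention combined with rotary positional encoding, can be illustrated with a small sketch. This is not the LaVie implementation: the function names `rotary_embed` and `temporal_self_attention` are hypothetical, the example handles a single spatial location, and it assumes identity query/key/value projections, whereas a trained layer would use learned projection matrices.

```python
import math

def rotary_embed(frames, base=10000.0):
    """Rotate channel pairs (2i, 2i+1) of each frame vector by an angle
    proportional to the frame index (rotary positional encoding)."""
    dim = len(frames[0])
    assert dim % 2 == 0, "feature dimension must be even"
    out = []
    for t, vec in enumerate(frames):
        rotated = []
        for i in range(0, dim, 2):
            # Frequency decreases for later channel pairs.
            angle = t * base ** (-i / dim)
            c, s = math.cos(angle), math.sin(angle)
            rotated += [vec[i] * c - vec[i + 1] * s,
                        vec[i] * s + vec[i + 1] * c]
        out.append(rotated)
    return out

def temporal_self_attention(frames):
    """Self-attention over the frame axis for one spatial location.

    Assumption for illustration: queries, keys, and values are the input
    features themselves (identity projections).
    """
    dim = len(frames[0])
    q = k = rotary_embed(frames)  # positions enter only through q and k
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim)
                  for kj in k]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Each output frame is a weighted average of the input frames.
        out.append([sum(w / total * v[c] for w, v in zip(weights, frames))
                    for c in range(dim)])
    return out

# Three frames with four feature channels each (made-up values).
frames = [[1.0, 0.0, 0.5, -0.5],
          [0.2, 0.3, -1.0, 0.4],
          [0.0, 1.0, 0.7, 0.1]]
mixed = temporal_self_attention(frames)
```

Because the rotation angle grows linearly with the frame index, the dot product between a rotated query and a rotated key depends only on how many frames apart they are, which is the usual reason for pairing rotary encoding with attention along a sequence axis.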

Original language	English
Pages (from-to)	3059-3078
Number of pages	20
Journal	International Journal of Computer Vision
Volume	133
Issue number	5
DOIs	https://doi.org/10.1007/s11263-024-02295-1
Publication status	Published - May 2025
Externally published	Yes

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.

ASJC Scopus Subject Areas

Software
Computer Vision and Pattern Recognition
Artificial Intelligence

Keywords

Diffusion models
Generative modeling
Video generation

Cite this

Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., Guo, Y., Wu, T., Si, C., Jiang, Y., Chen, C., Loy, C. C., Dai, B., Lin, D., Qiao, Y., & Liu, Z. (2025). LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models. International Journal of Computer Vision, 133(5), 3059-3078. https://doi.org/10.1007/s11263-024-02295-1

@article{150d31cf6e654affaf5a4ae35ea1494a,

title = "LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models",

abstract = "This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously (a) accomplish the synthesis of visually realistic and temporally coherent videos while (b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: (1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. (2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications. Project page: https://github.com/Vchitect/LaVie/.",

keywords = "Diffusion models, Generative modeling, Video generation",

author = "Yaohui Wang and Xinyuan Chen and Xin Ma and Shangchen Zhou and Ziqi Huang and Yi Wang and Ceyuan Yang and Yinan He and Jiashuo Yu and Peiqing Yang and Yuwei Guo and Tianxing Wu and Chenyang Si and Yuming Jiang and Cunjian Chen and Chen Change Loy and Bo Dai and Dahua Lin and Yu Qiao and Ziwei Liu",

note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.",

year = "2025",

month = may,

doi = "10.1007/s11263-024-02295-1",

language = "English",

volume = "133",

pages = "3059--3078",

journal = "International Journal of Computer Vision",

issn = "0920-5691",

publisher = "Springer Netherlands",

number = "5",

}

TY - JOUR

T1 - LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models

AU - Wang, Yaohui

AU - Chen, Xinyuan

AU - Ma, Xin

AU - Zhou, Shangchen

AU - Huang, Ziqi

AU - Wang, Yi

AU - Yang, Ceyuan

AU - He, Yinan

AU - Yu, Jiashuo

AU - Yang, Peiqing

AU - Guo, Yuwei

AU - Wu, Tianxing

AU - Si, Chenyang

AU - Jiang, Yuming

AU - Chen, Cunjian

AU - Loy, Chen Change

AU - Dai, Bo

AU - Lin, Dahua

AU - Qiao, Yu

AU - Liu, Ziwei

N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.

PY - 2025/5

Y1 - 2025/5

AB - This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously (a) accomplish the synthesis of visually realistic and temporally coherent videos while (b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: (1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. (2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications. Project page: https://github.com/Vchitect/LaVie/.

KW - Diffusion models

KW - Generative modeling

KW - Video generation

UR - http://www.scopus.com/inward/record.url?scp=105003250612&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=105003250612&partnerID=8YFLogxK

U2 - 10.1007/s11263-024-02295-1

DO - 10.1007/s11263-024-02295-1

M3 - Article

AN - SCOPUS:105003250612

SN - 0920-5691

VL - 133

SP - 3059

EP - 3078

JO - International Journal of Computer Vision

JF - International Journal of Computer Vision

IS - 5

ER -
