Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Shangchen Zhou; Peiqing Yang; Jianyi Wang; Yihang Luo; Chen Change Loy

doi:10.1109/CVPR52733.2024.00245

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, Chen Change Loy

Nanyang Technological University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

22 Citations (Scopus)

Abstract

Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which is complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latent across the entire sequences. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods in both synthetic and real-world benchmarks, as well as in AI-generated videos, showcasing impressive visual realism and temporal consistency.

Original language	English
Title of host publication	Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher	IEEE Computer Society
Pages	2535-2545
Number of pages	11
ISBN (Electronic)	9798350353006
DOIs	https://doi.org/10.1109/CVPR52733.2024.00245
Publication status	Published - 2024
Externally published	Yes
Event	2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States Duration: Jun 16 2024 → Jun 22 2024

Publication series

Name	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print)	1063-6919

Conference

Conference	2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/Territory	United States
City	Seattle
Period	6/16/24 → 6/22/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

ASJC Scopus Subject Areas

Software
Computer Vision and Pattern Recognition

Keywords

Video Diffusion Model
Video Super-Resolution

Access to Document

10.1109/CVPR52733.2024.00245

Cite this

Zhou, S., Yang, P., Wang, J., Luo, Y., & Loy, C. C. (2024). Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution. In Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 (pp. 2535-2545). (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). IEEE Computer Society. https://doi.org/10.1109/CVPR52733.2024.00245

Zhou, Shangchen ; Yang, Peiqing ; Wang, Jianyi et al. / Upscale-A-Video : Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution. Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024. IEEE Computer Society, 2024. pp. 2535-2545 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition).

@inproceedings{9912557c4bfb4474b0c060a4429ac0eb,

title = "Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution",

abstract = "Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which is complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latent across the entire sequences. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods in both synthetic and real-world benchmarks, as well as in AI-generated videos, showcasing impressive visual realism and temporal consistency.",

keywords = "Video Diffusion Model, Video Super-Resolution",

author = "Shangchen Zhou and Peiqing Yang and Jianyi Wang and Yihang Luo and Loy, \{Chen Change\}",

note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 ; Conference date: 16-06-2024 Through 22-06-2024",

year = "2024",

doi = "10.1109/CVPR52733.2024.00245",

language = "English",

series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",

publisher = "IEEE Computer Society",

pages = "2535--2545",

booktitle = "Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024",

address = "United States",

}

Zhou, S, Yang, P, Wang, J, Luo, Y & Loy, CC 2024, Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution. in Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, pp. 2535-2545, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, United States, 6/16/24. https://doi.org/10.1109/CVPR52733.2024.00245

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution. / Zhou, Shangchen; Yang, Peiqing; Wang, Jianyi et al.
Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024. IEEE Computer Society, 2024. p. 2535-2545 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

TY - GEN

T1 - Upscale-A-Video

T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024

AU - Zhou, Shangchen

AU - Yang, Peiqing

AU - Wang, Jianyi

AU - Luo, Yihang

AU - Loy, Chen Change

PY - 2024

Y1 - 2024

N2 - Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which is complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latent across the entire sequences. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods in both synthetic and real-world benchmarks, as well as in AI-generated videos, showcasing impressive visual realism and temporal consistency.

AB - Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which is complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latent across the entire sequences. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods in both synthetic and real-world benchmarks, as well as in AI-generated videos, showcasing impressive visual realism and temporal consistency.

KW - Video Diffusion Model

KW - Video Super-Resolution

UR - http://www.scopus.com/inward/record.url?scp=85196564849&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85196564849&partnerID=8YFLogxK

U2 - 10.1109/CVPR52733.2024.00245

DO - 10.1109/CVPR52733.2024.00245

M3 - Conference contribution

AN - SCOPUS:85196564849

T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

SP - 2535

EP - 2545

BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024

PB - IEEE Computer Society

Y2 - 16 June 2024 through 22 June 2024

ER -

Zhou S, Yang P, Wang J, Luo Y, Loy CC. Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution. In Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024. IEEE Computer Society. 2024. p. 2535-2545. (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). doi: 10.1109/CVPR52733.2024.00245

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Abstract

Publication series

Conference

Bibliographical note

ASJC Scopus Subject Areas

Keywords

Access to Document

Other files and links

Cite this