VideoBooth: Diffusion-based Video Generation with Image Prompts

Yuming Jiang; Tianxing Wu; Shuai Yang; Chenyang Si; Dahua Lin; Yu Qiao; Chen Change Loy; Ziwei Liu

doi:10.1109/CVPR52733.2024.00639

VideoBooth: Diffusion-based Video Generation with Image Prompts

Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu^*

^*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

14 Citations (Scopus)

Abstract

Text-driven video generation witnesses rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. Specifically, we propose a feed-forward framework VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial in-formation refines the details in the first frame and then it is propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that Video Booth achieves state-of-the-art performance in gener-ating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with only feed-forward passes.

Original language	English
Title of host publication	Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher	IEEE Computer Society
Pages	6689-6700
Number of pages	12
ISBN (Electronic)	9798350353006
DOIs	https://doi.org/10.1109/CVPR52733.2024.00639
Publication status	Published - 2024
Externally published	Yes
Event	2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States Duration: Jun 16 2024 → Jun 22 2024

Publication series

Name	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print)	1063-6919

Conference

Conference	2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/Territory	United States
City	Seattle
Period	6/16/24 → 6/22/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

ASJC Scopus Subject Areas

Software
Computer Vision and Pattern Recognition

Keywords

Diffusion Models
Video Generation

Access to Document

10.1109/CVPR52733.2024.00639

Cite this

Jiang, Y., Wu, T., Yang, S., Si, C., Lin, D., Qiao, Y., Loy, C. C., & Liu, Z. (2024). VideoBooth: Diffusion-based Video Generation with Image Prompts. In Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 (pp. 6689-6700). (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). IEEE Computer Society. https://doi.org/10.1109/CVPR52733.2024.00639

@inproceedings{1926c8a0334a4710b68ffa9f442400d4,

title = "VideoBooth: Diffusion-based Video Generation with Image Prompts",

abstract = "Text-driven video generation witnesses rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. Specifically, we propose a feed-forward framework VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial in-formation refines the details in the first frame and then it is propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that Video Booth achieves state-of-the-art performance in gener-ating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with only feed-forward passes.",

keywords = "Diffusion Models, Video Generation",

author = "Yuming Jiang and Tianxing Wu and Shuai Yang and Chenyang Si and Dahua Lin and Yu Qiao and Loy, \{Chen Change\} and Ziwei Liu",

note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 ; Conference date: 16-06-2024 Through 22-06-2024",

year = "2024",

doi = "10.1109/CVPR52733.2024.00639",

language = "English",

series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",

publisher = "IEEE Computer Society",

pages = "6689--6700",

booktitle = "Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024",

address = "United States",

}

Jiang, Y, Wu, T, Yang, S, Si, C, Lin, D, Qiao, Y, Loy, CC & Liu, Z 2024, VideoBooth: Diffusion-based Video Generation with Image Prompts. in Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, pp. 6689-6700, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, United States, 6/16/24. https://doi.org/10.1109/CVPR52733.2024.00639

VideoBooth: Diffusion-based Video Generation with Image Prompts. / Jiang, Yuming; Wu, Tianxing; Yang, Shuai et al.
Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024. IEEE Computer Society, 2024. p. 6689-6700 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

TY - GEN

T1 - VideoBooth

T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024

AU - Jiang, Yuming

AU - Wu, Tianxing

AU - Yang, Shuai

AU - Si, Chenyang

AU - Lin, Dahua

AU - Qiao, Yu

AU - Loy, Chen Change

AU - Liu, Ziwei

PY - 2024

Y1 - 2024

N2 - Text-driven video generation witnesses rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. Specifically, we propose a feed-forward framework VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial in-formation refines the details in the first frame and then it is propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that Video Booth achieves state-of-the-art performance in gener-ating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with only feed-forward passes.

AB - Text-driven video generation witnesses rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. Specifically, we propose a feed-forward framework VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial in-formation refines the details in the first frame and then it is propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that Video Booth achieves state-of-the-art performance in gener-ating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with only feed-forward passes.

KW - Diffusion Models

KW - Video Generation

UR - http://www.scopus.com/inward/record.url?scp=85201181824&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85201181824&partnerID=8YFLogxK

U2 - 10.1109/CVPR52733.2024.00639

DO - 10.1109/CVPR52733.2024.00639

M3 - Conference contribution

AN - SCOPUS:85201181824

T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

SP - 6689

EP - 6700

BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024

PB - IEEE Computer Society

Y2 - 16 June 2024 through 22 June 2024

ER -

Jiang Y, Wu T, Yang S, Si C, Lin D, Qiao Y et al. VideoBooth: Diffusion-based Video Generation with Image Prompts. In Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024. IEEE Computer Society. 2024. p. 6689-6700. (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). doi: 10.1109/CVPR52733.2024.00639

VideoBooth: Diffusion-based Video Generation with Image Prompts

Abstract

Publication series

Conference

Bibliographical note

ASJC Scopus Subject Areas

Keywords

Access to Document

Other files and links

Cite this