FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation

Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy*

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

10 Citations (Scopus)

Abstract

The remarkable efficacy of text-to-image diffusion models has motivated extensive exploration of their potential application in video domains. Zero-shot methods seek to extend image diffusion models to videos without requiring model training. Recent methods mainly focus on incorporating inter-frame correspondence into attention mechanisms. However, the soft constraint imposed on determining where to attend to valid features can sometimes be insufficient, resulting in temporal inconsistency. In this paper, we introduce FRESCO, which establishes intra-frame correspondence alongside inter-frame correspondence to form a more robust spatial-temporal constraint. This enhancement ensures a more consistent transformation of semantically similar content across frames. Beyond mere attention guidance, our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video, significantly improving the visual coherence of the resulting translated videos. Extensive experiments demonstrate the effectiveness of our proposed framework in producing high-quality, coherent videos, marking a notable improvement over existing zero-shot methods.
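The "explicit update of features" mentioned in the abstract can be illustrated with a toy sketch. The snippet below is not the paper's actual algorithm; it is a hedged, minimal analogue in which current-frame features are optimized by gradient descent toward flow-warped previous-frame features in non-occluded regions, which is the general idea behind temporal-consistency feature optimization. The function name `update_features` and all parameters are illustrative assumptions.

```python
import numpy as np

def update_features(f_cur, f_warp, mask, lr=0.2, steps=50):
    """Toy temporal-consistency feature update (illustrative, not FRESCO itself).

    f_cur  : current-frame features, shape (H, W)
    f_warp : previous-frame features warped to the current frame by optical flow
    mask   : 1 where the warp is valid (non-occluded), 0 elsewhere
    """
    f = f_cur.copy()
    for _ in range(steps):
        # Gradient of the masked L2 loss 0.5 * ||mask * (f - f_warp)||^2
        grad = mask * (f - f_warp)
        f -= lr * grad  # nudge features toward the warped reference
    return f
```

With a full-valid mask the features converge to the warped reference; with a zero mask (fully occluded) they are left untouched, so occluded regions keep their original content.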

Original language: English
Pages (from-to): 8703-8712
Number of pages: 10
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
DOIs
Publication status: Published - 2024
Externally published: Yes
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: Jun 16, 2024 - Jun 22, 2024

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

ASJC Scopus Subject Areas

  • Software
  • Computer Vision and Pattern Recognition

Keywords

  • diffusion
  • inter-frame consistency
  • intra-frame consistency
  • video-to-video translation
