Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Jay Zhangjie Wu1 Yixiao Ge2 Xintao Wang2 Stan Weixian Lei1 Yuchao Gu1 Yufei Shi1 Wynne Hsu4 Ying Shan2 Xiaohu Qie3 Mike Zheng Shou1

1Show Lab, National University of Singapore 2ARC Lab, 3Tencent PCG
4School of Computing, National University of Singapore


 

 

Abstract

To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, in which only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently yields surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and quantitative experiments demonstrate the remarkable ability of our method across various applications.

 

Method

Given a text-video pair (e.g., “a man is skiing”) as input, our method leverages pretrained T2I diffusion models for T2V generation. During fine-tuning, we update the projection matrices in the attention blocks using the standard diffusion training loss. During inference, we sample a novel video starting from the latent noise obtained by inverting the input video, guided by an edited prompt (e.g., “Spider Man is surfing on the beach, cartoon style”).
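The inference step described above (invert the input video to latent noise, then sample from that noise) can be illustrated with a minimal, dependency-free sketch of deterministic DDIM inversion and sampling. This is not the authors' implementation. The function names, the linear beta schedule, the flat-list latents and the toy noise predictor are all assumptions made for illustration, and text conditioning on the edited prompt is omitted.

```python
import math

def make_alphas_cumprod(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alphas_cumprod, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update moving a latent from noise level a_from to a_to."""
    # Estimate the clean latent implied by the current latent and predicted noise.
    x0_pred = [(xi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
               for xi, ei in zip(x, eps)]
    # Re-noise that estimate to the target level using the same predicted noise.
    return [math.sqrt(a_to) * x0 + math.sqrt(1.0 - a_to) * ei
            for x0, ei in zip(x0_pred, eps)]

def ddim_invert(x0, eps_model, alphas_cumprod):
    """Run DDIM steps toward higher noise: clean latent -> inverted noisy latent."""
    levels = [1.0] + list(alphas_cumprod)  # level 0 is the clean latent
    x = list(x0)
    for t in range(len(alphas_cumprod)):
        x = ddim_step(x, eps_model(x, t), levels[t], levels[t + 1])
    return x

def ddim_sample(x_T, eps_model, alphas_cumprod):
    """Run DDIM steps toward lower noise: inverted latent -> clean latent."""
    levels = [1.0] + list(alphas_cumprod)
    x = list(x_T)
    for t in reversed(range(len(alphas_cumprod))):
        x = ddim_step(x, eps_model(x, t), levels[t + 1], levels[t])
    return x

def toy_eps_model(x, t):
    """Hypothetical stand-in for the denoising network; depends on t only."""
    return [0.05 * (t + 1) / 50 for _ in x]

alphas = make_alphas_cumprod(50)
latent = [0.5, -1.0, 2.0]                       # stand-in for a video latent
inverted = ddim_invert(latent, toy_eps_model, alphas)
reconstructed = ddim_sample(inverted, toy_eps_model, alphas)
```

Because the toy predictor ignores its input latent, sampling from `inverted` retraces the inversion path and recovers `latent` up to floating-point error. With a real, input-dependent denoising network the inversion is only approximate. In the setting described above, the sampling pass would additionally be conditioned on the edited prompt, so the output keeps the structure of the input video while changing its content.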

 

Results


Pretrained T2I (Stable Diffusion)


Pretrained T2I (personalized)


Pretrained T2I (pose control)

 

Bibtex


    @article{wu2022tuneavideo,
        title={Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
        author={Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Shi, Yufei and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
        journal={arXiv preprint arXiv:2212.11565},
        year={2022}
    }