Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Jay Zhangjie Wu1, Yixiao Ge2, Xintao Wang2, Stan Weixian Lei1, Yuchao Gu1, Wynne Hsu4, Ying Shan2, Xiaohu Qie3, Mike Zheng Shou1
1Show Lab, National University of Singapore, 2ARC Lab, 3Tencent PCG, 4School of Computing, National University of Singapore

Tune-A-Video is a new method for text-to-video generation that uses a single text-video pair.


To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning. However, such a paradigm is computationally expensive. Humans have the amazing ability to learn new visual concepts from a single exemplar. We hereby study a new T2V generation problem, One-Shot Video Generation, where only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt a T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models are able to generate images that align well with verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via an efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video is capable of producing temporally coherent videos across various applications such as change of subject or background, attribute editing, and style transfer, demonstrating the versatility and effectiveness of our method.
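As a rough illustration of the Sparse-Causal Attention idea, the sketch below assumes that the queries of each frame attend only to the tokens of the first frame and of the immediately preceding frame. This page does not spell out the mechanism, so that attention pattern, the omitted (identity) query/key/value projections, and the function names are all assumptions made for illustration, not the authors' implementation.

```python
import math


def sparse_causal_frame_indices(num_frames):
    """For each frame i, return the two frames it attends to (assumed pattern):
    the first frame and the previous frame. Frame 0 reuses itself as 'previous'."""
    return [(0, max(i - 1, 0)) for i in range(num_frames)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_causal_attention(frames):
    """Toy scaled dot-product attention with a sparse-causal frame pattern.

    frames: list of frames; each frame is a list of token vectors (lists of floats).
    Learned projections are omitted (identity) to keep the sketch short.
    """
    outputs = []
    for i, (first, prev) in enumerate(sparse_causal_frame_indices(len(frames))):
        # Keys and values: tokens of the first frame concatenated with the previous frame.
        context = frames[first] + frames[prev]
        dim = len(frames[i][0])
        frame_out = []
        for query in frames[i]:
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                for key in context
            ]
            weights = softmax(scores)
            # Weighted sum of value vectors, one output channel at a time.
            frame_out.append(
                [sum(w * value[c] for w, value in zip(weights, context)) for c in range(dim)]
            )
        outputs.append(frame_out)
    return outputs


# Three frames, one 2-dimensional token each.
frames = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]
out = sparse_causal_attention(frames)
```

Under this assumed pattern, each frame sees a fixed number of context tokens regardless of video length, and changing a later frame leaves the outputs of earlier frames untouched.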


High-level overview of Tune-A-Video. Fine-Tuning: we first extend T2I models to T2V models initialized by pretrained weights of T2I models (left). Then, we perform one-shot tuning on a text-video pair, and obtain a One-Shot T2V model (right). Inference: a modified text prompt is used to generate novel videos.

Training pipeline of Tune-A-Video. Our method takes as input a video-text pair, and only updates the projection matrices (orange) in attention blocks.
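The selective update described in this caption can be pictured as a partition of model parameters into a small trainable set and a frozen remainder. The sketch below is only a schematic: the parameter names are hypothetical, and the choice of matching query projections (`to_q`) is an assumption, since the caption does not say which projection matrices are tuned.

```python
def split_trainable(param_names, trainable_patterns):
    """Partition parameter names into (trainable, frozen) lists.

    A parameter is treated as trainable if its name contains any of the
    given substrings; everything else stays frozen.
    """
    trainable, frozen = [], []
    for name in param_names:
        if any(pattern in name for pattern in trainable_patterns):
            trainable.append(name)
        else:
            frozen.append(name)
    return trainable, frozen


# Hypothetical parameter names, for illustration only.
names = [
    "down.0.resnet.conv1.weight",
    "down.0.attn.self.to_q.weight",
    "down.0.attn.self.to_k.weight",
    "down.0.attn.cross.to_q.weight",
    "down.0.attn.cross.to_v.weight",
]

# Assumption: only query projections in attention blocks are tuned.
trainable, frozen = split_trainable(names, trainable_patterns=("to_q",))
```

In a real training loop, the frozen names would correspond to parameters excluded from the optimizer, which keeps the one-shot tuning cheap relative to updating the whole network.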


Training Video

Generated Video


    title={Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
    author={Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
    journal={arXiv preprint arXiv:2212.11565},