To replicate the success of text-to-image (T2I) generation, recent works on text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning. However, such a paradigm is computationally expensive. Humans have the remarkable ability to learn new visual concepts from a single exemplar. We therefore study a new T2V generation problem—One-Shot Video Generation, where only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt a T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models can generate images that align well with verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video produces temporally coherent videos across applications such as change of subject or background, attribute editing, and style transfer, demonstrating the versatility and effectiveness of our method.
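To give a concrete feel for the sparse-causal idea, the sketch below is our own simplified illustration (not the released implementation; the `SparseCausalAttention` name, tensor shapes, and attribute names are assumptions): each frame's query attends only to keys and values built from the first frame and the immediately preceding frame, which keeps computation sparse while anchoring content to the first frame.

```python
import torch
import torch.nn.functional as F
from torch import nn

class SparseCausalAttention(nn.Module):
    """Sketch: each frame attends only to the first and the previous frame."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)  # query projection
        self.to_k = nn.Linear(dim, dim, bias=False)  # key projection
        self.to_v = nn.Linear(dim, dim, bias=False)  # value projection
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, frames, tokens, dim) latent features of all video frames
        b, f, n, d = x.shape
        q = self.to_q(x)
        # Keys/values come from frame 0 and frame i-1 (frame 0 attends to itself).
        prev = torch.cat([x[:, :1], x[:, :-1]], dim=1)   # previous frame per position
        first = x[:, :1].expand(-1, f, -1, -1)            # first frame, repeated
        kv_src = torch.cat([first, prev], dim=2)          # (b, f, 2n, d)
        k, v = self.to_k(kv_src), self.to_v(kv_src)

        # Fold frames into the batch and split heads for scaled dot-product attention.
        def split(t):
            return t.reshape(b * f, -1, self.heads, d // self.heads).transpose(1, 2)
        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(b, f, n, d)
        return self.to_out(out)
```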
High-level overview of Tune-A-Video. Fine-tuning: we first extend T2I models to T2V models, initializing them with the pretrained T2I weights (left); we then perform one-shot tuning on a text-video pair to obtain a One-Shot T2V model (right). Inference: a modified text prompt is used to generate novel videos.
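One way to read the "initialized by pretrained weights" step: since the query/key/value projections act per spatial token, the extended attention can simply copy the T2I self-attention weights. The hypothetical helper below reuses the `SparseCausalAttention` sketch above; the `to_q`/`to_k`/`to_v`/`to_out` attribute names are our assumption, not a specific library's API.

```python
def inflate_t2i_self_attention(t2i_attn):
    """Initialize a sparse-causal attention block from a pretrained 2D self-attention.

    Assumes `t2i_attn` exposes to_q / to_k / to_v / to_out Linear layers;
    adapt the attribute names to whichever T2I implementation is used.
    """
    sc_attn = SparseCausalAttention(dim=t2i_attn.to_q.in_features)
    sc_attn.to_q.load_state_dict(t2i_attn.to_q.state_dict())
    sc_attn.to_k.load_state_dict(t2i_attn.to_k.state_dict())
    sc_attn.to_v.load_state_dict(t2i_attn.to_v.state_dict())
    sc_attn.to_out.load_state_dict(t2i_attn.to_out.state_dict())
    return sc_attn
```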
Training pipeline of Tune-A-Video. Our method takes a video-text pair as input and updates only the projection matrices (orange) in the attention blocks.
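A minimal sketch of that parameter selection, assuming a PyTorch UNet whose attention query projections have names containing `attn` and `to_q` (a hypothetical filter; the exact set of tuned projections, marked orange in the figure, should follow the paper's pipeline):

```python
import torch

def select_trainable_projections(unet, patterns=("attn", "to_q")):
    """Freeze the whole network, then unfreeze only attention projection matrices.

    `patterns` is a hypothetical name filter; match it to whichever projections
    the training pipeline actually tunes.
    """
    for p in unet.parameters():
        p.requires_grad_(False)
    trainable = []
    for name, p in unet.named_parameters():
        if all(pat in name for pat in patterns):
            p.requires_grad_(True)
            trainable.append(p)
    return trainable

# One-shot tuning then optimizes just these parameters on the single text-video pair, e.g.:
# optimizer = torch.optim.AdamW(select_trainable_projections(unet), lr=3e-5)
```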
A man is surfing a wave.
A man is surfing in the desert.
A lady in yellow dress is surfing.
A policeman is surfing.
A sloth is surfing.
An astronaut is surfing.
A man is doing push-ups in the gym.
A man in red clothes is doing push-ups in the gym.
An old man is doing push-ups in the gym.
A man is doing push-ups in the forest.
A gorilla is doing push-ups in the gym.
A panda is doing push-ups in the gym.
A man is dribbling a basketball on the court.
A student is dribbling a basketball on the court.
Kobe Bryant is dribbling a volleyball on the court.
Iron Man is dribbling a basketball on the court.
Peppa Pig is dribbling a basketball on the court.
A monkey is dribbling a football on the grass.
A dog is running on the grass.
A cat is running on the grass.
A lion is running on the lawn.
A corgi dog is running in the autumn forest.
A pig is running in Time Square.
A wolf is running on the street.
A young man is running on the beach.
Iron Man is running on the beach at dusk.
A girl is running on the lawn.
An old man is running on the mountain.
King Kong is running in the forest.
An astronaut is running on the sea, cartoon style.
@article{wu2022tuneavideo,
  title={Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
  author={Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2212.11565},
  year={2022}
}