Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Authors: Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach
What
This paper introduces Stable Video Diffusion (SVD), a latent video diffusion model for high-resolution text-to-video and image-to-video generation.
Why
Video generation research has paid little attention to data selection. This paper demonstrates that systematic data curation has a large impact on the quality of generated videos and leverages it to achieve state-of-the-art text-to-video and image-to-video synthesis.
How
The authors develop a three-stage training strategy: 1) image pretraining, initializing from Stable Diffusion 2.1; 2) video pretraining on a large, systematically curated dataset at low resolution; and 3) high-resolution video finetuning on a smaller, high-quality dataset. The model uses EDM preconditioning and classifier-free guidance, and converts the image backbone into a video model by inserting temporal attention layers into the UNet.
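The inserted temporal layers can be pictured as attention blocks that operate along the frame axis at every spatial location and are blended into the pretrained spatial backbone. The block below is a minimal sketch of this idea, not the authors' implementation; the module structure, shape conventions, and the learnable `alpha` gate are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' code): a temporal attention block of the
# kind inserted into a pretrained spatial UNet to turn it into a video model.
# Shape conventions and the learnable gate `alpha` are assumptions made for clarity.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Gate initialized at zero so the block starts as an identity mapping and the
        # pretrained spatial model's behavior is preserved when video training begins.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * frames, channels, height, width) -- frames folded into the batch,
        # as is typical when reusing a 2D UNet for video.
        bt, c, h, w = x.shape
        b = bt // num_frames
        # Treat every spatial location as a separate sequence over the frame axis.
        tokens = x.reshape(b, num_frames, c, h * w).permute(0, 3, 1, 2)  # (b, h*w, t, c)
        tokens = tokens.reshape(b * h * w, num_frames, c)
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed)
        mixed = tokens + torch.tanh(self.alpha) * attn_out
        # Restore the original (batch * frames, channels, height, width) layout.
        out = mixed.reshape(b, h * w, num_frames, c).permute(0, 2, 3, 1).reshape(bt, c, h, w)
        return out
```

In training, a block like this would be interleaved with the existing spatial convolution and attention blocks of the image UNet, so that the temporal layers learn motion while building on the pretrained spatial representation.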
Result
The resulting SVD model generates high-resolution videos from text and image prompts and outperforms existing models in visual quality and motion representation. It also exhibits strong multi-view consistency: finetuned for multi-view synthesis, it outperforms specialized methods such as Zero123XL and SyncDreamer.
Limitations and Future Work
While SVD generates short clips well, long-form video synthesis remains limited by high computational cost, and generated videos occasionally contain too little motion. Future work could explore cascaded frame generation, dedicated video tokenizers, and diffusion distillation for faster inference and longer videos.
Abstract
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at https://github.com/Stability-AI/generative-models.
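For a quick sense of how the released image-to-video checkpoint is typically run, here is a minimal inference sketch assuming the Hugging Face diffusers port of the model (StableVideoDiffusionPipeline). The conditioning image path, resolution, and frame rate are illustrative choices; the authors' own reference code lives in the repository linked above.

```python
# Minimal image-to-video inference sketch, assuming the diffusers port of SVD.
# The checkpoint name and sampling settings are illustrative, not prescriptive.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # 25-frame image-to-video checkpoint
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Conditioning image; SVD was finetuned at 576x1024, so resize accordingly.
image = load_image("path/or/url/to/conditioning_image.png").resize((1024, 576))

# Generate a short clip conditioned on the input frame and write it to disk.
frames = pipe(image, decode_chunk_size=8, generator=torch.manual_seed(0)).frames[0]
export_to_video(frames, "svd_sample.mp4", fps=7)
```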