Video Diffusion Models: A Survey

Authors: Andrew Melnik, Michal Ljubljanac, Cong Lu, Qi Yan, Weiming Ren, Helge Ritter

What

This paper presents a comprehensive survey of diffusion models for video generation, focusing on their applications, architectures, methods for modeling temporal dynamics, and training procedures.

Why

This survey is important due to the rapid progress and transformative potential of diffusion models in video generation. It provides a valuable resource for researchers and practitioners by summarizing key advancements, identifying trends, and highlighting remaining challenges in the field.

How

The authors conduct a systematic literature review, analyzing and categorizing existing research on video diffusion models based on various criteria. They provide a taxonomy of applications, discuss architectural choices, and delve into methods for modeling temporal dynamics. The authors also review training strategies and evaluation metrics commonly employed in this domain.

Result

Key findings include the increasing use of latent diffusion models for efficient, high-resolution video generation, the dominance of UNet architectures extended with temporal layers for cross-frame consistency, and the prevalence of pre-trained text-to-image models as backbones for video generation and editing. The survey also highlights the challenges posed by limited labeled video data and the need for better representation of temporal dependencies in videos.
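The temporal extension of UNet backbones mentioned above is commonly realized as factorized spatiotemporal attention: frames first attend over their own spatial positions, then each spatial position attends across frames. Below is a minimal NumPy sketch of that factorization; the shapes, the identity Q/K/V projections, and the function names are illustrative choices, not the API of any specific model from the survey.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (batch, tokens, channels); identity Q/K/V projections for brevity.
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])  # (batch, tokens, tokens)
    return softmax(scores) @ x                                # (batch, tokens, channels)

def factorized_spatiotemporal_attention(video):
    # video: (frames, pixels, channels)
    # Spatial attention: each frame attends over its own pixels.
    x = self_attention(video)          # (T, P, C)
    # Temporal attention: each pixel location attends across frames.
    x = x.transpose(1, 0, 2)           # (P, T, C)
    x = self_attention(x)
    return x.transpose(1, 0, 2)        # back to (T, P, C)

video = np.random.randn(4, 16, 8)      # 4 frames, 16 "pixels", 8 channels
out = factorized_spatiotemporal_attention(video)
print(out.shape)  # (4, 16, 8)
```

Factorizing attention this way keeps the cost at O(T·P² + P·T²) rather than O((T·P)²) for full spatiotemporal attention, which is why many of the surveyed models adopt it.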

Limitations and Future Work

The authors identify several limitations and avenues for future work, including the need for larger, accurately labeled video datasets, improved methods for representing complex temporal relationships in videos, and exploration of alternative architectures capable of handling long-term temporal dependencies more effectively. Furthermore, the authors suggest exploring real-time video-to-video translation and more sophisticated video description methods beyond simple text labels.

Abstract

Diffusion generative models have recently become a robust technique for producing and modifying coherent, high-quality video. This survey offers a systematic overview of critical elements of diffusion models for video generation, covering applications, architectural choices, and the modeling of temporal dynamics. Recent advancements in the field are summarized and grouped into development trends. The survey concludes with an overview of remaining challenges and an outlook on the future of the field. Website: https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models