Lumiere: A Space-Time Diffusion Model for Video Generation

Authors: Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri

What

This paper introduces Lumiere, a text-to-video diffusion model that synthesizes videos with realistic and coherent motion by generating the entire temporal duration at once using a novel Space-Time U-Net architecture.

Why

This paper addresses a critical challenge in video synthesis: generating videos with realistic and coherent motion over extended durations. It departs from the prevalent cascaded approach of synthesizing distant keyframes and then filling them in with temporal super-resolution, proposing instead an architecture that significantly improves the quality and coherence of generated motion.

How

The authors propose a Space-Time U-Net (STUNet) that processes and generates the entire video in a single pass by downsampling and upsampling it in both space and time. The architecture builds on a pre-trained text-to-image diffusion model and applies MultiDiffusion during spatial super-resolution, blending overlapping temporal windows to keep the upscaled video temporally consistent.
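
To make the space-time down- and up-sampling concrete, below is a minimal PyTorch-style sketch, assuming factorized spatial (2D) and temporal (1D) convolutions; the module names, channel sizes, and kernel/stride choices are illustrative assumptions rather than the paper's exact architecture.

    # Minimal sketch of STUNet-style space-time down/up-sampling (assumed layout).
    import torch
    import torch.nn as nn


    class SpaceTimeDown(nn.Module):
        """Downsample a video tensor (B, C, T, H, W) in both space and time."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            # Spatial conv (would inherit pre-trained text-to-image weights);
            # the temporal conv is the newly added component.
            self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                     stride=(1, 2, 2), padding=(0, 1, 1))
            self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1),
                                      stride=(2, 1, 1), padding=(1, 0, 0))

        def forward(self, x):
            return self.temporal(self.spatial(x))


    class SpaceTimeUp(nn.Module):
        """Upsample back in space and time (mirror of SpaceTimeDown)."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.temporal = nn.ConvTranspose3d(in_ch, in_ch, kernel_size=(4, 1, 1),
                                               stride=(2, 1, 1), padding=(1, 0, 0))
            self.spatial = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=(1, 4, 4),
                                              stride=(1, 2, 2), padding=(0, 1, 1))

        def forward(self, x):
            return self.spatial(self.temporal(x))


    if __name__ == "__main__":
        video = torch.randn(1, 8, 16, 64, 64)   # (batch, channels, frames, H, W)
        down, up = SpaceTimeDown(8, 16), SpaceTimeUp(16, 8)
        coarse = down(video)                     # compact space-time representation
        print(coarse.shape)                      # torch.Size([1, 16, 8, 32, 32])
        print(up(coarse).shape)                  # torch.Size([1, 8, 16, 64, 64])

In the full model, blocks like these sit inside a pre-trained text-to-image U-Net; the paper keeps the pre-trained weights fixed and trains the newly added temporal components.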

Result

Lumiere demonstrates state-of-the-art results in text-to-video generation, producing high-quality videos with superior motion coherence and visual fidelity compared to existing methods. It also exhibits strong performance in various downstream tasks, including image-to-video generation, video inpainting, and stylized generation.

Limitations and Future Work

The paper acknowledges limitations in generating multi-shot videos or those involving scene transitions. Future work could explore extending Lumiere to address these limitations and investigate its application to latent video diffusion models.

Abstract

We introduce Lumiere — a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion — a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution — an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
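
As a rough illustration of the MultiDiffusion step mentioned in the How section, the sketch below denoises overlapping temporal windows and averages the per-frame predictions so that neighboring windows agree; the window size, stride, and the sr_denoise stand-in for one denoising step of a spatial super-resolution model are illustrative assumptions, not the paper's exact procedure.

    # Minimal sketch of MultiDiffusion-style aggregation over temporal windows (assumed).
    import torch


    def multidiffusion_step(video, sr_denoise, window=8, stride=4):
        """Denoise overlapping temporal windows of a (B, C, T, H, W) video and
        average the per-frame predictions so neighboring windows stay consistent."""
        B, C, T, H, W = video.shape
        assert T >= window, "sketch assumes the video spans at least one window"
        out = torch.zeros_like(video)
        weight = torch.zeros(1, 1, T, 1, 1, device=video.device, dtype=video.dtype)
        starts = list(range(0, T - window + 1, stride))
        if starts[-1] != T - window:
            starts.append(T - window)            # make sure the tail is covered
        for s in starts:
            out[:, :, s:s + window] += sr_denoise(video[:, :, s:s + window])
            weight[:, :, s:s + window] += 1.0
        return out / weight                      # per-frame average over windows


    if __name__ == "__main__":
        noisy = torch.randn(1, 3, 16, 128, 128)  # (batch, channels, frames, H, W)
        # Identity function as a stand-in for a real super-resolution denoiser.
        refined = multidiffusion_step(noisy, sr_denoise=lambda x: x)
        print(refined.shape)                     # torch.Size([1, 3, 16, 128, 128])

Averaging overlapping predictions at each diffusion step is what keeps window boundaries from drifting apart, since every frame in an overlap region is constrained by more than one window.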