FIFO-Diffusion: Generating Infinite Videos from Text without Training

Authors: Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han

What

This paper presents FIFO-Diffusion, a novel inference technique based on pretrained diffusion models for generating arbitrarily long text-conditional videos without additional training.

Why

This paper is significant because existing long video generation methods suffer from temporal inconsistency or high computational cost; FIFO-Diffusion enables the generation of high-quality, coherent videos of arbitrary length using only a pretrained model, with no additional training.

How

The authors introduce diagonal denoising, which concurrently processes a series of consecutive frames held in a queue at increasing noise levels: at every step, the fully denoised frame at the head is dequeued and a fresh noise frame is enqueued at the tail. Because diagonal denoising feeds the model frames at mixed noise levels, it introduces a training-inference discrepancy; to mitigate it, the authors further propose latent partitioning, which reduces the noise-level differences among frames processed together, and lookahead denoising, which lets noisier frames reference cleaner ones to improve denoising accuracy.
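The queue mechanics are easiest to see in code. Below is a minimal sketch under assumed interfaces: `model.denoise(latents, timesteps)` stands in for one denoising step of a pretrained video diffusion model applied at per-frame noise levels, and `timesteps` is assumed to be spaced one scheduler step apart per slot. Neither is the authors' actual API; this is an illustration of the idea, not their implementation.

```python
from collections import deque
import torch

def fifo_diffusion_sketch(model, timesteps, num_output_frames, latent_shape, device="cuda"):
    """Diagonal denoising over a FIFO queue of frame latents.

    `timesteps` holds one noise level per queue slot, ordered from the least
    noisy slot at the head to the noisiest slot at the tail.
    """
    f = len(timesteps)
    # Simplification: every slot starts from pure noise. In the paper, the queue
    # is first warmed up so that slot i actually sits at noise level timesteps[i].
    queue = deque(torch.randn(latent_shape, device=device) for _ in range(f))
    ts = torch.tensor(timesteps, device=device)

    outputs = []
    for _ in range(num_output_frames):
        latents = torch.stack(list(queue))    # (f, C, H, W); noise increases toward the tail
        latents = model.denoise(latents, ts)  # hypothetical call: one joint denoising step
        queue = deque(latents.unbind(0))
        outputs.append(queue.popleft())       # head frame is now fully denoised -> dequeue
        queue.append(torch.randn(latent_shape, device=device))  # enqueue fresh noise at the tail
    return torch.stack(outputs)
```

Because each iteration emits exactly one finished frame while consuming one new noise frame, the loop can run for as many frames as desired, which is what makes the method conceptually unbounded in length.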

Result

FIFO-Diffusion generates extremely long videos (over 10,000 frames) with consistent quality and smooth motion, outperforming existing methods such as FreeNoise and Gen-L-Video. It also transitions seamlessly between multiple prompts within a single video, enabling long-form content whose subject or scene evolves over time.

Limitations and Future Work

The authors acknowledge a remaining training-inference gap, since diagonal denoising alters the input distribution relative to what the base model was trained on. Future work includes incorporating the diagonal denoising paradigm into training to close this gap and further improve performance.

Abstract

We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue; our method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail. However, diagonal denoising is a double-edged sword as the frames near the tail can take advantage of cleaner ones by forward reference but such a strategy induces the discrepancy between training and inference. Hence, we introduce latent partitioning to reduce the training-inference gap and lookahead denoising to leverage the benefit of forward referencing. We have demonstrated the promising results and effectiveness of the proposed methods on existing text-to-video generation baselines.
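As a rough illustration of latent partitioning, the sketch below splits a lengthened queue of n * f slots into n contiguous blocks, so that each model call only sees frames whose noise levels span 1/n of the full range. As before, `model.denoise` is a hypothetical stand-in, not the released implementation.

```python
import torch

def latent_partitioning_step(model, queue_latents, queue_timesteps, n):
    """One diagonal denoising step over a queue of n * f frame latents,
    processed as n blocks of f frames to narrow the noise-level span per call."""
    f = queue_latents.shape[0] // n
    denoised = []
    for k in range(n):
        block = queue_latents[k * f:(k + 1) * f]
        ts = queue_timesteps[k * f:(k + 1) * f]
        denoised.append(model.denoise(block, ts))  # hypothetical single-step denoiser
    return torch.cat(denoised, dim=0)
```

Lookahead denoising additionally denoises each frame within a window that includes the cleaner frames ahead of it, keeping only the updates for the noisier half of the window; that refinement is omitted from the sketch.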