MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators

Authors: Shenghai Yuan, Jinfa Huang, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo

What

This paper introduces MagicTime, a novel approach for generating metamorphic time-lapse videos by incorporating physical knowledge into text-to-video (T2V) generation models. It leverages time-lapse videos, which capture complete object transformations, to deepen the model's understanding of real-world physics and enable the generation of videos depicting complex phenomena such as melting, blooming, or construction.

Why

This paper addresses a significant limitation of current text-to-video generation models: they do not adequately encode real-world physical knowledge, which restricts them to simple motions and prevents them from depicting complex, transformative processes. MagicTime tackles this issue by incorporating time-lapse video data and specialized training strategies, paving the way for more realistic and dynamic video generation.

How

The authors propose MagicTime, a framework that adapts pre-trained text-to-video diffusion models to generate metamorphic time-lapse videos. Its key components are:

1) MagicAdapter: decouples spatial and temporal training to encode physical knowledge from metamorphic videos.
2) Dynamic Frames Extraction: adapts sampling to the characteristics of time-lapse videos and prioritizes metamorphic features (see the sketch after this list).
3) Magic Text-Encoder: refines prompt understanding for metamorphic videos.

Additionally, the authors create ChronoMagic, a new dataset of time-lapse videos with detailed captions, to train and evaluate MagicTime.
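The summary does not give the exact sampling procedure, so the following is a minimal Python sketch of what Dynamic Frames Extraction plausibly involves, assuming frames are sampled with a uniform stride across the whole clip so that a long transformation (e.g., seed to bloom) is covered end to end. The function name extract_dynamic_frames and its parameters are illustrative, not taken from the paper.

```python
import cv2
import numpy as np

def extract_dynamic_frames(video_path: str, num_frames: int = 16) -> np.ndarray:
    """Sample frames spread across the entire clip so that a long
    time-lapse transformation is covered end to end, rather than a
    dense window that only sees one phase of the process."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced indices over the full video, always including
    # the first and last frames of the transformation.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)
```

Uniform striding is the key contrast with ordinary T2V training, which typically samples short contiguous windows and would rarely see a complete metamorphic process in a single training example.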

Result

MagicTime generates high-quality metamorphic videos that capture complex transformations and align with textual prompts. It outperforms existing text-to-video generation methods in both qualitative and quantitative evaluations, demonstrating superior visual quality, frame consistency, and text alignment. The authors also conduct ablation studies to validate the contribution of each component in MagicTime.

Limitations and Future Work

The authors acknowledge limitations in evaluating generative models for metamorphic videos due to the lack of established metrics beyond FID, FVD, and CLIP Similarity. They plan to investigate more comprehensive evaluation metrics in future work. Additionally, the authors are exploring the integration of MagicTime with DiT-based architectures, such as Open-Sora-Plan, to further enhance metamorphic video generation capabilities.
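Of the metrics named above, CLIP Similarity is the most straightforward to reproduce. Below is a minimal sketch assuming the common recipe of averaging per-frame image-text cosine similarity with an off-the-shelf CLIP checkpoint; the paper may use a different checkpoint or aggregation.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_similarity(frames, prompt: str) -> float:
    """Average cosine similarity between the prompt and each video frame.
    `frames` is a list of PIL images (or HxWx3 uint8 arrays)."""
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()
```

A per-frame metric like this measures text alignment but says nothing about temporal coherence, which is part of why the authors consider the existing metric set insufficient for metamorphic videos.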

Abstract

Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions. A largely overlooked problem in T2V is that existing models have not adequately encoded physical knowledge of the real world, so generated videos tend to have limited motion and poor variation. In this paper, we propose MagicTime, a metamorphic time-lapse video generation model that learns real-world physics knowledge from time-lapse videos and implements metamorphic generation. First, we design a MagicAdapter scheme to decouple spatial and temporal training, encode more physical knowledge from metamorphic videos, and transform pre-trained T2V models to generate metamorphic videos. Second, we introduce a Dynamic Frames Extraction strategy to adapt to metamorphic time-lapse videos, which have a wider variation range and cover dramatic object metamorphic processes, thus embodying more physical knowledge than general videos. Finally, we introduce a Magic Text-Encoder to improve the understanding of metamorphic video prompts. Furthermore, we create a time-lapse video-text dataset called ChronoMagic, specifically curated to unlock the metamorphic video generation ability. Extensive experiments demonstrate the superiority and effectiveness of MagicTime for generating high-quality and dynamic metamorphic videos, suggesting that time-lapse video generation is a promising path toward building metamorphic simulators of the physical world.
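The abstract describes MagicAdapter as decoupling spatial and temporal training on top of a pre-trained T2V backbone, but its internals are not given here. The sketch below only illustrates the generic adapter pattern such schemes build on: a small trainable residual module attached to a frozen pre-trained layer, so that fine-tuning on metamorphic data updates adapter weights alone. All class names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: a small trainable residual branch. Zero-initializing
    the up-projection makes it start as an identity mapping, so training
    begins from the pre-trained model's behavior."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class AdaptedLayer(nn.Module):
    """Wrap a frozen base layer (e.g., a spatial or temporal block of a
    pre-trained T2V model) with a trainable adapter. Attaching separate
    adapters to spatial and temporal layers is one way to decouple the
    two training stages, as MagicAdapter is described as doing."""
    def __init__(self, base: nn.Module, dim: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pre-trained weights stay frozen
        self.adapter = Adapter(dim)

    def forward(self, x):
        return self.adapter(self.base(x))
```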