Generative Image Dynamics
Authors: Zhengqi Li, Richard Tucker, Noah Snavely, Aleksander Holynski
What
This paper introduces a novel method for animating still images by predicting realistic, oscillatory motion using a learned image-space prior on scene dynamics.
Why
This work addresses the challenge of synthesizing realistic, temporally coherent motion in videos generated from a single image, which is crucial for producing believable animated content.
How
The authors represent motion with spectral volumes, a frequency-domain encoding of dense, long-term per-pixel motion trajectories, and train a latent diffusion model to predict these volumes from a single image. The predicted spectral volume is converted, via an inverse Fourier transform, into a motion texture spanning the full video, which an image-based rendering module then uses to animate the input image (see the sketch below).
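To make the representation concrete, here is a minimal illustrative sketch, assuming hypothetical array shapes and function names rather than the authors' actual code, of converting a spectral volume into a motion texture with an inverse FFT:

```python
import numpy as np

# Minimal illustrative sketch (hypothetical shapes and names, not the
# authors' code). A "spectral volume" S stores complex Fourier
# coefficients of every pixel's 2D motion trajectory for the first K
# temporal frequencies; an inverse FFT along the frequency axis turns
# it into a motion texture: one displacement field per output frame.

def spectral_volume_to_motion_texture(S, num_frames):
    """S: complex array (K, H, W, 2) of per-pixel (dx, dy) coefficients
    for K low-frequency bands. Returns real displacements of shape
    (num_frames, H, W, 2)."""
    K, H, W, _ = S.shape
    # Zero-pad the unmodeled high frequencies up to the FFT length.
    spectrum = np.zeros((num_frames, H, W, 2), dtype=np.complex64)
    spectrum[:K] = S
    # Inverse FFT over the frequency axis recovers the trajectories.
    return np.fft.ifft(spectrum, axis=0).real

# Example: K = 16 bands, a 64x64 image, a 150-frame video.
rng = np.random.default_rng(0)
S = (rng.standard_normal((16, 64, 64, 2))
     + 1j * rng.standard_normal((16, 64, 64, 2))).astype(np.complex64)
motion = spectral_volume_to_motion_texture(S, num_frames=150)
print(motion.shape)  # (150, 64, 64, 2)
```

Because natural oscillatory motion is dominated by low frequencies, populating only a small number of bands still yields smooth trajectories spanning the full frame count.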
Result
The paper demonstrates superior quantitative and qualitative results compared to existing single-image animation methods, producing more realistic and temporally consistent videos. The authors also showcase applications such as generating seamlessly looping videos and turning a single picture into an interactive dynamic image.
Limitations & Future Work
The authors acknowledge limitations in modeling non-oscillatory or high-frequency motion, as well as artifacts around thin objects and under large displacements. Future work could explore learned motion bases, handle more complex motion patterns, and address the challenge of generating content not visible in the input image.
Abstract
We present an approach to modeling an image-space prior on scene motion. Our prior is learned from a collection of motion trajectories extracted from real video sequences depicting natural, oscillatory dynamics such as trees, flowers, candles, and clothes swaying in the wind. We model this dense, long-term motion prior in the Fourier domain: given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a spectral volume, which can be converted into a motion texture that spans an entire video. Along with an image-based rendering module, these trajectories can be used for a number of downstream applications, such as turning still images into seamlessly looping videos, or allowing users to realistically interact with objects in real pictures by interpreting the spectral volumes as image-space modal bases, which approximate object dynamics.
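The interactive application rests on modal analysis: each frequency band of the spectral volume is treated as an image-space vibration mode, a user's poke excites those modes, and they then ring down as damped oscillators. The following is a hedged sketch under that interpretation; the function name, the simplified damping model, and the projection of the force onto the modes are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

# Hedged sketch of the interactive "modal basis" interpretation
# (assumed names and a simplified damping model, not the paper's code).
# Each frequency band of the spectral volume acts as a vibration mode;
# a poke at one pixel excites the modes, which ring down as damped
# oscillators, and their superposition drives the image warp.

def simulate_poke(modes, omegas, poke_yx, steps=200, dt=1 / 30.0, damping=0.05):
    """modes: complex (K, H, W, 2) mode shapes (the spectral volume);
    omegas: (K,) modal angular frequencies; poke_yx: pixel that is poked.
    Returns displacement fields of shape (steps, H, W, 2)."""
    K, H, W, _ = modes.shape
    y, x = poke_yx
    # Project an impulsive force at the poked pixel onto each mode.
    q = np.conj(modes[:, y, x, :]).sum(axis=-1) * dt  # modal coordinates
    out = np.zeros((steps, H, W, 2), dtype=np.float32)
    decay = np.exp((1j * omegas - damping) * dt)      # per-step update
    for t in range(steps):
        # Superpose the mode shapes weighted by the modal coordinates.
        out[t] = np.real(modes * q[:, None, None, None]).sum(axis=0)
        q = q * decay                                 # free ring-down
    return out

# Example: a 16-band spectral volume with made-up modal frequencies.
rng = np.random.default_rng(0)
modes = (rng.standard_normal((16, 64, 64, 2))
         + 1j * rng.standard_normal((16, 64, 64, 2)))
omegas = 2 * np.pi * np.linspace(0.5, 8.0, 16)  # 0.5-8 Hz, illustrative
fields = simulate_poke(modes, omegas, poke_yx=(32, 32))
print(fields.shape)  # (200, 64, 64, 2)
```

Summing the real part of each mode shape, weighted by its evolving modal coordinate, gives the displacement field used to warp the image at every time step.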