DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
Authors: Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan
What
DragNUWA is an open-domain, diffusion-based video generation model that provides fine-grained control over video content through text, image, and trajectory inputs, with a particular focus on addressing the limitations of trajectory control in open-domain scenarios.
Why
This paper tackles two key limitations of existing controllable video generation models: a lack of fine-grained control and a limited ability to handle complex trajectories in open-domain settings. DragNUWA’s Trajectory Sampler, Multiscale Fusion, and Adaptive Training components enable more comprehensive and user-friendly control over video generation, opening new avenues for creative applications.
How
DragNUWA conditions a diffusion-based video model on three inputs through a multi-stage training process. First, a Trajectory Sampler extracts diverse trajectories from open-domain videos. Then, a Multiscale Fusion module integrates text, image, and trajectory data at different resolutions within the UNet architecture. Finally, Adaptive Training progressively adapts the model from dense optical-flow conditions to user-defined sparse trajectories, ensuring stability and consistency in the generated videos (see the sketch below).
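To make the trajectory-conditioning idea concrete, here is a minimal, hypothetical sketch (not the paper's released code) of how sparse trajectories might be sampled from dense optical flow and smoothed with a Gaussian kernel before being passed to the diffusion UNet. The function name, anchor-selection heuristic, and kernel parameters are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def sample_sparse_trajectories(flow, num_tracks=8, kernel_size=99, sigma=10.0):
    """Hypothetical trajectory-sampler sketch.

    flow: dense optical flow of shape (T, 2, H, W) over T frames.
    Returns a sparse trajectory map of the same shape in which only a few
    anchor locations carry motion vectors, smoothed with a Gaussian kernel
    so the signal covers a local neighborhood.
    """
    T, _, H, W = flow.shape

    # Pick anchor points where motion is largest in the first frame.
    # (The exact selection heuristic here is an assumption.)
    magnitude = flow[0].pow(2).sum(dim=0).sqrt()          # (H, W)
    flat_idx = magnitude.flatten().topk(num_tracks).indices
    ys, xs = flat_idx // W, flat_idx % W

    # Keep only the sampled anchors, zero out everything else.
    mask = torch.zeros(1, 1, H, W)
    mask[0, 0, ys, xs] = 1.0
    sparse = flow * mask                                   # (T, 2, H, W)

    # Separable Gaussian smoothing of the sparse motion vectors.
    coords = torch.arange(kernel_size) - kernel_size // 2
    g = torch.exp(-coords.float() ** 2 / (2 * sigma ** 2))
    g = g / g.sum()

    x = sparse.reshape(T * 2, 1, H, W)
    x = F.conv2d(x, g.view(1, 1, 1, -1), padding=(0, kernel_size // 2))
    x = F.conv2d(x, g.view(1, 1, -1, 1), padding=(kernel_size // 2, 0))
    return x.reshape(T, 2, H, W)
```

In practice, the smoothed sparse map would be stacked with the image condition and fed into the fusion module at each UNet scale.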
Result
DragNUWA demonstrates superior performance in fine-grained video generation. It can follow complex object trajectories, including curved paths and varying motion amplitudes, and can also handle camera movements such as zooming in and out. The results highlight the importance of combining text, image, and trajectory inputs to control the semantic, spatial, and temporal aspects of video content.
Limitations & Future Work
The paper does not explicitly discuss limitations, but it notes that incorporating video as a condition is beyond the scope of this research; future work could explore video conditions for applications such as style transfer. Additionally, the paper focuses primarily on visual fidelity and controllability; improving the model’s ability to generate temporally consistent and logically coherent narratives would be a valuable direction for future research.
Abstract
Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models’ capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, we propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories in different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The homepage link is https://www.microsoft.com/en-us/research/project/dragnuwa/.
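As a rough illustration of the Multiscale Fusion (MF) idea, the sketch below projects a conditioning map (e.g., stacked image and trajectory channels) to several UNet resolutions and adds it residually to the corresponding feature maps, with text assumed to enter via cross-attention. The class name, channel sizes, and injection scheme are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleConditionFusion(nn.Module):
    """Hypothetical sketch: inject image/trajectory conditions at several
    UNet resolutions (text is assumed to enter via cross-attention)."""

    def __init__(self, cond_channels, unet_channels=(320, 640, 1280)):
        super().__init__()
        # One 1x1 projection per UNet scale, mapping the condition map
        # to that scale's channel width.
        self.proj = nn.ModuleList(
            nn.Conv2d(cond_channels, ch, kernel_size=1) for ch in unet_channels
        )

    def forward(self, cond, unet_features):
        # cond: (B, C, H, W) conditioning map (image + trajectory channels).
        # unet_features: list of UNet feature maps, one per scale.
        fused = []
        for proj, feat in zip(self.proj, unet_features):
            resized = F.interpolate(cond, size=feat.shape[-2:],
                                    mode="bilinear", align_corners=False)
            fused.append(feat + proj(resized))  # residual injection per scale
        return fused

# Example: fuse a 6-channel condition map into three feature scales.
fusion = MultiscaleConditionFusion(cond_channels=6)
cond = torch.randn(1, 6, 256, 256)
feats = [torch.randn(1, c, 256 // 2**i, 256 // 2**i)
         for i, c in enumerate((320, 640, 1280))]
out = fusion(cond, feats)
```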