Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer

Authors: Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, Tali Dekel

What

This paper introduces a novel method for text-driven motion transfer in videos, enabling the transfer of motion from a source video to a target object specified by a text prompt, even when the source and target objects have significant differences in shape and motion characteristics.

Why

This paper pushes motion transfer beyond previous methods, which are limited to subjects from the same or closely related object categories. It offers a zero-shot approach that leverages the generative prior of a pre-trained text-to-video diffusion model, making motion transfer more versatile and accessible.

How

The authors analyze the space-time features learned by a pre-trained text-to-video diffusion model and introduce a novel loss based on pairwise differences of the spatial marginal mean (SMM) of these features across frames. This loss guides the generation process to preserve the source video's motion characteristics while accommodating significant structural deviations between the source and target objects.
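
A minimal sketch of this loss, assuming space-time diffusion features of shape (T, C, H, W) for the source and generated videos; the function name, the feature-extraction details, and the mean-squared distance are illustrative assumptions rather than the paper's reference implementation:

```python
import torch

def smm_pairwise_diff_loss(src_feats: torch.Tensor, gen_feats: torch.Tensor) -> torch.Tensor:
    """Pairwise-difference loss over spatial marginal mean (SMM) features.

    src_feats, gen_feats: space-time diffusion features of shape (T, C, H, W),
    one feature map per frame for the source and the generated video.
    """
    # Spatial marginal mean: average over the spatial dimensions -> (T, C)
    smm_src = src_feats.mean(dim=(2, 3))
    smm_gen = gen_feats.mean(dim=(2, 3))

    # Pairwise differences between all frame pairs -> (T, T, C)
    diff_src = smm_src[:, None, :] - smm_src[None, :, :]
    diff_gen = smm_gen[:, None, :] - smm_gen[None, :, :]

    # Penalize deviation of the generated pairwise-difference tensor
    # from that of the source video.
    return ((diff_gen - diff_src) ** 2).mean()
```

The intuition behind this construction is that averaging out the spatial dimensions discards the object's shape and layout, while the frame-to-frame differences of the averaged features retain the overall motion, which is what allows large structural deviations between source and target.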

Result

The proposed method demonstrates state-of-the-art performance in preserving motion fidelity while adhering to the target text prompt. It outperforms existing methods in qualitative and quantitative comparisons, showing successful motion transfer across diverse object categories with significant shape variations. User studies further confirm that viewers prefer the generated videos, both for visual quality and for adherence to the target prompts.

Limitations and Future Work

The method’s reliance on the pre-trained text-to-video model’s generative capabilities poses limitations. The model’s training data might not encompass all possible object-motion combinations, leading to reduced motion fidelity or artifacts. Future work could explore larger and more diverse training datasets for text-to-video models and investigate alternative optimization strategies to further enhance motion fidelity in challenging cases.

Abstract

We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video’s motion and scene layout. Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable for limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits.
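
For intuition only, the sketch below shows one way such a feature loss could steer the denoising process: taking a gradient step on the noisy latent at each sampling step. The `denoiser` call, its `return_features` flag, and `feature_loss_fn` are hypothetical placeholders standing in for the fixed text-to-video model and the SMM loss above; this is not the paper's exact optimization procedure.

```python
import torch

def guided_denoising_step(latent, t, denoiser, feature_loss_fn, guidance_scale=1.0):
    """One guidance step: nudge the noisy latent so the model's space-time
    features better match the source video's motion, then return the usual
    noise prediction for the sampler to use."""
    latent = latent.detach().requires_grad_(True)
    # Hypothetical model call that also exposes intermediate space-time features;
    # how features are hooked out of a real model is an implementation detail.
    noise_pred, feats = denoiser(latent, t, return_features=True)
    loss = feature_loss_fn(feats)
    # Gradient of the feature loss with respect to the noisy latent.
    grad = torch.autograd.grad(loss, latent)[0]
    # Move the latent against the gradient before the sampler's update step.
    guided_latent = (latent - guidance_scale * grad).detach()
    return guided_latent, noise_pred.detach()
```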