MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion
Authors: Roy Kapon, Guy Tevet, Daniel Cohen-Or, Amit H. Bermano
What
This paper introduces Multi-view Ancestral Sampling (MAS), a novel method for generating 3D human and animal motions using 2D diffusion models trained on in-the-wild videos.
Why
This research is significant as it allows for 3D motion generation in domains where acquiring 3D data is expensive or impractical, such as basketball, horse racing, and rhythmic gymnastics.
How
The authors first train a 2D motion diffusion model on poses extracted from videos. Then, they utilize MAS, which simultaneously generates multiple 2D views of a 3D motion via ancestral sampling, ensuring consistency across views by triangulating the generated 2D poses into a 3D motion at each denoising step.
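The denoise-triangulate-reproject loop can be sketched in a toy form. This is a minimal illustration under stated assumptions, not the paper's implementation: `denoise_step` is a placeholder for one ancestral-sampling step of a trained 2D motion diffusion model, and the cameras are simple fixed orthographic projections so that multi-view triangulation reduces to a least-squares solve (the paper's actual model, cameras, and noise schedule differ).

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, J = 3, 8, 4   # number of views, frames, joints (toy sizes)
STEPS = 10          # denoising steps (toy schedule)

# Fixed orthographic cameras: each maps a 3D point (x, y, z) to 2D via a 2x3 matrix.
angles = np.linspace(0, np.pi, V, endpoint=False)
cams = np.stack([np.array([[np.cos(a), 0.0, np.sin(a)],
                           [0.0,       1.0, 0.0      ]]) for a in angles])

def denoise_step(x2d, t):
    """Placeholder for one ancestral-sampling step of a 2D diffusion model."""
    return x2d * (1.0 - 1.0 / (t + 1))  # toy dynamics, not a real denoiser

def triangulate(views2d):
    """Least-squares 3D recovery of each (frame, joint) point from all 2D views."""
    A = np.concatenate(cams, axis=0)                        # (2V, 3)
    b = views2d.transpose(1, 2, 0, 3).reshape(T * J, 2 * V)  # stacked view coords
    X, *_ = np.linalg.lstsq(A, b.T, rcond=None)              # (3, T*J)
    return X.T.reshape(T, J, 3)

# MAS loop (sketch): denoise each view, fuse into one 3D motion, reproject
# the 3D motion back to every view so all views stay consistent.
x2d = rng.standard_normal((V, T, J, 2))  # start each view from Gaussian noise
for t in reversed(range(STEPS)):
    x2d = np.stack([denoise_step(x2d[v], t) for v in range(V)])
    motion3d = triangulate(x2d)                       # unified 3D sequence
    x2d = np.einsum('vcd,tjd->vtjc', cams, motion3d)  # project back to views

print(motion3d.shape)  # (T, J, 3): one multi-view-consistent 3D motion
```

By construction, after each reprojection every 2D view is an exact projection of the same 3D sequence, which is the consistency property MAS enforces at every diffusion step.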
Result
MAS generates diverse and realistic 3D motions, outperforming existing pose-lifting methods and a DreamFusion adaptation for unconditional motion generation. Because it relies on ancestral sampling rather than per-sample optimization, the method is faster than Score Distillation-based approaches and avoids common issues such as out-of-distribution sampling and mode collapse.
Limitations & Future Work
Limitations include occasional character self-intersection and scale inconsistencies. Future work could address predicting global position, enabling textual control, and extending the method to multi-person interactions, hand and face motions, and complex object manipulations.
Abstract
We introduce Multi-view Ancestral Sampling (MAS), a method for 3D motion generation, using 2D diffusion models that were trained on motions obtained from in-the-wild videos. As such, MAS opens opportunities to exciting and diverse fields of motion previously under-explored as 3D data is scarce and hard to collect. MAS works by simultaneously denoising multiple 2D motion sequences representing different views of the same 3D motion. It ensures consistency across all views at each diffusion step by combining the individual generations into a unified 3D sequence, and projecting it back to the original views. We demonstrate MAS on 2D pose data acquired from videos depicting professional basketball maneuvers, rhythmic gymnastic performances featuring a ball apparatus, and horse races. In each of these domains, 3D motion capture is arduous, and yet, MAS generates diverse and realistic 3D sequences. Unlike the Score Distillation approach, which optimizes each sample by repeatedly applying small fixes, our method uses a sampling process that was constructed for the diffusion framework. As we demonstrate, MAS avoids common issues such as out-of-domain sampling and mode-collapse. https://guytevet.github.io/mas-page/