MVDream: Multi-view Diffusion for 3D Generation
Authors: Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, Xiao Yang
What
This paper introduces MVDream, a multi-view diffusion model that tackles the multi-view consistency problem in text-to-3D generation. By training on both large-scale 2D image data and renderings of 3D data, it aims to combine the generalizability of 2D diffusion models with the consistency required for high-quality generated 3D models.
Why
This work is important because it presents a novel approach to address the long-standing challenge of multi-view consistency in text-to-3D generation, which is crucial for creating high-quality and realistic 3D content. MVDream’s ability to leverage pre-trained 2D diffusion models and adapt them for multi-view consistency opens new avenues for efficient and robust 3D content creation.
How
The authors propose a multi-view diffusion model that incorporates 3D self-attention and camera embeddings into a pre-trained 2D diffusion model. They train this model on a combination of 3D rendered data and a large-scale text-to-image dataset. For 3D generation, they employ score distillation sampling (SDS), utilizing their multi-view diffusion model as a prior. They further introduce a multi-view DreamBooth technique for personalized 3D generation.
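To make the two architectural changes concrete, the sketch below shows one way they could look in PyTorch. It is a minimal illustration, not the authors' code: the module names, the use of nn.MultiheadAttention, and the assumption that camera extrinsics are flattened into a 16-dimensional vector are all illustrative. Cross-view ("3D") self-attention is obtained by reshaping so that tokens from every view of the same scene attend to one another, and the camera embedding is produced by a small MLP and added to the diffusion timestep embedding.

```python
import torch
import torch.nn as nn

class InflatedSelfAttention(nn.Module):
    """Cross-view ("3D") self-attention: tokens from all views of a scene
    attend to each other instead of staying within a single image."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Stand-in for the pre-trained 2D self-attention layer whose weights
        # would be reused when the layer is inflated to multiple views.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_views: int) -> torch.Tensor:
        # x: (batch * num_views, tokens, dim), with the views of each scene
        # stored contiguously along the batch dimension.
        bv, n, d = x.shape
        b = bv // num_views
        x = x.reshape(b, num_views * n, d)   # merge views into one token set
        out, _ = self.attn(x, x, x)          # joint attention across views
        return out.reshape(bv, n, d)         # back to the per-view layout


class CameraConditioning(nn.Module):
    """Camera embedding: flattened extrinsics pass through a small MLP and are
    added to the timestep embedding (dimensions here are illustrative)."""

    def __init__(self, emb_dim: int, cam_dim: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cam_dim, emb_dim),
            nn.SiLU(),
            nn.Linear(emb_dim, emb_dim),
        )

    def forward(self, t_emb: torch.Tensor, cam: torch.Tensor) -> torch.Tensor:
        # t_emb: (batch * num_views, emb_dim); cam: (batch * num_views, cam_dim)
        return t_emb + self.mlp(cam)
```

A design point worth noting under this reading: because the inflated layer can reuse the existing 2D self-attention weights, the model can retain the generalization of its 2D initialization while learning cross-view consistency from multi-view renderings.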
Result
MVDream demonstrates superior multi-view consistency and overall quality in generated 3D models compared to existing state-of-the-art methods. Notably, it mitigates the Janus problem (multi-face issue) commonly observed in other approaches. User studies confirm the improved robustness and quality of MVDream’s generated 3D assets. Furthermore, the model exhibits good generalization ability, effectively generating 3D content from unseen prompts and in diverse styles.
LF
The authors acknowledge limitations such as the current model’s lower resolution compared to some existing models and the potential for bias inherited from the base Stable Diffusion model. They suggest addressing these limitations by increasing the dataset size, adopting larger base diffusion models (e.g., SDXL), and using more diverse and realistic 3D rendering datasets. Future work may extend the model to handle a larger number of non-orthogonal camera views, further improving its generalizability.
Abstract
We introduce MVDream, a diffusion model that is able to generate consistent multi-view images from a given text prompt. Learning from both 2D and 3D data, a multi-view diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings. We demonstrate that such a multi-view diffusion model is implicitly a generalizable 3D prior agnostic to 3D representations. It can be applied to 3D generation via Score Distillation Sampling, significantly enhancing the consistency and stability of existing 2D-lifting methods. It can also learn new concepts from a few 2D examples, akin to DreamBooth, but for 3D generation.
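Since the central claim is that the multi-view model acts as a representation-agnostic 3D prior through Score Distillation Sampling, it may help to recall the standard SDS gradient that 2D-lifting methods optimize: roughly w(t) · (ε̂(x_t; y, c, t) − ε) · ∂x/∂θ, where x = g(θ, c) is a differentiable rendering of the 3D representation θ from camera c. The snippet below is a hedged sketch of one such update using a frozen multi-view prior; the renderer, the noise_pred callable, and the toy noise schedule are hypothetical placeholders rather than the paper's implementation.

```python
import torch

def sds_step(renderer, noise_pred, cameras, optimizer, t_range=(0.02, 0.98)):
    """Sketch of one score-distillation (SDS) update with a frozen multi-view
    diffusion prior: render several views of the current 3D scene, noise them
    with a shared timestep, and push the scene parameters along
    w(t) * (eps_hat - eps), treating the residual as a constant."""
    images = renderer(cameras)                 # (V, C, H, W); differentiable w.r.t. the scene
    t = torch.empty(1).uniform_(*t_range).item()
    alpha_bar = torch.cos(torch.tensor(t) * torch.pi / 2) ** 2  # toy cosine schedule
    noise = torch.randn_like(images)
    noisy = alpha_bar.sqrt() * images + (1.0 - alpha_bar).sqrt() * noise

    with torch.no_grad():                      # the diffusion prior stays frozen
        eps_hat = noise_pred(noisy, t, cameras)

    w = 1.0 - alpha_bar                        # one common SDS weighting choice
    # d(loss)/d(scene params) == w * (eps_hat - noise) * d(images)/d(params),
    # i.e. exactly the SDS gradient, because the residual is detached.
    loss = (w * (eps_hat - noise).detach() * images).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```

The MVDream-specific ingredient in this picture is that noise_pred denoises all views jointly, conditioned on their cameras, which is what is credited with suppressing view-inconsistent artifacts such as the Janus problem.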