Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior

Authors: Fangfu Liu, Diankun Wu, Yi Wei, Yongming Rao, Yueqi Duan

What

This paper introduces Sherpa3D, a text-to-3D generation framework that uses a coarse 3D prior from a 3D diffusion model to guide 2D diffusion-based lifting, achieving high-fidelity, diverse, and multi-view-consistent 3D content.

Why

The paper addresses limitations in existing text-to-3D methods, which often struggle with either limited generalizability and quality (3D diffusion models) or multi-view inconsistency (2D lifting methods). Sherpa3D bridges this gap by combining the strengths of both approaches, offering a promising solution for efficient and high-quality 3D content creation.

How

Sherpa3D employs a three-stage process (sketched in code below):
1) Generate a coarse 3D prior with a 3D diffusion model.
2) Derive structural and semantic guidance from that prior to steer the 2D lifting optimization.
3) Fold the 3D guidance into the score distillation sampling (SDS) loss, using an annealing schedule to balance the influence of 3D guidance against 2D refinement.
This pipeline lets Sherpa3D produce detailed, multi-view-consistent 3D objects from text prompts.
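A minimal PyTorch sketch of stage 3 follows. The linear annealing schedule, the MSE/cosine forms of the two guidance terms, and all names here are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def guidance_weight(step: int, total_steps: int, w0: float = 1.0) -> float:
    """Linearly anneal the 3D-guidance weight toward zero, so the coarse
    prior dominates early (locking in geometry) and the 2D diffusion
    model dominates late (refining texture and detail).
    Illustrative schedule; the paper's exact annealing may differ."""
    return w0 * max(0.0, 1.0 - step / total_steps)

def combined_loss(l_sds: torch.Tensor,
                  rendered_normals: torch.Tensor,  # normals of current geometry
                  prior_normals: torch.Tensor,     # normals rendered from coarse prior
                  rendered_feats: torch.Tensor,    # per-view semantic features
                  prior_feats: torch.Tensor,       # per-view features of the prior
                  step: int, total_steps: int) -> torch.Tensor:
    """Blend the SDS loss with structural guidance (geometric fidelity)
    and semantic guidance (cross-view coherence). Both guidance terms
    are simplified stand-ins for the paper's formulations."""
    lam = guidance_weight(step, total_steps)
    l_struct = F.mse_loss(rendered_normals, prior_normals)
    l_sem = 1.0 - F.cosine_similarity(rendered_feats, prior_feats, dim=-1).mean()
    return l_sds + lam * (l_struct + l_sem)
```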

Result

Sherpa3D demonstrates superior performance over existing text-to-3D methods in both qualitative and quantitative evaluations. It generates high-fidelity 3D assets with compelling texture quality and multi-view consistency, outperforming baselines on CLIP R-Precision and in user studies of quality and consistency. Sherpa3D is also efficient, generating a 3D model from a text prompt in only 25 minutes.
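For context, CLIP R-Precision measures how often a rendered view of a generated asset retrieves its own source prompt from the full pool of prompts by CLIP similarity. Below is a minimal top-1 sketch, assuming the `openai/clip-vit-base-patch32` checkpoint via Hugging Face `transformers`; the paper may evaluate with a different CLIP variant.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint choice is an assumption; the paper may use a different CLIP model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_r_precision(images: list[Image.Image], prompts: list[str]) -> float:
    """Top-1 CLIP R-Precision: images[i] should rank prompts[i] highest
    among all prompts by cosine similarity of CLIP embeddings."""
    inputs = processor(text=prompts, images=images,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = img @ txt.T                                # (n_images, n_prompts)
    hits = sims.argmax(dim=-1) == torch.arange(len(images))
    return hits.float().mean().item()
```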

Limitations & Future Work

The authors acknowledge that the quality of Sherpa3D’s output is inherently limited by the underlying 2D and 3D diffusion models used. Future work could explore leveraging larger, more advanced diffusion models (e.g., SDXL, DeepFloyd) to further enhance the generation quality. Additionally, the authors are interested in extending Sherpa3D’s capabilities to more complex and creative tasks, such as text-to-4D generation.

Abstract

Recently, 3D content creation from text prompts has demonstrated remarkable progress by utilizing 2D and 3D diffusion models. While 3D diffusion models ensure great multi-view consistency, their ability to generate high-quality and diverse 3D assets is hindered by the limited availability of 3D data. In contrast, 2D diffusion models offer a distillation approach that achieves excellent generalization and rich details without any 3D data. However, 2D lifting methods suffer from inherent view-agnostic ambiguity, thereby leading to serious multi-face Janus issues, where text prompts fail to provide sufficient guidance to learn coherent 3D results. Instead of retraining a costly viewpoint-aware model, we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper, we propose Sherpa3D, a new text-to-3D framework that achieves high fidelity, generalizability, and geometric consistency simultaneously. Specifically, we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: structural guidance for geometric fidelity and semantic guidance for 3D coherence. Employing the two types of guidance, the 2D diffusion model enriches the 3D content with diversified and high-quality results. Extensive experiments show the superiority of our Sherpa3D over state-of-the-art text-to-3D methods in terms of quality and 3D consistency.
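As one concrete way to ground the abstract's "structural guidance for geometric fidelity": geometry from the coarse prior can be exposed to the 2D optimization through normal maps. The sketch below derives a normal map from a rendered depth map by finite differences; this is a common construction assumed here for illustration, not necessarily the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def depth_to_normals(depth: torch.Tensor) -> torch.Tensor:
    """Approximate surface normals from an (H, W) depth map via finite
    differences, illustrating how a coarse prior's geometry can supply
    a structural target for the 2D lifting optimization."""
    dz_dx = depth[:, 1:] - depth[:, :-1]      # horizontal gradient, (H, W-1)
    dz_dy = depth[1:, :] - depth[:-1, :]      # vertical gradient, (H-1, W)
    dz_dx = F.pad(dz_dx, (0, 1))              # zero-pad back to (H, W)
    dz_dy = F.pad(dz_dy, (0, 0, 0, 1))
    normals = torch.stack([-dz_dx, -dz_dy, torch.ones_like(depth)], dim=-1)
    return F.normalize(normals, dim=-1)       # unit normals, (H, W, 3)
```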