CAD: Photorealistic 3D Generation via Adversarial Distillation

Authors: Ziyu Wan, Despoina Paschalidou, Ian Huang, Hongyu Liu, Bokui Shen, Xiaoyu Xiang, Jing Liao, Leonidas Guibas

What

This paper introduces Consistent Adversarial Distillation (CAD), a novel method for synthesizing high-quality, photorealistic 3D objects from a single image and text prompt by leveraging pre-trained 2D diffusion models and addressing limitations of existing score distillation methods.

Why

This work is important because it addresses key limitations of previous score-distillation-based 3D generation techniques, such as over-saturation, over-smoothing, and limited diversity. By directly modeling the distribution of a pre-trained diffusion model through adversarial learning, it achieves higher-quality and more diverse 3D object synthesis.

How

The authors propose a framework in which a StyleGAN2-based generator models a distribution over 3D objects and is trained adversarially against a discriminator so that its multi-view renderings match the output distribution of a pre-trained 2D diffusion model. To ensure multi-view consistency and high-fidelity generation, they employ a two-stage training process with 2D and 3D upsampling branches, a camera pose pruning strategy that filters inconsistent samples, and a distribution refinement step using additional diffusion models.
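Below is a minimal, self-contained PyTorch sketch of the adversarial distillation idea described above. It is not the authors' implementation: the generator, discriminator, optimizers, and the placeholder batch of "diffusion prior" images are simplified stand-ins, and the generator here emits 2D views directly rather than rendering a 3D representation. The loss is the standard non-saturating GAN objective used by StyleGAN2.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class TinyGenerator(nn.Module):
        """Stand-in for the StyleGAN2-based generator (latent code -> one 'rendered' view)."""

        def __init__(self, latent_dim=64, image_size=32):
            super().__init__()
            self.image_size = image_size
            self.net = nn.Sequential(
                nn.Linear(latent_dim, 256),
                nn.ReLU(),
                nn.Linear(256, 3 * image_size * image_size),
                nn.Tanh(),
            )

        def forward(self, z):
            return self.net(z).view(-1, 3, self.image_size, self.image_size)


    class TinyDiscriminator(nn.Module):
        """Stand-in discriminator scoring images as 'diffusion sample' vs. 'rendering'."""

        def __init__(self, image_size=32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Flatten(),
                nn.Linear(3 * image_size * image_size, 256),
                nn.LeakyReLU(0.2),
                nn.Linear(256, 1),
            )

        def forward(self, x):
            return self.net(x)


    def train_step(G, D, opt_g, opt_d, diffusion_batch, latent_dim=64):
        """One adversarial step: renderings ('fake') vs. diffusion-model samples ('real')."""
        z = torch.randn(diffusion_batch.size(0), latent_dim)

        # Discriminator update: non-saturating logistic loss (as in StyleGAN2).
        fake = G(z).detach()
        d_loss = F.softplus(-D(diffusion_batch)).mean() + F.softplus(D(fake)).mean()
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Generator update: make renderings indistinguishable from the diffusion prior.
        fake = G(z)
        g_loss = F.softplus(-D(fake)).mean()
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
        return d_loss.item(), g_loss.item()


    if __name__ == "__main__":
        G, D = TinyGenerator(), TinyDiscriminator()
        opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.0, 0.99))
        opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.0, 0.99))
        # Placeholder "real" batch; in the actual pipeline these would be multi-view
        # samples produced by a pre-trained 2D diffusion model for the input image/prompt.
        diffusion_batch = torch.rand(8, 3, 32, 32) * 2 - 1
        for _ in range(3):
            print(train_step(G, D, opt_g, opt_d, diffusion_batch))

In the full pipeline, the "fake" branch would render views of the generated 3D representation from sampled camera poses, and the "real" branch would draw view-conditioned samples from the frozen diffusion model.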

Result

CAD generates high-fidelity 3D objects with photorealistic textures and fewer artifacts than existing methods such as DreamFusion, ProlificDreamer, Magic123, and Zero-1-to-3. It also performs better on quantitative metrics such as the CLIP similarity score and in qualitative evaluations, including a user study, highlighting its ability to produce diverse and realistic 3D objects.
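The CLIP similarity score mentioned above measures image-text alignment. As an illustrative sketch (the paper's exact CLIP variant and evaluation protocol are not specified here), the snippet below computes the cosine similarity between a rendered view and the text prompt using the openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers; higher values indicate closer alignment.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Assumption: CLIP ViT-B/32; the checkpoint used in the paper may differ.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    model.eval()


    def clip_similarity(image: Image.Image, prompt: str) -> float:
        """Cosine similarity between CLIP embeddings of a rendered view and a prompt."""
        inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
        txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
        return (img @ txt.T).item()


    if __name__ == "__main__":
        # Dummy gray image standing in for a rendered view of a generated 3D object.
        view = Image.new("RGB", (224, 224), color=(128, 128, 128))
        print(clip_similarity(view, "a photorealistic 3D object"))

In practice such a score is typically averaged over many rendered viewpoints per object and over many objects before comparing methods.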

Limitations and Future Work

The authors acknowledge limitations in optimization speed due to volumetric rendering and suggest exploring efficient rendering techniques like Gaussian Splatting. They also propose future work on enabling multi-conditional generation and extending CAD to handle scene-level synthesis.

Abstract

The increased demand for 3D data in AR/VR, robotics, and gaming applications gave rise to powerful generative pipelines capable of synthesizing high-quality 3D objects. Most of these models rely on the Score Distillation Sampling (SDS) algorithm to optimize a 3D representation such that the rendered image maintains a high likelihood as evaluated by a pre-trained diffusion model. However, finding a correct mode in the high-dimensional distribution produced by the diffusion model is challenging and often leads to issues such as over-saturation, over-smoothing, and Janus-like artifacts. In this paper, we propose a novel learning paradigm for 3D synthesis that utilizes pre-trained diffusion models. Instead of focusing on mode-seeking, our method directly models the distribution discrepancy between multi-view renderings and diffusion priors in an adversarial manner, which unlocks the generation of high-fidelity and photorealistic 3D content, conditioned on a single image and prompt. Moreover, by harnessing the latent space of GANs and expressive diffusion model priors, our method facilitates a wide variety of 3D applications including single-view reconstruction, high diversity generation and continuous 3D interpolation in the open domain. The experiments demonstrate the superiority of our pipeline compared to previous works in terms of generation quality and diversity.
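For context on the SDS algorithm referenced above (background from DreamFusion, not part of this paper's contribution): SDS optimizes the parameters \theta of a differentiable 3D representation with rendering x = g(\theta) using the gradient

    \nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\epsilon_\phi(x_t;\, y, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right],

where x_t is the rendering noised to timestep t, \epsilon_\phi is the pre-trained diffusion model's noise prediction conditioned on the prompt y, and w(t) is a timestep weighting. CAD's adversarial formulation replaces this per-sample, mode-seeking update with a distribution-matching objective.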