Score Distillation Sampling with Learned Manifold Corrective
Authors: Thiemo Alldieck, Nikos Kolotouros, Cristian Sminchisescu
What
This paper presents an analysis of the Score Distillation Sampling (SDS) loss function, identifies a noise issue in its gradients, and proposes a solution called Learned Manifold Corrective SDS (LMC-SDS) to improve gradient quality and reduce reliance on high guidance weights.
Why
This paper matters because it addresses limitations of SDS, a popular method for using pre-trained diffusion models as priors in tasks such as image synthesis, editing, and 3D generation. Improving the SDS loss enables more stable optimization, better image fidelity, and wider applicability.
How
The authors decompose the SDS loss, identify a problematic term that causes noisy gradients, and propose LMC-SDS, which models and factors out the time-step-dependent image corruption in the denoising process. They train a shallow network to approximate this corruption and use it to correct the gradients, promoting movement towards the manifold of natural images. They demonstrate the effectiveness of LMC-SDS through qualitative and quantitative experiments on image synthesis, editing, image translation network training, and 3D asset generation.
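The decomposition step can be illustrated with a small numerical sketch. It assumes the standard classifier-free guidance form, in which the guided noise prediction is the unconditional prediction plus a guidance weight times the conditional-minus-unconditional difference. Under that assumption the SDS update direction splits exactly into a guidance-scaled conditioning term and a guidance-independent residual, and the residual can be rewritten in image space as a scaled difference between the current image and its denoised estimate. The noise predictions below are hypothetical stand-ins rather than outputs of a real diffusion model, and the names (`cfg_noise`, `cond_term`, `residual_term`) are illustrative, not the paper's notation.

```python
import math
import random


def cfg_noise(eps_uncond, eps_cond, guidance):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one, scaled by the guidance weight."""
    return [u + guidance * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def sds_direction(eps_hat, eps):
    """Per-pixel SDS direction (eps_hat - eps). The weighting w(t) and the
    Jacobian of the image w.r.t. the parameters are omitted for brevity."""
    return [h - e for h, e in zip(eps_hat, eps)]


rng = random.Random(0)
n = 4                         # toy "image" with four pixels
alpha_t, sigma_t = 0.8, 0.6   # assumed noise-schedule values at one timestep
guidance = 7.5                # assumed guidance weight

x = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # current image
eps = [rng.gauss(0.0, 1.0) for _ in range(n)]    # sampled noise
z_t = [alpha_t * xi + sigma_t * ei for xi, ei in zip(x, eps)]

# Hypothetical noise predictions (a real model would produce these).
eps_uncond = [0.9 * e + 0.1 * z for e, z in zip(eps, z_t)]
eps_cond = [u + 0.05 * math.sin(i) for i, u in enumerate(eps_uncond)]

# Full SDS direction under guidance.
full = sds_direction(cfg_noise(eps_uncond, eps_cond, guidance), eps)

# The two factors of the decomposition.
cond_term = [guidance * (c - u) for u, c in zip(eps_uncond, eps_cond)]
residual_term = [u - e for u, e in zip(eps_uncond, eps)]
recombined = [a + b for a, b in zip(cond_term, residual_term)]

# Image-space view of the residual: (alpha_t / sigma_t) * (x - x_hat),
# where x_hat is the denoised estimate implied by eps_uncond.
x_hat = [(z - sigma_t * u) / alpha_t for z, u in zip(z_t, eps_uncond)]
image_space = [(alpha_t / sigma_t) * (xi - xh) for xi, xh in zip(x, x_hat)]

print(max(abs(a - b) for a, b in zip(full, recombined)))
print(max(abs(a - b) for a, b in zip(residual_term, image_space)))
```

Both printed differences should be at floating-point rounding level. Only the first term grows with the guidance weight, so raising the weight increases the conditioning signal relative to the residual. This is one way to read why high guidance is used to compensate for noisy gradients. The learned corrective itself is not sketched here, since its exact form is not given in this summary.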
Result
The proposed LMC-SDS loss leads to: 1) More stable optimization with less reliance on high guidance weights, resulting in less saturated colors and fewer artifacts. 2) Higher fidelity results in image synthesis and editing tasks, better preserving image structure while achieving significant edits. 3) Improved performance in training image-to-image translation networks, as demonstrated by the "cats-to-others" experiment. 4) Enhanced detail and reduced Janus problem in 3D asset generation using DreamFusion.
Limitations and Future Work
The paper acknowledges that LMC-SDS might not perform well if the diffusion model doesn't understand the prompt or if the optimization strays too far from the natural image manifold. Future work includes further improving the manifold corrective and applying the findings to specific applications like text-to-3D and image editing.
Abstract
Score Distillation Sampling (SDS) is a recent but already widely popular method that relies on an image diffusion model to control optimization problems using text prompts. In this paper, we conduct an in-depth analysis of the SDS loss function, identify an inherent problem with its formulation, and propose a surprisingly easy but effective fix. Specifically, we decompose the loss into different factors and isolate the component responsible for noisy gradients. In the original formulation, high text guidance is used to account for the noise, leading to unwanted side effects. Instead, we train a shallow network mimicking the timestep-dependent denoising deficiency of the image diffusion model in order to effectively factor it out. We demonstrate the versatility and the effectiveness of our novel loss formulation through several qualitative and quantitative experiments, including optimization-based image synthesis and editing, zero-shot image translation network training, and text-to-3D synthesis.