Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models

Authors: Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, Jaakko Lehtinen

What

This paper introduces a technique for improving image generation in diffusion models by limiting classifier-free guidance (CFG) to a specific interval of noise levels during sampling.

Why

This work addresses a weakness of standard CFG: applying a constant guidance weight throughout the sampling process limits both image quality and inference speed. Confining CFG to a specific range of noise levels allows higher guidance weights where they actually help, substantially improving image fidelity while reducing computational cost.
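To give a concrete sense of the speed benefit (an illustrative calculation, not a figure from the paper): standard CFG needs two denoiser evaluations per step, one conditional and one unconditional, so a sampler with N steps costs 2N network evaluations. If guidance is active in only K of those steps, the unconditional pass can be skipped elsewhere and the cost drops to N + K evaluations.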

How

The authors begin by analyzing the impact of CFG at different noise levels using both a theoretical framework and empirical observations. They demonstrate that CFG is detrimental at high noise levels, largely unnecessary at low levels, and most beneficial in the middle stages of the sampling chain. Based on this insight, they propose a modified sampling ODE in which guidance is applied only within a specific noise-level interval. They evaluate the approach quantitatively (FID, FD_DINOv2) on ImageNet-512 and qualitatively on images generated with both ImageNet models and Stable Diffusion XL. Ablation studies demonstrate the impact of varying guidance intervals and weights.
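Below is a minimal sketch of how interval-limited CFG could look inside a single denoising step. The function name, signature, interval endpoints, and guidance weight are illustrative assumptions, not the authors' implementation; sigma is treated as a scalar noise level.

    def denoise_with_limited_guidance(denoiser, x, sigma, cond,
                                      weight=2.0, sigma_lo=0.3, sigma_hi=5.0):
        # Conditional prediction is always needed.
        d_cond = denoiser(x, sigma, cond)
        # Apply classifier-free guidance only inside the chosen noise-level interval.
        if sigma_lo < sigma <= sigma_hi:
            d_uncond = denoiser(x, sigma, None)  # extra unconditional forward pass
            # Standard CFG combination: extrapolate away from the unconditional prediction.
            return d_uncond + weight * (d_cond - d_uncond)
        # Outside the interval: plain conditional denoising, no extra evaluation.
        return d_cond

In an ODE sampler loop this would replace the usual CFG-combined denoiser call, with sigma_lo, sigma_hi, and weight exposed as hyperparameters to be tuned per model.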

Result

The proposed method achieves state-of-the-art FID scores on ImageNet-512, surpassing previous records by a significant margin: FID improves from 2.23 to 1.68 with EDM2-S and from 1.81 to 1.40 with EDM2-XXL. Qualitative results show that limiting the guidance interval preserves image diversity and reduces the color-saturation artifacts commonly observed with high guidance weights in standard CFG. The technique is effective across different sampler parameters, network architectures, and datasets, including Stable Diffusion XL.

Limitations and Future Work

The authors acknowledge that while their method significantly improves performance, future work could explore automatically determining the optimal guidance interval directly from the ODE. Additionally, further research is needed to understand the role of non-ideal, trained denoisers in the context of this technique.

Abstract

Guidance is a crucial technique for extracting the best performance out of image-generating diffusion models. Traditionally, a constant guidance weight has been applied throughout the sampling chain of an image. We show that guidance is clearly harmful toward the beginning of the chain (high noise levels), largely unnecessary toward the end (low noise levels), and only beneficial in the middle. We thus restrict it to a specific range of noise levels, improving both the inference speed and result quality. This limited guidance interval improves the record FID in ImageNet-512 significantly, from 1.81 to 1.40. We show that it is quantitatively and qualitatively beneficial across different sampler parameters, network architectures, and datasets, including the large-scale setting of Stable Diffusion XL. We thus suggest exposing the guidance interval as a hyperparameter in all diffusion models that use guidance.