Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance
Authors: Dazhong Shen, Guanglu Song, Zeyue Xue, Fu-Yun Wang, Yu Liu
What
This paper introduces Semantic-aware Classifier-Free Guidance (S-CFG), a novel approach for enhancing text-to-image diffusion models by dynamically adjusting guidance degrees for different semantic regions within an image during the denoising process.
Why
The paper addresses a limitation of the conventional global CFG scale: because a single scalar applies the same text-guidance strength to every spatial location, it often produces spatially inconsistent semantic strengths and uneven image quality. By customizing the guidance for each semantic unit, S-CFG aims to improve overall image quality and better align generations with their text prompts.
How
The authors propose a two-step method: 1) segment the latent image into semantic regions at each denoising step with a training-free procedure based on the cross-attention and self-attention maps from the U-net backbone; 2) adaptively adjust the CFG scale for each region so that the classifier-score norm is unified across regions, balancing the amplification of the various semantic units. A sketch of both steps follows below.
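The following is a minimal PyTorch sketch of these two steps. The tensor shapes, the renormalization direction, and the benchmark choice are illustrative assumptions for exposition, not the authors' exact implementation (see the linked repository for that).

```python
import torch

def segment_patches(cross_attn, self_attn):
    """Step 1 (sketch): training-free semantic segmentation.

    cross_attn: (P, T) attention from P latent patches to T text tokens.
    self_attn:  (P, P) patch-to-patch self-attention.
    Returns a hard patch-to-token assignment of shape (P,).
    """
    # Renormalize so tokens compete on a comparable footing before each
    # patch is assigned to a token (the paper's exact scheme may differ).
    norm_cross = cross_attn / (cross_attn.sum(dim=0, keepdim=True) + 1e-8)
    # Use self-attention to complete the regions: propagate token evidence
    # along patch-patch affinities, smoothing fragmented assignments.
    completed = self_attn @ norm_cross          # (P, T)
    return completed.argmax(dim=-1)


def s_cfg_step(eps_uncond, eps_cond, seg, base_scale=7.5):
    """Step 2 (sketch): per-region CFG rescaling.

    eps_uncond, eps_cond: (C, P) unconditional / conditional noise predictions.
    seg: (P,) patch-to-token assignment from segment_patches().
    """
    guidance = eps_cond - eps_uncond            # classifier-score direction
    scales = torch.ones_like(guidance[:1])      # (1, P) per-patch scale map
    # Benchmark level: the global mean guidance norm here; the paper
    # benchmarks against the foreground region instead.
    bench = guidance.norm(dim=0).mean()
    for token_id in seg.unique():
        mask = seg == token_id
        region_norm = guidance[:, mask].norm(dim=0).mean()
        # Rescale each region so its guidance norm matches the benchmark.
        scales[:, mask] = bench / (region_norm + 1e-8)
    return eps_uncond + base_scale * scales * guidance
```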
Result
Experiments on different diffusion models (Stable Diffusion v1.5/v2.1, DeepFloyd IF) demonstrate that S-CFG consistently outperforms the original CFG method in terms of FID-30K and CLIP Score. Qualitative results showcase notable improvements in semantic expressiveness, entity portrayal, and fine-grained details. Ablation studies highlight the effectiveness of key components like self-attention-based segmentation completion and foreground region benchmarking.
Limitations & Future Work
The paper acknowledges that the assumption of independence among semantic units may not always hold. Future work could explore more sophisticated methods for modeling interdependencies between regions, investigate the impact of different benchmark regions, and evaluate the generalizability of S-CFG to other diffusion models and downstream tasks.
Abstract
Classifier-Free Guidance (CFG) has been widely used in text-to-image diffusion models, where the CFG scale is introduced to control the strength of text guidance on the whole image space. However, we argue that a global CFG scale results in spatial inconsistency on varying semantic strengths and suboptimal image quality. To address this problem, we present a novel approach, Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance degrees for different semantic units in text-to-image diffusion models. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. In particular, the cross-attention map in the denoising U-net backbone is renormalized to assign each patch to the corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic units, we adaptively adjust the CFG scales across different semantic regions to rescale the text guidance degrees into a uniform level. Finally, extensive experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models, without requiring any extra training cost. Our code is available at https://github.com/SmilesDZgk/S-CFG.
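For intuition, here is a schematic contrast between standard CFG and the per-region rescaling described above; the notation is ours, not the paper's exact formulation.

```latex
% Standard CFG: one global scale s across all spatial locations.
\hat{\epsilon}_t = \epsilon_\theta(x_t, \varnothing)
    + s\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)

% S-CFG (schematic): a per-region scale s_r, chosen so the guidance norm
% restricted to each semantic region R_r matches that of a benchmark
% region R_b (the foreground region in the paper):
s_r = s \cdot
    \frac{\lVert \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \rVert_{R_b}}
         {\lVert \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \rVert_{R_r}}
```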