FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis
Authors: Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, Hongsheng Li
What
This paper introduces FouriScale, a training-free method to enable pre-trained diffusion models to synthesize high-resolution images without repetitive patterns or structural distortions.
Why
The paper addresses a critical limitation of diffusion models: they are typically trained at a fixed resolution, which hinders their ability to generate high-quality images at arbitrary sizes. FouriScale offers a simple yet effective solution to this problem, making it highly relevant for applications that require high-resolution image generation.
How
FouriScale modifies the convolutional layers within the diffusion model’s UNet architecture. It replaces standard convolutions with a combination of dilated convolutions and low-pass filtering to achieve structural and scale consistency across resolutions. It utilizes a padding-then-cropping strategy to generate images with arbitrary aspect ratios and introduces FouriScale guidance for enhanced image quality.
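The convolution replacement described above can be sketched in PyTorch. This is a simplified illustration under stated assumptions, not the authors' implementation: the low-pass step is approximated by an ideal box filter in the 2-D Fourier domain, and the dilation rate is tied directly to the up-scaling factor; the function names `lowpass_fft` and `fouriscale_conv` are hypothetical.

```python
import torch
import torch.nn.functional as F

def lowpass_fft(x, cutoff_ratio=0.5):
    """Ideal low-pass filter in the 2-D Fourier domain.

    Keeps only the lowest `cutoff_ratio` fraction of frequencies along
    each spatial axis; a simplified stand-in for the paper's low-pass
    operation, which aims at scale consistency across resolutions.
    """
    fx = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    h, w = x.shape[-2:]
    ch, cw = int(h * cutoff_ratio / 2), int(w * cutoff_ratio / 2)
    mask = torch.zeros(h, w, dtype=torch.bool, device=x.device)
    mask[h // 2 - ch : h // 2 + ch, w // 2 - cw : w // 2 + cw] = True
    fx = fx * mask
    return torch.fft.ifft2(torch.fft.ifftshift(fx, dim=(-2, -1))).real

def fouriscale_conv(x, weight, bias=None, scale=2):
    """Modified convolution sketch: low-pass the feature map, then apply
    the pre-trained kernel with dilation equal to the scale factor, so
    the receptive field grows with resolution (structural consistency).
    Padding is chosen to preserve the spatial size for odd kernels.
    """
    x = lowpass_fft(x, cutoff_ratio=1.0 / scale)
    pad = (weight.shape[-1] // 2) * scale
    return F.conv2d(x, weight, bias, padding=pad, dilation=scale)
```

In a real integration, every `Conv2d` in the UNet would be routed through such a wrapper only when sampling above the training resolution, with the pre-trained weights reused unchanged (the method is training-free).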
Result
FouriScale effectively mitigates pattern repetition and distortions in high-resolution image synthesis, outperforming other training-free methods like Attn-Entro and ScaleCrafter. It exhibits consistent performance across different pre-trained models like SD 1.5, SD 2.1, and SDXL, demonstrating its robustness and generalizability. Quantitative evaluations using FID and KID demonstrate its superior performance over baselines.
Limitations & Future Work
The authors acknowledge that FouriScale encounters limitations in generating ultra-high-resolution images (e.g., 4096×4096), where artifacts may arise. Additionally, its reliance on convolutional operations restricts its application to purely transformer-based diffusion models. Future work may explore extending FouriScale to ultra-high resolutions and adapting it for transformer architectures.
Abstract
In this study, we delve into the generation of high-resolution images from pre-trained diffusion models, addressing persistent challenges, such as repetitive patterns and structural distortions, that emerge when models are applied beyond their trained resolutions. To address this issue, we introduce an innovative, training-free approach, FouriScale, from the perspective of frequency domain analysis. We replace the original convolutional layers in pre-trained diffusion models with a combination of a dilation technique and a low-pass operation, intended to achieve structural consistency and scale consistency across resolutions, respectively. Further enhanced by a padding-then-crop strategy, our method can flexibly handle text-to-image generation of various aspect ratios. By using FouriScale as guidance, our method successfully balances the structural integrity and fidelity of generated images, achieving an astonishing capacity for arbitrary-size, high-resolution, and high-quality generation. With its simplicity and compatibility, our method can provide valuable insights for future explorations into the synthesis of ultra-high-resolution images. The code will be released at https://github.com/LeonHLJ/FouriScale.