ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

Authors: Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, Ying Shan

What

This paper investigates the generation of high-resolution images from pre-trained diffusion models, addressing the issue of object repetition and unreasonable structures often observed in direct high-resolution generation.

Why

This research is significant because it offers a solution to generate high-quality images at resolutions exceeding the training data, crucial for applications demanding large image sizes like advertisements, without requiring extensive retraining or fine-tuning.

How

The authors analyze the structural components of the U-Net in diffusion models and identify the limited receptive field of convolutional kernels as the root cause of object repetition. They propose ‘re-dilation,’ which dynamically enlarges the convolutional receptive field during inference, and combine it with ‘dispersed convolution’ and ‘noise-damped classifier-free guidance’ to enable high-quality generation at ultra-high resolutions.
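Below is a minimal PyTorch sketch of the re-dilation idea for a single nn.Conv2d layer taken from a pre-trained U-Net. The function name `redilate_conv2d` and the fixed `dilation_factor` are illustrative assumptions, not the authors' implementation; the point is only that the pre-trained weights are reused with a larger dilation at inference time, widening the receptive field without any retraining.

```python
# Minimal sketch of re-dilation, assuming a pre-trained nn.Conv2d layer.
# `redilate_conv2d` and `dilation_factor` are hypothetical names for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

def redilate_conv2d(conv: nn.Conv2d, dilation_factor: int):
    """Apply `conv`'s pre-trained weights with an enlarged dilation,
    widening the effective receptive field at inference time without
    changing any parameters."""
    def forward(x: torch.Tensor) -> torch.Tensor:
        # Enlarge padding together with dilation so the spatial size is
        # preserved (odd kernel size, stride 1).
        kh, kw = conv.kernel_size
        pad = (dilation_factor * (kh // 2), dilation_factor * (kw // 2))
        return F.conv2d(
            x,
            conv.weight,
            conv.bias,
            stride=conv.stride,
            padding=pad,
            dilation=(dilation_factor, dilation_factor),
            groups=conv.groups,
        )
    return forward

# Illustrative usage: widen the receptive field of one 3x3 layer by 2x.
conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)
x = torch.randn(1, 4, 128, 128)
y = redilate_conv2d(conv, dilation_factor=2)(x)
assert y.shape == x.shape  # same resolution, larger perception field
```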

Result

The proposed re-dilation method successfully mitigates object repetition issues and outperforms direct inference and attention scaling methods in terms of FID and KID scores across different Stable Diffusion versions and resolutions. The method also demonstrates superior texture detail preservation compared to a pre-trained super-resolution model. Furthermore, the approach generalizes well to text-to-video generation, enabling higher-resolution video synthesis without sacrificing image definition.
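For reference, FID/KID comparisons like the ones above are commonly computed with an Inception-feature metric library; the sketch below uses torchmetrics as an assumption about tooling, with placeholder image tensors and a small `subset_size`, since the paper does not describe its exact evaluation code here.

```python
# Hedged sketch of an FID/KID evaluation using torchmetrics (assumed tooling).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

# Placeholder uint8 images in NCHW format; in practice these would be real
# dataset samples and generated high-resolution samples.
real = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=32)

fid.update(real, real=True)
fid.update(fake, real=False)
kid.update(real, real=True)
kid.update(fake, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```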

Limitations & Future Work

The paper acknowledges that FID and KID cannot reliably measure texture definition, so a user preference study is used for that assessment instead. Future work may explore optimizing the trade-off between image fidelity and denoising capability at ultra-high resolutions. Additionally, investigating the impact of re-dilation on other diffusion-model applications, such as image editing and style transfer, is suggested.

Abstract

In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
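To make the noise-damped classifier-free guidance idea more concrete, here is a hedged sketch under one reading of the abstract: the unconditional noise estimate comes from a model with strong denoising capability, while the guidance direction comes from the dispersed-convolution model. The function name, argument names, and the exact combination are assumptions for illustration; the paper's formulation may differ in detail.

```python
# Hedged sketch of noise-damped classifier-free guidance (names are assumptions).
import torch

def noise_damped_cfg(
    eps_base_uncond: torch.Tensor,  # unconditional prediction, strong-denoising model
    eps_disp_cond: torch.Tensor,    # conditional prediction, dispersed-conv model
    eps_disp_uncond: torch.Tensor,  # unconditional prediction, dispersed-conv model
    guidance_scale: float = 7.5,
) -> torch.Tensor:
    """Keep the noise-removal component from the base model while taking the
    structural guidance direction from the dispersed-convolution model."""
    return eps_base_uncond + guidance_scale * (eps_disp_cond - eps_disp_uncond)
```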