Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

Authors: Zhuoyi Yang, Heyang Jiang, Wenyi Hong, Jiayan Teng, Wendi Zheng, Yuxiao Dong, Ming Ding, Jie Tang

What

This paper introduces Inf-DiT, a memory-efficient diffusion transformer for upsampling images to ultra-high resolutions. Its core contribution is a novel Unidirectional Block Attention (UniBA) mechanism that processes images in smaller blocks, significantly reducing memory requirements.

Why

This work addresses the critical limitation of existing diffusion models in generating ultra-high-resolution images due to quadratic memory scaling. Inf-DiT offers a solution by enabling the generation of images at resolutions exceeding 4096x4096 pixels, which was previously infeasible due to memory constraints, opening possibilities for various applications requiring high-fidelity visuals.
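The quadratic scaling can be made concrete with a back-of-the-envelope count: an H x W image patchified with patch size p yields N = (H/p)(W/p) tokens, and full self-attention materializes an N x N score matrix. The sketch below (patch size 16 is an illustrative assumption, not a value taken from the paper) shows how quadrupling the resolution multiplies the attention-score memory by 256x:

```python
def attention_tokens(height, width, patch=16):
    """Number of transformer tokens after patchifying an image.

    `patch=16` is an assumed patch size for illustration only.
    """
    return (height // patch) * (width // patch)

def attention_score_entries(height, width, patch=16):
    """Entries in one full self-attention score matrix: O(N^2) in tokens."""
    n = attention_tokens(height, width, patch)
    return n * n

# Going from 1024x1024 to 4096x4096 multiplies tokens by 16,
# but attention-score memory by 16^2 = 256.
ratio = attention_score_entries(4096, 4096) / attention_score_entries(1024, 1024)
print(ratio)  # 256.0
```

This is why block-wise processing, which bounds the number of tokens attended to at once, is the lever Inf-DiT pulls.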

How

The authors propose UniBA, which divides images into blocks and processes them sequentially in batches, minimizing the number of hidden states in memory at any given time. Inf-DiT incorporates this mechanism into a diffusion transformer architecture, utilizing techniques like global CLIP image embedding for semantic consistency and nearby LR cross-attention for local detail preservation. Trained on a dataset of high-resolution images and evaluated on benchmarks like HPDv2 and DIV2K, Inf-DiT demonstrates superior performance in image upsampling and super-resolution tasks.
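The unidirectional dependency pattern can be sketched in a toy form: each block attends only to itself and its already-computed left, top-left, and top neighbors, so a row-by-row sweep needs to keep just one previous row of hidden states in memory. This is a simplified illustration of the dependency structure, not the authors' implementation; `block_fn` is a hypothetical stand-in for the actual transformer layers:

```python
def uniba_process(image_blocks, block_fn):
    """Process a 2D grid of blocks row by row with unidirectional dependencies.

    Each block may depend only on its left, top-left, and top neighbors,
    so only the previous row's hidden states must stay in memory --
    the mechanism behind UniBA's bounded memory footprint (toy sketch).
    """
    prev_row = None  # cached hidden states of the row above
    out = []
    for row in image_blocks:
        cur_row = []
        for j, block in enumerate(row):
            neighbors = []
            if j > 0:
                neighbors.append(cur_row[j - 1])       # left neighbor
            if prev_row is not None:
                if j > 0:
                    neighbors.append(prev_row[j - 1])  # top-left neighbor
                neighbors.append(prev_row[j])          # top neighbor
            cur_row.append(block_fn(block, neighbors))
        out.append(cur_row)
        prev_row = cur_row  # rows older than one step are discarded
    return out

# Toy usage: blocks are scalars, block_fn sums a block with its neighbors.
blocks = [[0.0, 1.0], [2.0, 3.0]]
result = uniba_process(blocks, lambda b, ns: b + sum(ns))
print(result)  # [[0.0, 1.0], [2.0, 6.0]]
```

Because the dependency graph is acyclic and points only "up and left," blocks in a batch can also be generated in parallel along anti-diagonals, which is what makes sequential batching compatible with diffusion sampling.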

Result

Inf-DiT achieves state-of-the-art performance on ultra-high resolution image generation (up to 4096x4096) as measured by FID and FIDcrop metrics, outperforming baselines like SDXL, MultiDiffusion, and DemoFusion. It also excels in classic super-resolution benchmarks on the DIV2K dataset, surpassing models like BSRGAN and StableSR. Human evaluations confirm Inf-DiT’s superiority in detail authenticity, global coherence, and consistency with low-resolution inputs. Notably, it maintains a low memory footprint, approximately 5 times lower than SDXL when generating 4096x4096 images.

Limitations & Future Work

The authors acknowledge limitations in iterative upsampling, where errors from earlier stages can propagate and be difficult to correct in later stages. Future work could explore techniques for error correction and improved handling of long-range dependencies during iterative upsampling. Additionally, investigating the application of UniBA to other diffusion-based tasks beyond image generation could be a promising direction.

Abstract

Diffusion models have shown remarkable performance in image generation in recent years. However, due to a quadratic increase in memory when generating ultra-high-resolution images (e.g. 4096x4096), the resolution of generated images is often limited to 1024x1024. In this work, we propose a unidirectional block attention mechanism that can adaptively adjust the memory overhead during the inference process and handle global dependencies. Building on this module, we adopt the DiT structure for upsampling and develop an infinite super-resolution model capable of upsampling images of various shapes and resolutions. Comprehensive experiments show that our model achieves SOTA performance in generating ultra-high-resolution images in both machine and human evaluation. Compared to commonly used UNet structures, our model can save more than 5x memory when generating 4096x4096 images. The project URL is https://github.com/THUDM/Inf-DiT.