HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation

Authors: Yifan Zhang, Bryan Hooi

What

This paper introduces High-frequency-Promoting Adaptation (HiPA), a parameter-efficient method that accelerates text-to-image diffusion models to high-quality one-step generation by training low-rank adaptors that enhance the model’s ability to produce high-frequency details.
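To make the "low-rank adaptor" idea concrete, here is a rough PyTorch sketch of a generic LoRA-style layer wrapped around a frozen linear projection. This is an illustration of the general technique, not the paper's exact implementation; the rank and scaling values are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal low-rank adaptor around a frozen pretrained linear layer.

    Only the rank-r factors A and B are trained, which is why this style of
    adaptation needs a tiny fraction of the original parameter count.
    Rank and scale below are illustrative, not the paper's settings.
    """
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = scale

    def forward(self, x):
        # Frozen path plus trainable low-rank update: W x + scale * B A x
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

Swapping such adaptors into the attention/projection layers of a pretrained diffusion model lets the backbone stay untouched while only the small adaptor matrices are optimized.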

Why

This paper is important because it addresses a key limitation of existing text-to-image diffusion models: they require many diffusion steps and therefore long sampling times. HiPA offers a route to real-time applications that depend on fast, high-quality image generation.

How

The authors analyze the image generation process of existing diffusion models and find that one-step diffusion lacks high-frequency details crucial for realistic image synthesis. They then propose HiPA, which trains low-rank adaptors using a novel adaptation loss that combines a spatial perceptual loss and a high-frequency promoted loss. This approach encourages the model to generate images with enhanced high-frequency details in just one step.
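As a minimal sketch of how such an adaptation loss could look, the code below combines an externally supplied perceptual distance (e.g., LPIPS) with an FFT-based high-frequency term. The high-pass cutoff, the L1 distance, and the weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def high_pass(x, cutoff_ratio=0.25):
    """Keep only high-frequency content of an image batch via a centered FFT mask.

    x: (B, C, H, W) tensor. cutoff_ratio sets the size of the low-frequency
    region that is zeroed out (an illustrative choice of filter).
    """
    B, C, H, W = x.shape
    freq = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.arange(H, device=x.device),
        torch.arange(W, device=x.device),
        indexing="ij",
    )
    cy, cx = H // 2, W // 2
    ry, rx = int(H * cutoff_ratio / 2), int(W * cutoff_ratio / 2)
    low = (yy - cy).abs().le(ry) & (xx - cx).abs().le(rx)  # low-frequency box
    freq = freq.masked_fill(low, 0)                        # drop low frequencies
    return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1)), norm="ortho").real


def adaptation_loss(one_step_img, target_img, perceptual_fn, hf_weight=1.0):
    """Hypothetical combination of a spatial perceptual loss and a
    high-frequency-promoting term, in the spirit of HiPA's adaptation loss."""
    spatial = perceptual_fn(one_step_img, target_img)
    hf = F.l1_loss(high_pass(one_step_img), high_pass(target_img))
    return spatial + hf_weight * hf
```

The intuition is that penalizing differences specifically in the high-frequency band pushes the one-step output toward the sharp textures and edges that one-step sampling otherwise misses.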

Result

HiPA significantly outperforms previous one-step and few-step methods in both image quality and training efficiency. Experiments on the MS-COCO benchmarks show that HiPA achieves results comparable to multi-step diffusion models while being significantly faster. The method is also successfully applied to text-guided image editing, inpainting, and super-resolution, demonstrating its versatility for real-world image generation tasks.

Limitations & Future Work

The authors acknowledge that while HiPA significantly improves one-step generation, there is still room for further enhancement in image quality compared to multi-step diffusion models. They suggest exploring the adaptation of more advanced diffusion models, such as SD-XL and DALL-E 3, as a future direction. Another limitation is the occasional presence of artifacts in the generated images, which the authors attribute, in part, to limitations inherited from the original multi-step models. As a potential solution, they propose using HiPA to generate quick drafts and then refining them with the original multi-step model for higher quality.

Abstract

Diffusion models have revolutionized text-to-image generation, but their real-world applications are hampered by the extensive time needed for hundreds of diffusion steps. Although progressive distillation has been proposed to speed up diffusion sampling to 2-8 steps, it still falls short in one-step generation, and necessitates training multiple student models, which is highly parameter-extensive and time-consuming. To overcome these limitations, we introduce High-frequency-Promoting Adaptation (HiPA), a parameter-efficient approach to enable one-step text-to-image diffusion. Grounded in the insight that high-frequency information is essential but highly lacking in one-step diffusion, HiPA focuses on training one-step, low-rank adaptors to specifically enhance the under-represented high-frequency abilities of advanced diffusion models. The learned adaptors empower these diffusion models to generate high-quality images in just a single step. Compared with progressive distillation, HiPA achieves much better performance in one-step text-to-image generation (37.3 → 23.8 in FID-5k on MS-COCO 2017) and 28.6x training speed-up (108.8 → 3.8 A100 GPU days), requiring only 0.04% training parameters (7,740 million → 3.3 million). We also demonstrate HiPA’s effectiveness in text-guided image editing, inpainting and super-resolution tasks, where our adapted models consistently deliver high-quality outputs in just one diffusion step. The source code will be released.