TextCraftor: Your Text Encoder Can be Image Quality Controller

Authors: Yanyu Li, Xian Liu, Anil Kag, Ju Hu, Yerlan Idelbayev, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov, Jian Ren

What

This paper introduces TextCraftor, a method that improves text-to-image diffusion models by fine-tuning their text encoder with reward functions, yielding better image quality and text-image alignment.

Why

Existing text-to-image diffusion models often fail to generate images that accurately reflect the input prompt. TextCraftor offers a more efficient alternative to replacing the entire text encoder with a larger language model, which is computationally expensive, or relying on manual prompt engineering, which requires human effort.

How

The authors propose two techniques: 1) Direct fine-tuning with reward, where a reward model directly scores images predicted from noisy latents and the score is back-propagated into the text encoder; and 2) Prompt-based fine-tuning, which addresses the limitations of the first technique by running the denoising process to obtain a more accurate final image for reward prediction. Reward functions such as aesthetic scores, text-image alignment scores, and CLIP similarity guide the fine-tuning (see the sketch below).
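
To make the first technique concrete, here is a minimal training-loop sketch, not the authors' code: it assumes `text_encoder`, `unet`, `vae`, and `reward_model` are pre-loaded PyTorch modules (e.g., a Stable Diffusion checkpoint plus an aesthetic or CLIP-based scorer), and that `tokenize`, `alphas_cumprod`, and `loader` are placeholders. The one-step estimate of the clean latent is an assumption about how the reward is evaluated from noisy latents.

```python
# Minimal sketch (not the authors' code) of reward-based text-encoder fine-tuning.
# `text_encoder`, `unet`, `vae`, `reward_model`, `tokenize`, `alphas_cumprod`,
# and `loader` are assumed to exist; only the text encoder is trained.
import torch

unet.requires_grad_(False)
vae.requires_grad_(False)
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-6)

for images, prompts in loader:
    text_emb = text_encoder(tokenize(prompts))        # gradients flow into the encoder
    with torch.no_grad():
        x0 = vae.encode(images)                       # clean image latents
    t = torch.randint(0, 1000, (x0.shape[0],))
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)         # cumulative alpha per sample
    noise = torch.randn_like(x0)
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * noise  # forward diffusion

    # One-step estimate of the clean latent from the noise prediction,
    # decoded to pixels and scored by a differentiable reward model.
    noise_pred = unet(x_t, t, text_emb)
    x0_pred = (x_t - (1 - a_t).sqrt() * noise_pred) / a_t.sqrt()
    loss = -reward_model(vae.decode(x0_pred), prompts).mean()  # maximize reward

    optimizer.zero_grad()
    loss.backward()                                   # only the text encoder updates
    optimizer.step()
```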

Result

TextCraftor significantly improves image quality and text-image alignment over baseline models such as SDv1.5 and SDv2.0, and even outperforms larger models like SDXL Base 0.9 and DeepFloyd-XL in some aspects. It achieves better quantitative scores on the Parti-Prompts and HPSv2 benchmarks, and human evaluations confirm the improvement. TextCraftor also enables controllable image generation by interpolating text encoders fine-tuned with different rewards (sketched below).
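
One plausible way to realize such interpolation, shown here as a hypothetical sketch rather than the paper's exact scheme, is to blend the output embeddings of encoders fine-tuned with different rewards; `enc_aesthetic` and `enc_clip` below are assumed placeholder models.

```python
# Hypothetical illustration of controllable generation via text-encoder interpolation.
# Blending output embeddings is one possible realization; the paper's exact scheme may differ.
import torch

def blended_embedding(tokens, encoders, weights):
    """Convex combination of text embeddings from several fine-tuned encoders."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    with torch.no_grad():
        embeddings = [enc(tokens) for enc in encoders]
    return sum(w * e for w, e in zip(weights, embeddings))

# e.g., 70% aesthetic-reward encoder, 30% CLIP-alignment encoder:
# text_emb = blended_embedding(tokens, [enc_aesthetic, enc_clip], [0.7, 0.3])
# then feed `text_emb` into the usual Stable Diffusion sampling loop.
```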

Limitations & Future Work

The authors acknowledge limitations of the reward models and the risk of mode collapse. As future work, they suggest encoding the styles of different reward functions into text-encoder tokens.

Abstract

Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation, enabling significant advancements in areas like image editing and video synthesis. Despite their formidable capabilities, these models are not without their limitations. It is still challenging to synthesize an image that aligns well with the input text, and multiple runs with carefully crafted prompts are required to achieve satisfactory results. To mitigate these limitations, numerous studies have endeavored to fine-tune the pre-trained diffusion models, i.e., UNet, utilizing various technologies. Yet, amidst these efforts, a pivotal question of text-to-image diffusion model training has remained largely unexplored: Is it possible and feasible to fine-tune the text encoder to improve the performance of text-to-image diffusion models? Our findings reveal that, instead of replacing the CLIP text encoder used in Stable Diffusion with other large language models, we can enhance it through our proposed fine-tuning approach, TextCraftor, leading to substantial improvements in quantitative benchmarks and human assessments. Interestingly, our technique also empowers controllable image generation through the interpolation of different text encoders fine-tuned with various rewards. We also demonstrate that TextCraftor is orthogonal to UNet finetuning, and can be combined to further improve generative quality.