Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

Authors: Chaofeng Chen, Annan Wang, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, Weisi Lin

What

This paper introduces TexForce, a novel method to improve text-to-image diffusion models by fine-tuning the text encoder using reinforcement learning with low-rank adaptation (LoRA) and task-specific rewards, leading to better text-image alignment and higher visual quality.

Why

This paper addresses the limitation of previous diffusion model fine-tuning methods that solely focus on the U-Net, neglecting the importance of the text encoder. It demonstrates that fine-tuning the text encoder is crucial for aligning generated images with text prompts, especially with limited training data, and shows its efficacy across different tasks and backbones.

How

The authors propose TexForce, which uses reinforcement learning, specifically the DDPO algorithm, to update the text encoder so that task-specific rewards on the generated images are maximized. They adopt LoRA for parameter-efficient fine-tuning and demonstrate its flexibility by combining LoRA weights trained on different tasks. Experiments cover various prompt datasets, reward functions (ImageReward, HPSv2, face quality, hand-detection confidence), and diffusion model backbones (SDv1.4, SDv1.5, SDv2.1).
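To make the recipe concrete, below is a minimal sketch of a TexForce-style update loop, assuming the Hugging Face diffusers and peft libraries. It attaches LoRA adapters to the CLIP text encoder, freezes the U-Net and VAE, and applies a REINFORCE-style policy gradient over stored denoising transitions. The model ID, the fixed per-step noise scale SIGMA, the placeholder reward_fn, the encode helper, and the omission of classifier-free guidance and per-timestep minibatching are all simplifications for illustration, not the authors' implementation.

```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline
from peft import LoraConfig, get_peft_model

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Attach LoRA only to the text encoder's attention projections;
# the U-Net and VAE stay frozen throughout training.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
pipe.text_encoder = get_peft_model(pipe.text_encoder, lora_cfg)
pipe.unet.requires_grad_(False)
pipe.vae.requires_grad_(False)

optimizer = torch.optim.AdamW(
    [p for p in pipe.text_encoder.parameters() if p.requires_grad], lr=1e-5
)

SIGMA = 0.1  # fixed per-step noise scale (a simplification; DDPO uses the sampler's own variance)


def encode(prompts):
    """Text embeddings from the (LoRA-adapted) text encoder."""
    tokens = pipe.tokenizer(
        prompts, padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).to(device)
    return pipe.text_encoder(tokens.input_ids)[0]


@torch.no_grad()
def rollout(prompts, num_steps=20):
    """Sample images with the current policy, storing every (x_t, t, x_{t-1}) transition."""
    emb = encode(prompts)
    pipe.scheduler.set_timesteps(num_steps, device=device)
    latents = torch.randn(len(prompts), 4, 64, 64, device=device)
    traj = []
    for t in pipe.scheduler.timesteps:
        eps = pipe.unet(latents, t, encoder_hidden_states=emb).sample
        mean = pipe.scheduler.step(eps, t, latents).prev_sample
        prev = mean + SIGMA * torch.randn_like(mean)  # stochastic policy
        traj.append((latents, t, prev))
        latents = prev
    images = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
    return images, traj


def reward_fn(images, prompts):
    """Placeholder for a task-specific reward such as ImageReward, HPSv2,
    face quality, or hand-detection confidence."""
    return torch.zeros(len(prompts), device=device)


prompts = ["a photo of an astronaut riding a horse"]
for step in range(1000):
    images, traj = rollout(prompts)
    advantage = reward_fn(images, prompts)
    advantage = advantage - advantage.mean()

    # Recompute transition log-probs WITH gradients so the policy-gradient
    # signal flows only into the LoRA parameters of the text encoder.
    # (In practice a random subset of timesteps per update keeps memory manageable.)
    emb = encode(prompts)
    loss = 0.0
    for latents_t, t, latents_prev in traj:
        eps = pipe.unet(latents_t, t, encoder_hidden_states=emb).sample
        mean = pipe.scheduler.step(eps, t, latents_t).prev_sample
        log_prob = torch.distributions.Normal(mean, SIGMA).log_prob(latents_prev)
        loss = loss - (advantage * log_prob.sum(dim=(1, 2, 3))).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because only the LoRA parameters of the text encoder receive gradients, the frozen U-Net acts purely as part of the environment, which is what allows the resulting adapter to be paired later with separately fine-tuned U-Nets.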

Result

TexForce significantly enhances text-image alignment and visual quality across various tasks, outperforming existing methods such as DPOK, ReFL, and AlignProp. It performs robustly across different backbones and can be combined with U-Net fine-tuning for further gains. GPT-4V evaluation confirms its effectiveness in both aesthetics and text coherence. Furthermore, fusing LoRA weights from different tasks enables enhancement of specific objects within generated images.
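The weight-fusion result can be illustrated with a short sketch that linearly interpolates two task-specific LoRA state dicts and loads the mixture into the peft-wrapped text encoder from the training sketch above. The checkpoint file names and the mixing coefficient alpha are hypothetical, and interpolating the adapter parameters directly is only one simple fusion scheme; the paper's exact fusion rule may differ.

```python
import torch

# Hypothetical checkpoints: task-specific LoRA weights for the text encoder,
# e.g. one trained with a face-quality reward and one with HPSv2.
lora_face = torch.load("texforce_face_lora.pt", map_location="cpu")
lora_hpsv2 = torch.load("texforce_hpsv2_lora.pt", map_location="cpu")

alpha = 0.5  # mixing coefficient between the two adapters (illustrative)
fused = {k: alpha * lora_face[k] + (1 - alpha) * lora_hpsv2[k] for k in lora_face}

# Load the fused adapter into the peft-wrapped text encoder from the sketch above;
# strict=False leaves all non-LoRA weights untouched. An alternative is to merge
# each adapter's low-rank delta into the base weights with the chosen coefficients.
pipe.text_encoder.load_state_dict(fused, strict=False)
```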

Limitations & Future Work

The authors acknowledge the limited sample efficiency and the complexity of reward-function engineering inherent to RL-based methods. They also raise concerns about potential misuse for misinformation and intellectual-property infringement. Future work could address these limitations and explore broader applications of TexForce.

Abstract

Text-to-image diffusion models are typically trained to optimize the log-likelihood objective, which presents challenges in meeting specific requirements for downstream tasks, such as image aesthetics and image-text alignment. Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation. However, many of these approaches overlook the importance of the text encoder, which is typically pretrained and kept fixed during training. In this paper, we demonstrate that by fine-tuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results, thereby improving visual quality. Our primary motivation comes from the observation that the current text encoder is suboptimal, often requiring careful prompt adjustment. While fine-tuning the U-Net can partially improve performance, it still suffers from the suboptimal text encoder. Therefore, we propose to use reinforcement learning with low-rank adaptation to fine-tune the text encoder based on task-specific rewards, referred to as TexForce. We first show that fine-tuning the text encoder can improve the performance of diffusion models. Then, we illustrate that TexForce can simply be combined with existing U-Net fine-tuned models to obtain much better results without additional training. Finally, we showcase the adaptability of our method in diverse applications, including the generation of high-quality face and hand images.