Dynamic Prompt Optimizing for Text-to-Image Generation
Authors: Wenyi Mo, Tianyu Zhang, Yalong Bai, Bing Su, Ji-Rong Wen, Qing Yang
What
This paper introduces PAE, a novel two-stage framework employing reinforcement learning to automatically edit and refine text prompts for diffusion-based text-to-image synthesis, enhancing both image quality and alignment with user intent.
Why
This work addresses the challenge of manual prompt engineering in text-to-image generation. It enables fine-grained control over image generation by dynamically adjusting word importance and injection time steps in the diffusion process, leading to higher-quality images that better reflect user preferences.
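To make the idea of "word importance and injection time steps" concrete, here is a minimal sketch of how a dynamic fine-control prompt could be represented. The field names and the `(weight, t_start, t_end)` scheme are illustrative assumptions, not the paper's exact syntax:

```python
# Illustrative only: a hypothetical representation of a fine-control prompt,
# where each modifier carries an importance weight and the diffusion
# time-step range during which it is injected. Names are assumptions,
# not PAE's actual format.
from dataclasses import dataclass

@dataclass
class Modifier:
    token: str      # modifier word, e.g. "cinematic lighting"
    weight: float   # importance weight applied to its text embedding
    t_start: float  # fraction of denoising steps where injection begins
    t_end: float    # fraction of denoising steps where injection ends

prompt = "a castle on a hill"
modifiers = [
    Modifier("oil painting", weight=1.2, t_start=0.0, t_end=1.0),
    Modifier("intricate detail", weight=0.8, t_start=0.5, t_end=1.0),
]

def active_modifiers(mods: list[Modifier], step: int, total: int) -> list[Modifier]:
    """Return the modifiers injected at a given denoising step."""
    frac = step / total
    return [m for m in mods if m.t_start <= frac <= m.t_end]
```

Under this scheme, "intricate detail" would only influence the second half of the denoising trajectory, while "oil painting" acts throughout.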
How
The authors propose a two-stage training process: 1) Fine-tuning a language model on a curated text-image dataset to refine initial prompts. 2) Using online reinforcement learning to optimize a policy model, which learns to add modifiers with specific effect ranges and weights to the refined prompts, guided by a reward function that considers aesthetic quality, semantic consistency, and user preference.
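The reward described above combines three signals. A hedged sketch of such a composite reward is shown below; the scorer callables and the mixing weights (alpha, beta, gamma) are placeholders rather than the paper's exact formulation:

```python
# A sketch of a composite reward balancing aesthetics, semantic
# consistency, and user preference. The scorers and weights are
# placeholders, not PAE's published formulation.
def reward(image, original_prompt, edited_prompt,
           aesthetic_scorer, clip_scorer, pick_scorer,
           alpha=1.0, beta=1.0, gamma=1.0):
    aesthetic = aesthetic_scorer(image)                # visual appeal
    consistency = clip_scorer(image, original_prompt)  # alignment with original intent
    preference = pick_scorer(image, edited_prompt)     # human-preference proxy
    return alpha * aesthetic + beta * consistency + gamma * preference
```

Scoring semantic consistency against the original prompt, rather than the edited one, is what discourages the policy from drifting away from the user's intent while it adds modifiers.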
Result
PAE generates higher-quality images compared to using short prompts or prompts generated by other methods, evidenced by improved aesthetic scores, CLIP scores, and PickScores. The method demonstrates robust performance on both in-domain and out-of-domain datasets, highlighting its versatility and generalization ability. The learned policy model exhibits a preference for adding modifiers related to art trends, styles, and textures, leading to more visually appealing results without significantly altering the prompt’s original meaning.
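For readers wanting to reproduce the CLIP-score part of this comparison, the following is a minimal sketch using the Hugging Face `transformers` CLIP checkpoint named below; the aesthetic and PickScore metrics require their own models and are omitted:

```python
# Minimal CLIP-score sketch: cosine similarity between image and
# prompt embeddings from an off-the-shelf CLIP model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```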
Limitations & Future Work
The authors acknowledge limitations regarding attribute leakage and missing objects, and suggest incorporating control attention maps into the action space for finer control over the generation process as future work. Further improvements could integrate additional reward terms, such as high resolution and proportional composition, to enhance image quality and realism. The paper also suggests exploring techniques for consistent character generation, building on the model's capability to maintain identity consistency.
Abstract
Text-to-image generative models, specifically those based on diffusion models like Imagen and Stable Diffusion, have made substantial advancements. Recently, there has been a surge of interest in the delicate refinement of text prompts. Users assign weights or alter the injection time steps of certain words in the text prompts to improve the quality of generated images. However, the success of fine-control prompts depends on the accuracy of the text prompts and the careful selection of weights and time steps, which requires significant manual intervention. To address this, we introduce the Prompt Auto-Editing (PAE) method. Besides refining the original prompts for image generation, we further employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to dynamic fine-control prompts. The reward function during training encourages the model to consider aesthetic score, semantic consistency, and user preferences. Experimental results demonstrate that our proposed method effectively improves the original prompts, generating visually more appealing images while maintaining semantic alignment. Code is available at https://github.com/Mowenyii/PAE.