Improving Text-to-Image Consistency via Automatic Prompt Optimization
Authors: Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, Michal Drozdzal
What
This paper introduces OPT2I, an optimization-by-prompting framework for text-to-image (T2I) models that improves prompt-image consistency without requiring parameter updates or training data.
Why
Despite advances in image quality, T2I models often struggle to accurately represent all elements of the input text prompt in the generated image. This paper addresses that challenge by leveraging large language models (LLMs) to iteratively refine user prompts, enhancing the consistency between the text input and the visual output.
How
OPT2I employs an LLM in conjunction with a pre-trained T2I model and a prompt-image consistency metric (either decomposed CLIPScore or the Davidsonian Scene Graph (DSG) score). The LLM receives the initial user prompt and iteratively generates revised prompts, aiming to maximize the consistency score. The framework then feeds the best-performing prompts back to the LLM as in-context examples for subsequent iterations, gradually improving the alignment between the generated images and the user's intent, as in the sketch below.
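To make the loop concrete, here is a minimal Python sketch of the optimization-by-prompting procedure, assuming hypothetical `llm` and `t2i` interfaces and a pluggable `consistency_score` function; all names and hyperparameters here are illustrative assumptions, not the paper's released implementation.

```python
def optimize_prompt(llm, t2i, consistency_score, user_prompt,
                    iterations=30, candidates_per_iter=5,
                    images_per_prompt=4, history_size=5):
    """Sketch of the OPT2I loop: propose revised prompts, score them,
    and feed the top scorers back to the LLM as in-context examples."""

    def evaluate(prompt):
        # Average the consistency metric over several generated images
        # to reduce sampling noise from the T2I model.
        images = [t2i.generate(prompt) for _ in range(images_per_prompt)]
        return sum(consistency_score(prompt, im) for im in images) / len(images)

    history = [(evaluate(user_prompt), user_prompt)]  # (score, prompt) pairs
    for _ in range(iterations):
        # Build a meta-prompt containing the best prompts seen so far,
        # annotated with their scores, as in-context examples.
        best = sorted(history, reverse=True)[:history_size]
        examples = "\n".join(f"{score:.3f}: {prompt}" for score, prompt in best)
        meta_prompt = (
            f"User prompt: {user_prompt}\n"
            f"Revised prompts and their consistency scores:\n{examples}\n"
            f"Write {candidates_per_iter} new prompts that preserve the "
            f"user's intent and achieve a higher consistency score."
        )
        # Hypothetical LLM call returning n candidate prompts.
        for candidate in llm.generate(meta_prompt, n=candidates_per_iter):
            history.append((evaluate(candidate), candidate))
    return max(history)[1]  # highest-scoring revised prompt
```

Exposing the scored history to the LLM, rather than asking for isolated paraphrases, is what lets the optimizer search beyond nearby prompt samples: from the examples, the LLM can infer which kinds of rewrites raised the score and extrapolate.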
Result
Experimental results demonstrate that OPT2I effectively improves prompt-image consistency across various LLMs, T2I models, and datasets (MSCOCO and PartiPrompts). Notably, OPT2I achieves up to a 24.9% improvement in DSG score while preserving image quality (FID) and enhancing image diversity (recall). Qualitative analysis suggests that the optimized prompts tend to emphasize initially overlooked visual elements, either by describing them in more detail or by repositioning them within the prompt.
Limitations & Future Work
The paper acknowledges limitations in existing prompt-image consistency metrics, which may fail to capture complex relationships and can be susceptible to adversarial examples; the authors point to more robust consistency metrics as a direction for future work. Another limitation is the computational cost of the iterative optimization process.
Abstract
Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high-performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.