NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation
Authors: Shachar Rosenman, Vasudev Lal, Phillip Howard
What
This paper introduces NeuroPrompts, a novel framework designed to automatically enhance user-provided prompts for text-to-image generation models, leading to higher-quality and more aesthetically pleasing image outputs.
Why
This paper is significant because it addresses the challenge of prompt engineering in text-to-image generation, making these powerful models more accessible to users without specialized expertise by automating the process of crafting effective prompts.
How
The authors developed NeuroPrompts, which uses a two-stage approach: 1) adapting a pre-trained language model (LM) to generate text similar to that produced by human prompt engineers, first through supervised fine-tuning and then through reinforcement learning (PPO) with a reward model based on predicted human preferences (PickScore); and 2) employing NeuroLogic Decoding, a constrained text decoding algorithm, to generate enhanced prompts that satisfy user-specified constraints on style, artist, format, etc., while adhering to the learned prompting style.
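Neither the trained prompt model nor NeuroLogic Decoding ships as an off-the-shelf library, but the two moving parts can be approximated with public components. The sketch below scores candidate images with the released PickScore checkpoint (the preference model used as the PPO reward) and stands in for the constraint-satisfying decoding step with Hugging Face's constrained beam search (`force_words_ids`), which is simpler than NeuroLogic. The model IDs come from the public PickScore release, GPT-2 substitutes for the paper's fine-tuned LM, and the example constraints are illustrative, not the paper's exact setup.

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoProcessor, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Stage 1 reward: PickScore predicts human preference for (prompt, image) pairs.
# Model IDs are from the public PickScore release, not pinned by the paper itself.
ps_processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
ps_model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval().to(device)

@torch.no_grad()
def pickscore_reward(prompt, images):
    """Return one preference score per image; PPO would maximize this reward."""
    image_inputs = ps_processor(images=images, return_tensors="pt").to(device)
    text_inputs = ps_processor(
        text=prompt, padding=True, truncation=True, max_length=77, return_tensors="pt"
    ).to(device)
    image_embs = ps_model.get_image_features(**image_inputs)
    image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
    text_embs = ps_model.get_text_features(**text_inputs)
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
    return (ps_model.logit_scale.exp() * text_embs @ image_embs.T)[0]

# --- Stage 2: decode an enhanced prompt that must contain user-chosen phrases.
# GPT-2 stands in for the SFT+PPO prompt model a real run would load instead.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval().to(device)

def enhance_prompt(user_prompt, constraints, max_new_tokens=40):
    """Constrained beam search forcing each constraint phrase into the output."""
    force_words_ids = [
        tokenizer(phrase, add_special_tokens=False).input_ids for phrase in constraints
    ]
    inputs = tokenizer(user_prompt, return_tensors="pt").to(device)
    out = lm.generate(
        **inputs,
        force_words_ids=force_words_ids,  # beam search must place every phrase
        num_beams=8,
        max_new_tokens=max_new_tokens,
        no_repeat_ngram_size=2,
    )
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(enhance_prompt("a portrait of an astronaut", ["oil painting", "highly detailed"]))
```

During PPO training, `pickscore_reward` applied to images generated from the enhanced prompt would supply the scalar reward; at inference time only the constrained decoding step runs.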
Result
The authors demonstrated that NeuroPrompts consistently produces higher-quality images than unoptimized prompts do, and that it even surpasses human-authored prompts in aesthetic score. An ablation showed that both PPO training and constrained decoding with NeuroLogic contribute to the framework's improved performance.
Limitations & Future Work
The authors acknowledge limitations in evaluating NeuroPrompts solely with Stable Diffusion and recognize the potential for societal biases inherited from the base model. Future work could focus on extending NeuroPrompts to video generation models and other domains requiring automated prompt engineering.
Abstract
Despite impressive recent advances in text-to-image diffusion models, obtaining high-quality images often requires prompt engineering by humans who have developed expertise in using them. In this work, we present NeuroPrompts, an adaptive framework that automatically enhances a user’s prompt to improve the quality of generations produced by text-to-image models. Our framework utilizes constrained text decoding with a pre-trained language model that has been adapted to generate prompts similar to those produced by human prompt engineers. This approach enables higher-quality text-to-image generations and provides user control over stylistic features via constraint set specification. We demonstrate the utility of our framework by creating an interactive application for prompt enhancement and image generation using Stable Diffusion. Additionally, we conduct experiments utilizing a large dataset of human-engineered prompts for text-to-image generation and show that our approach automatically produces enhanced prompts that result in superior image quality. We make our code and a screencast video demo of NeuroPrompts publicly available.
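For concreteness, here is a minimal sketch of the image-generation half of such an application, feeding an already-enhanced prompt to the `diffusers` Stable Diffusion pipeline. The checkpoint ID and prompt text are illustrative assumptions, not the demo's actual configuration.

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint; the paper's demo uses Stable Diffusion but this
# summary does not specify a particular model version.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# An enhanced prompt of the kind NeuroPrompts produces: the user's short input
# padded with style and quality modifiers favored by human prompt engineers.
enhanced_prompt = (
    "a portrait of an astronaut, oil painting, highly detailed, "
    "trending on artstation, dramatic lighting"
)

image = pipe(enhanced_prompt).images[0]
image.save("astronaut.png")
```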