Capability-aware Prompt Reformulation Learning for Text-to-Image Generation

Authors: Jingtao Zhan, Qingyao Ai, Yiqun Liu, Jia Chen, Shaoping Ma

What

This paper presents CAPR, a Capability-aware Prompt Reformulation framework for text-to-image generation that leverages user interaction logs to automatically improve user prompts.

Why

This work addresses the challenge of crafting effective prompts for text-to-image generation systems, a task that is often difficult for average users. It is significant as the first to leverage interaction logs for this purpose, offering a practical way to improve both user experience and generation quality.

How

The authors analyze interaction logs to understand user reformulation patterns and develop CAPR, which comprises a Conditional Reformulation Model (CRM) and Configurable Capability Features (CCF). CRM is trained on logged reformulation pairs, conditioned on CCF that represent the capability of the user behind each pair. During inference, CCF is tuned to steer CRM toward high-quality reformulations, effectively simulating a high-capability user.
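To make the training and inference flow concrete, here is a minimal, hypothetical sketch (not the authors' code): it treats CRM as a seq2seq language model and CCF as a single capability score serialized into the input as a control prefix; the paper's actual model and feature design may differ.

```python
# Hypothetical sketch of CAPR's two components, assuming CRM is a seq2seq LM
# and CCF is a single bucketed capability score (the real CCF may be richer).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def build_input(source_prompt: str, capability: int) -> str:
    # Serialize the capability feature into the input text so the
    # reformulation is conditioned on it.
    return f"capability: {capability} | reformulate: {source_prompt}"

def training_step(source_prompt: str, target_prompt: str, capability: int) -> float:
    # One supervised step on a logged (source -> target) reformulation pair,
    # conditioned on the capability attributed to the user who wrote it.
    enc = tokenizer(build_input(source_prompt, capability), return_tensors="pt")
    labels = tokenizer(target_prompt, return_tensors="pt").input_ids
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

def reformulate(source_prompt: str, capability: int) -> str:
    # At inference, capability is a free knob: a high value asks CRM to
    # imitate the reformulation behavior of high-capability users.
    enc = tokenizer(build_input(source_prompt, capability), return_tensors="pt")
    out = model.generate(**enc, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

The key idea the sketch tries to capture is that a single model learns from all users, while the capability signal disentangles high-quality from low-quality reformulation behavior; a richer feature vector or learned embedding could replace the scalar prefix.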

Result

Experimental results show that CAPR significantly outperforms a range of baselines, including large language models and models trained on synthetic data. It also performs strongly on both seen and unseen text-to-image generation systems, underscoring its effectiveness and robustness.

Limitations & Future Work

The paper acknowledges that finding the optimal CCF configuration can be time-consuming, though this cost can be reduced with techniques such as Bayesian optimization. Future work could explore alternative CCF representations or personalize reformulations to individual users' styles.
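As a rough illustration of that inference-time search, the sketch below (hypothetical, reusing `reformulate` from the earlier snippet) evaluates candidate capability settings on a small set of held-out prompts and keeps the best one; `score_reformulation` stands in for whatever quality proxy is available (e.g., a prompt-image alignment or aesthetic scorer) and is not from the paper.

```python
# Hypothetical CCF search at inference: pick the capability setting whose
# reformulations score best under a user-supplied quality proxy.
def find_best_capability(prompts, candidate_capabilities, score_reformulation):
    best_cap, best_score = None, float("-inf")
    for cap in candidate_capabilities:
        # Average reformulation quality under this capability setting.
        scores = [score_reformulation(p, reformulate(p, cap)) for p in prompts]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_cap, best_score = cap, avg
    return best_cap
```

A Bayesian optimizer (e.g., scikit-optimize) could replace this exhaustive loop to reduce the number of expensive scoring calls, which is the mitigation the paper points to.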

Abstract

Text-to-image generation systems have emerged as revolutionary tools in the realm of artistic creation, offering unprecedented ease in transforming textual prompts into visual art. However, the efficacy of these systems is intricately linked to the quality of user-provided prompts, which often poses a challenge to users unfamiliar with prompt crafting. This paper addresses this challenge by leveraging user reformulation data from interaction logs to develop an automatic prompt reformulation model. Our in-depth analysis of these logs reveals that user prompt reformulation is heavily dependent on the individual user’s capability, resulting in significant variance in the quality of reformulation pairs. To effectively use this data for training, we introduce the Capability-aware Prompt Reformulation (CAPR) framework. CAPR innovatively integrates user capability into the reformulation process through two key components: the Conditional Reformulation Model (CRM) and Configurable Capability Features (CCF). CRM reformulates prompts according to a specified user capability, as represented by CCF. The CCF, in turn, offers the flexibility to tune and guide the CRM’s behavior. This enables CAPR to effectively learn diverse reformulation strategies across various user capacities and to simulate high-capability user reformulation during inference. Extensive experiments on standard text-to-image generation benchmarks showcase CAPR’s superior performance over existing baselines and its remarkable robustness on unseen systems. Furthermore, comprehensive analyses validate the effectiveness of different components. CAPR can facilitate user-friendly interaction with text-to-image systems and make advanced artistic creation more achievable for a broader range of users.