Self-correcting LLM-controlled Diffusion Models

Authors: Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell

What

This paper introduces Self-correcting LLM-controlled Diffusion (SLD), a framework that improves text-to-image generation by using an LLM and an open-vocabulary object detector to iteratively identify and correct inaccuracies in images produced by diffusion models.

Why

This paper is important because it addresses a key limitation of current text-to-image diffusion models, which often struggle to accurately interpret and follow complex prompts. SLD provides a training-free method to improve the alignment between generated images and text prompts, leading to more accurate and reliable text-to-image generation.

How

SLD employs a closed-loop approach. First, an image is generated from the input prompt using an off-the-shelf diffusion model. Next, an LLM parser extracts the key objects from the prompt, and an open-vocabulary object detector locates them in the image. An LLM controller then compares the detected objects against the prompt and suggests corrections (addition, deletion, repositioning, attribute modification). Finally, these corrections are applied in the latent space of the diffusion model to produce a corrected image. The process can be repeated iteratively until the image matches the prompt, as sketched below.
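The following is a minimal sketch of the closed-loop structure only, not the authors' actual code: the callables passed in (generate, parse_objects, detect, propose_corrections, apply_corrections) are hypothetical placeholders standing in for the diffusion model, the LLM parser, the open-vocabulary detector, the LLM controller, and the latent-space correction operations described above.

```python
from typing import Any, Callable, Dict, List

Image = Any  # placeholder type for a generated image or its latent


def self_correcting_generation(
    prompt: str,
    generate: Callable[[str], Image],                    # off-the-shelf diffusion model
    parse_objects: Callable[[str], List[Dict]],          # LLM parser: key objects + attributes
    detect: Callable[[Image, List[Dict]], List[Dict]],   # open-vocabulary object detector
    propose_corrections: Callable[[str, List[Dict]], List[Dict]],  # LLM controller
    apply_corrections: Callable[[Image, List[Dict]], Image],       # latent-space edits
    max_rounds: int = 3,
) -> Image:
    """Generate an image, then iteratively detect and fix prompt mismatches."""
    image = generate(prompt)
    for _ in range(max_rounds):
        targets = parse_objects(prompt)
        detections = detect(image, targets)
        corrections = propose_corrections(prompt, detections)
        if not corrections:  # image already aligns with the prompt
            break
        # Each correction is an add / delete / reposition / attribute-change
        # operation, applied in the diffusion model's latent space.
        image = apply_corrections(image, corrections)
    return image
```

The loop terminates early once the LLM controller finds nothing to fix, so well-aligned first-pass generations incur no extra correction rounds.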

Result

SLD markedly improves image generation accuracy, particularly for generative numeracy, attribute binding, and spatial relationships. It outperforms existing methods on the LMD benchmark and yields further gains when applied to API-only models such as DALL-E 3. Additionally, SLD can be adapted to image editing tasks, enabling fine-grained control over object manipulation.
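As a purely illustrative, hypothetical example of the editing adaptation mentioned above (the wording is not from the paper): the same correction loop can be driven by a user edit request instead of a prompt-alignment check, so only the instruction handed to the LLM controller changes.

```python
# Hypothetical instructions to the LLM controller; not the authors' prompts.
generation_check = (
    "Compare the detected objects with the prompt 'two cats on a red sofa' "
    "and list any corrections needed."
)
editing_instruction = (
    "The user wants the left cat moved next to the window and the sofa "
    "recolored blue. List the object-level edits required."
)
```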

Limitations and Future Work

One limitation is difficulty in handling objects with complex shapes, owing to the object segmentation module. Future work could explore better region selection methods to improve generation and editing quality. The authors also suggest integrating advanced LMMs for more streamlined image assessment and editing.

Abstract

Text-to-image generation has witnessed significant progress with the advent of diffusion models. Despite the ability to generate photorealistic images, current text-to-image diffusion models still often struggle to accurately interpret and follow complex input text prompts. In contrast to existing models that aim to generate images only with their best effort, we introduce Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image. Steered by an LLM controller, SLD turns text-to-image generation into an iterative closed-loop process, ensuring correctness in the resulting image. SLD is not only training-free but can also be seamlessly integrated with diffusion models behind API access, such as DALL-E 3, to further boost the performance of state-of-the-art diffusion models. Experimental results show that our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships. Furthermore, by simply adjusting the instructions to the LLM, SLD can perform image editing tasks, bridging the gap between text-to-image generation and image editing pipelines. We will make our code available for future research and applications.