Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

Authors: Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou

What

This paper introduces Ranni, a text-to-image generation framework that enhances the controllability and accuracy of existing diffusion models by using a ‘semantic panel’ as a structured intermediary representation between text prompts and images.

Why

This paper is important because it tackles a core weakness of current text-to-image models: misinterpreting complex prompts, especially those involving object counts, attribute binding, or multiple subjects. The proposed semantic panel improves text-image alignment and provides a more intuitive handle for editing.

How

The authors propose Ranni, a framework that uses large language models (LLMs) to translate text prompts into a structured ‘semantic panel’ of visual concepts, each carrying attributes such as a bounding box, color, and keypoints. The panel then guides a diffusion model so that the generated image adheres more closely to the input text. The authors also introduce a fully automatic data preparation pipeline for text-to-panel learning and evaluate Ranni on prompts that probe quantity, spatial relationships, attribute binding, and multi-object composition.
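
To make the ‘semantic panel’ concrete, here is a minimal Python sketch of what such a panel could look like, assuming a simple dataclass layout with one entry per visual concept. The field names (description, bbox, color, keypoints) are illustrative and are not Ranni’s exact semantic formatting protocol.

```python
# Illustrative sketch of a "semantic panel": a list of visual concepts,
# each carrying the attributes the paper mentions (description, bounding
# box, color, keypoints). Field names are hypothetical, not Ranni's protocol.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VisualConcept:
    description: str                          # short text, e.g. "a red apple"
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1), normalized to [0, 1]
    color: str = "unspecified"                # dominant color keyword
    keypoints: List[Tuple[float, float]] = field(default_factory=list)

@dataclass
class SemanticPanel:
    prompt: str
    concepts: List[VisualConcept]

# Example panel an LLM might produce for the prompt below.
panel = SemanticPanel(
    prompt="two red apples to the left of a blue cup",
    concepts=[
        VisualConcept("a red apple", (0.05, 0.55, 0.30, 0.85), color="red"),
        VisualConcept("a red apple", (0.30, 0.55, 0.55, 0.85), color="red"),
        VisualConcept("a blue cup",  (0.65, 0.45, 0.95, 0.90), color="blue"),
    ],
)
print(len(panel.concepts), "concepts parsed")
```

A structure like this makes quantity and spatial constraints explicit (two apple entries, boxes on the left), which is what lets the downstream generator follow them more faithfully than free-form text alone.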

Result

Ranni demonstrates superior performance in following complex prompts compared to existing methods, particularly in quantity awareness and spatial relationship understanding. It also shows promise as a unified image creation system, enabling interactive editing through direct manipulation of the semantic panel or LLM-driven instructions in a chat-based interface.
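
Because generation is conditioned on the panel, the editing described above can be viewed as small updates to panel entries rather than prompt rewrites. The sketch below illustrates this with a hypothetical dict-based panel and an edit_concept helper; neither is part of Ranni’s actual interface.

```python
# Hedged sketch of panel-based editing: an instruction like "move the cup to
# the right and make it red" becomes a couple of attribute updates on the
# panel. The dict layout and helper are illustrative only.
import copy

panel = {
    "prompt": "a blue cup next to a green book",
    "concepts": [
        {"description": "a blue cup",   "bbox": [0.10, 0.40, 0.40, 0.90], "color": "blue"},
        {"description": "a green book", "bbox": [0.55, 0.50, 0.95, 0.90], "color": "green"},
    ],
}

def edit_concept(panel, index, **updates):
    """Return a new panel with one concept's attributes overridden."""
    new_panel = copy.deepcopy(panel)
    new_panel["concepts"][index].update(updates)
    return new_panel

edited = edit_concept(panel, 0, bbox=[0.55, 0.40, 0.85, 0.90], color="red")  # move + recolor the cup
edited = edit_concept(edited, 1, bbox=[0.05, 0.50, 0.45, 0.90])              # shift the book left
print(edited["concepts"][0])
```

In a chat-based workflow, the LLM would emit such updates from a natural-language instruction, and the diffusion model would regenerate (or locally refresh) the image from the edited panel.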

Limitations and Future Work

The authors identify limitations such as occasional inaccuracies in the initially generated semantic panel and limited control over object appearance beyond bounding boxes. Future work could improve the precision of the semantic panel, explore alternative LLM architectures, and expand the range of controllable attributes for richer editing.

Abstract

Existing text-to-image (T2I) diffusion models usually struggle in interpreting complex prompts, especially those with quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as the middleware in decoding texts to images, supporting the generator to better follow instructions. The panel is obtained through arranging the visual concepts parsed from the input text by the aid of large language models, and then injected into the denoising network as a detailed control signal to complement the text condition. To facilitate text-to-panel learning, we come up with a carefully designed semantic formatting protocol, accompanied by a fully-automatic data preparation pipeline. Thanks to such a design, our approach, which we call Ranni, manages to enhance a pre-trained T2I generator regarding its textual controllability. More importantly, the introduction of the generative middleware brings a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and further allows users to finely customize their generation, based on which we develop a practical system and showcase its potential in continuous generation and chatting-based editing. Our project page is at https://ranni-t2i.github.io/Ranni.
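
As a rough illustration of how a panel might act as the “detailed control signal” mentioned in the abstract, the sketch below rasterizes each concept’s bounding box into a per-object mask that a denoising network could consume alongside the text condition. This is a deliberately simplified assumption for intuition, not the paper’s actual injection mechanism.

```python
# Turn a panel's boxes into spatial masks (one channel per concept).
# NOT Ranni's real conditioning scheme; just a minimal, runnable illustration.
import numpy as np

def panel_to_masks(concepts, height=64, width=64):
    """Rasterize normalized bounding boxes into binary masks, one per concept."""
    masks = np.zeros((len(concepts), height, width), dtype=np.float32)
    for i, c in enumerate(concepts):
        x0, y0, x1, y1 = c["bbox"]
        r0, r1 = int(y0 * height), int(y1 * height)
        c0, c1 = int(x0 * width), int(x1 * width)
        masks[i, r0:r1, c0:c1] = 1.0
    return masks

concepts = [
    {"description": "a red apple", "bbox": [0.1, 0.5, 0.4, 0.9]},
    {"description": "a blue cup",  "bbox": [0.6, 0.4, 0.9, 0.9]},
]
masks = panel_to_masks(concepts)
print(masks.shape, masks.sum(axis=(1, 2)))  # (2, 64, 64) and per-object areas
```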