The Chosen One: Consistent Characters in Text-to-Image Diffusion Models

Authors: Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, Dani Lischinski

What

The paper proposes a fully automated method for generating consistent characters in different contexts using text-to-image diffusion models, taking only a text prompt as input.

Why

This paper addresses a crucial limitation of current text-to-image models: the inability to generate a consistent character across different scenes, which matters for storytelling, game development, and other creative applications. Unlike existing approaches, which require pre-existing images of the character or labor-intensive manual curation, the proposed method is fully automated.

How

The method iteratively refines a character's representation. In each iteration, it generates a gallery of images from the text prompt, embeds them in a feature space (using DINOv2), clusters the embeddings, and selects the most cohesive cluster. That cluster is used to personalize a text-to-image model (SDXL) via textual inversion and LoRA, yielding a more consistent character representation. The loop repeats until convergence, after which the personalized model can render the character consistently in diverse contexts.
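To make the loop concrete, here is a minimal, hypothetical sketch of one way to structure it, assuming SDXL via diffusers, DINOv2 via transformers, and k-means via scikit-learn. The checkpoints, cluster count, convergence test, and the personalization step (textual inversion + LoRA training) are placeholders, not the authors' implementation.

```python
import numpy as np
import torch
from diffusers import StableDiffusionXLPipeline
from sklearn.cluster import KMeans
from transformers import AutoImageProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image backbone (the paper uses SDXL; this checkpoint is an assumption).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to(device)

# Feature extractor (the paper uses DINOv2; this checkpoint is an assumption).
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino = AutoModel.from_pretrained("facebook/dinov2-base").to(device).eval()


@torch.no_grad()
def embed(images):
    """Map PIL images into DINOv2 feature space (CLS token of the last layer)."""
    inputs = processor(images=images, return_tensors="pt").to(device)
    return dino(**inputs).last_hidden_state[:, 0].cpu().numpy()


def most_cohesive_cluster(features, n_clusters=5):
    """Cluster the embeddings with k-means and return the indices of the
    tightest cluster, i.e. the one with the smallest mean distance to its
    centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    best_members, best_spread = None, np.inf
    for label in range(n_clusters):
        members = np.where(km.labels_ == label)[0]
        if len(members) < 2:
            continue
        spread = np.linalg.norm(
            features[members] - km.cluster_centers_[label], axis=1
        ).mean()
        if spread < best_spread:
            best_members, best_spread = members, spread
    return best_members


def extract_consistent_character(prompt, n_images=32, max_iters=5):
    """One possible shape of the iterative refinement loop (not the paper's code)."""
    for _ in range(max_iters):
        # 1. Generate a gallery of candidates from the (current) model.
        #    In practice the gallery would be generated in smaller batches.
        images = pipe(prompt=prompt, num_images_per_prompt=n_images).images
        # 2. Embed the gallery and keep only the most cohesive cluster.
        features = embed(images)
        cluster_images = [images[i] for i in most_cohesive_cluster(features)]
        # 3. Personalize the model on that cluster.  In the paper this is done
        #    with textual inversion + LoRA on SDXL; the training loop is out of
        #    scope here, so this call is a hypothetical placeholder.
        # personalize(pipe, cluster_images)
        # 4. A convergence check (e.g. cluster tightness below a threshold)
        #    would terminate the loop early; omitted in this sketch.
    return pipe
```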

Result

The method strikes a better balance between prompt adherence and identity consistency than baselines such as Textual Inversion, LoRA DreamBooth, ELITE, BLIP-Diffusion, and IP-Adapter. Both the quantitative analysis and a user study confirm its effectiveness in generating diverse depictions of consistent characters.
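As a rough illustration of these two evaluation axes (not the paper's exact protocol), prompt adherence can be approximated by CLIP text-image similarity and identity consistency by the mean pairwise cosine similarity of identity features; the CLIP checkpoint below and the reuse of DINOv2 features from the earlier sketch are assumptions.

```python
import itertools

import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

# CLIP checkpoint chosen for illustration; the paper may use a different one.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def prompt_adherence(images, prompt):
    """Mean CLIP cosine similarity between each generated image and its prompt."""
    inputs = clip_proc(text=[prompt], images=images, return_tensors="pt", padding=True)
    out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()


def identity_consistency(features):
    """Mean pairwise cosine similarity of identity features (e.g. the DINOv2
    embeddings produced by the `embed` helper in the previous sketch)."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    pairs = itertools.combinations(range(len(feats)), 2)
    return float(np.mean([feats[i] @ feats[j] for i, j in pairs]))
```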

Limitations & Future Work

The authors acknowledge limitations such as occasional inconsistencies in identity, difficulty keeping supporting characters consistent, the potential for spurious attributes, high computational cost, and a tendency to generate simplistic scenes. They suggest future work on mitigating these limitations and on broader applications such as story generation and interactive character design.

Abstract

Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, these models struggle with generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach. Project page is available at https://omriavrahami.com/the-chosen-one