Style Aligned Image Generation via Shared Attention

Authors: Amir Hertz, Andrey Voynov, Shlomi Fruchter, Daniel Cohen-Or

What

This paper introduces StyleAligned, a method for generating sets of images with consistent styles from text prompts using pre-trained text-to-image diffusion models.

Why

Controlling the style of generated images in text-to-image synthesis is a long-standing challenge. Existing methods often require expensive fine-tuning or struggle to maintain consistency across different prompts, whereas StyleAligned achieves style-consistent generation without any training or optimization.

How

StyleAligned works by introducing a shared attention mechanism into the diffusion process. When generating a set of images, each image attends to the features of a reference image, typically the first in the batch, within the self-attention layers of the denoising network. Attention sharing is further stabilized with Adaptive Instance Normalization (AdaIN), which balances the attention flow between an image's own features and the reference features and thereby improves style alignment (see the sketch below).
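A minimal PyTorch sketch of this idea, operating on the (batch, tokens, dim) query/key/value projections of one self-attention layer. The function names (`adain`, `shared_attention`), the choice of batch index 0 as the reference, and the exact placement of AdaIN are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F


def adain(x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Shift/scale x so its per-channel statistics over tokens match those of ref."""
    mu_x, sigma_x = x.mean(dim=1, keepdim=True), x.std(dim=1, keepdim=True) + 1e-6
    mu_r, sigma_r = ref.mean(dim=1, keepdim=True), ref.std(dim=1, keepdim=True) + 1e-6
    return (x - mu_x) / sigma_x * sigma_r + mu_r


def shared_attention(q, k, v, num_heads: int = 8):
    """Self-attention in which every image in the batch also attends to the
    keys/values of a reference image (here: index 0 of the batch).

    q, k, v: (batch, tokens, dim) projections from a diffusion self-attention layer.
    """
    b, t, d = q.shape
    # Reference keys/values, broadcast to every image in the batch.
    k_ref, v_ref = k[:1].expand(b, -1, -1), v[:1].expand(b, -1, -1)

    # AdaIN the queries/keys toward the reference statistics to balance the
    # attention flow between "own" tokens and reference tokens.
    q = adain(q, q[:1])
    k = adain(k, k[:1])

    # Each image attends to its own tokens concatenated with the reference tokens.
    k_full = torch.cat([k, k_ref], dim=1)
    v_full = torch.cat([v, v_ref], dim=1)

    def split(x):  # (b, tokens, dim) -> (b, heads, tokens, dim // heads)
        return x.view(b, -1, num_heads, d // num_heads).transpose(1, 2)

    out = F.scaled_dot_product_attention(split(q), split(k_full), split(v_full))
    return out.transpose(1, 2).reshape(b, t, d)
```

In this simplified form the reference image just attends to its own tokens twice; the essential point is that the target images see the reference keys and values, which pulls their generated styles toward the reference.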

Result

The paper shows that StyleAligned outperforms existing T2I personalization methods, such as StyleDrop and DreamBooth, in terms of style consistency while maintaining good alignment with text prompts. Notably, it generates more coherent sets of images with shared stylistic elements, as evidenced by both qualitative examples and quantitative metrics using CLIP and DINO embeddings. Furthermore, the method is flexible and can be integrated with other diffusion-based techniques like ControlNet and MultiDiffusion, demonstrating its potential for various applications.
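As a rough illustration of the kind of metrics mentioned above, the following sketch computes set-level style consistency as the mean pairwise cosine similarity of DINO image embeddings, and prompt alignment as the mean CLIP image-text similarity. How the embeddings are extracted is omitted, and the exact evaluation protocol in the paper may differ:

```python
import torch
import torch.nn.functional as F


def set_consistency(dino_embeds: torch.Tensor) -> float:
    """Mean pairwise cosine similarity between DINO embeddings of a generated set;
    higher values indicate more shared stylistic features across the images.

    dino_embeds: (n_images, dim) image embeddings, assumed to come from a DINO ViT.
    """
    e = F.normalize(dino_embeds, dim=-1)
    sim = e @ e.T                                     # (n, n) cosine similarities
    n = sim.shape[0]
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]   # drop self-similarities
    return off_diag.mean().item()


def text_alignment(clip_image_embeds: torch.Tensor,
                   clip_text_embeds: torch.Tensor) -> float:
    """Mean cosine similarity between each image and its own prompt in CLIP space."""
    img = F.normalize(clip_image_embeds, dim=-1)
    txt = F.normalize(clip_text_embeds, dim=-1)
    return (img * txt).sum(dim=-1).mean().item()
```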

Limitations and Future Work

The paper acknowledges limitations in controlling the degree of shape and appearance similarity between generated images and highlights the need for improved diffusion inversion techniques. Future work could focus on these aspects and explore the use of StyleAligned for creating large, style-aligned datasets to train novel text-to-image models.

Abstract

Large-scale Text-to-Image (T2I) models have rapidly gained prominence across creative fields, generating visually compelling outputs from textual prompts. However, controlling these models to ensure consistent style remains challenging, with existing methods necessitating fine-tuning and manual intervention to disentangle content and style. In this paper, we introduce StyleAligned, a novel technique designed to establish style alignment among a series of generated images. By employing minimal 'attention sharing' during the diffusion process, our method maintains style consistency across images within T2I models. This approach allows for the creation of style-consistent images using a reference style through a straightforward inversion operation. Our method's evaluation across diverse styles and text prompts demonstrates high-quality synthesis and fidelity, underscoring its efficacy in achieving consistent style across various inputs.