GLoD: Composing Global Contexts and Local Details in Image Generation

Authors: Moyuru Yamada

What

This paper introduces Global-Local Diffusion (GLoD), a novel framework for controllable text-to-image generation with diffusion models that enables simultaneous control over global contexts (e.g., object layouts and interactions) and local details (e.g., colors and emotions) by composing multiple global and local prompts.

Why

This paper addresses the limitation of existing diffusion-based text-to-image generation methods that struggle to simultaneously control both global and local aspects of the generated image. GLoD offers a training-free approach for more complex and controllable image synthesis, which is crucial for real-world applications.

How

GLoD leverages pre-trained diffusion models through a novel layer composition approach. It takes global and local prompts as input, generates a separate noise prediction for each prompt, and then composes these predictions using global and local guidance mechanisms. This lets the model incorporate both the global context and the local details into the generated image.
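A minimal sketch of this composition step, assuming a classifier-free-guidance-style combination of noise predictions plus per-object layer masks; the function name, guidance weights, and mask format below are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch only (not the official GLoD code). Assumes each prompt
# has already been passed through a pre-trained diffusion U-Net to obtain a
# noise prediction, and that each local prompt comes with a layer mask
# selecting its object region.
import torch

def compose_noise(eps_uncond: torch.Tensor,
                  eps_global: torch.Tensor,
                  eps_locals: list[torch.Tensor],
                  masks: list[torch.Tensor],
                  w_global: float = 7.5,
                  w_local: float = 5.0) -> torch.Tensor:
    """Combine unconditional, global, and local noise predictions.

    Shapes: eps_* are (B, C, H, W); each mask is (B, 1, H, W) with values
    in [0, 1] marking the region governed by the corresponding local prompt.
    """
    # Global guidance steers the whole image toward the global prompt
    # (object layout and interactions).
    eps = eps_uncond + w_global * (eps_global - eps_uncond)
    # Local guidance is applied only inside each object's layer mask,
    # steering that region toward its local prompt (e.g., color, emotion)
    # while leaving unspecified objects untouched.
    for eps_local, mask in zip(eps_locals, masks):
        eps = eps + w_local * mask * (eps_local - eps_global)
    return eps
```

The composed prediction would then replace the single-prompt noise estimate inside a standard denoising loop, which is why no training or fine-tuning of the diffusion model is required.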

Result

GLoD demonstrates superior performance in generating complex images that adhere to both the global contexts and the local details specified by the user. Quantitative evaluation shows higher alignment scores for both global and local attributes than existing methods, indicating better controllability. GLoD also reduces undesirable attribute interference between objects in a scene.

Limitations and Future Work

One limitation identified is the potential for partial object appearance changes when the latent representation of the object differs significantly between the global and local prompts. Future work could explore techniques to mitigate this issue. Additionally, expanding the framework to handle more complex relationships between objects and exploring its application to other domains like video or 3D object generation are promising directions.

Abstract

Diffusion models have demonstrated their capability to synthesize high-quality and diverse images from textual prompts. However, simultaneous control over both global contexts (e.g., object layouts and interactions) and local details (e.g., colors and emotions) still remains a significant challenge. The models often fail to understand complex descriptions involving multiple objects and reflect specified visual attributes to the wrong targets or ignore them. This paper presents Global-Local Diffusion (GLoD), a novel framework which allows simultaneous control over the global contexts and the local details in text-to-image generation without requiring training or fine-tuning. It assigns multiple global and local prompts to corresponding layers and composes their noises to guide a denoising process using pre-trained diffusion models. Our framework enables complex global-local compositions, conditioning objects in the global prompt with the local prompts while preserving other unspecified identities. Our quantitative and qualitative evaluations demonstrate that GLoD effectively generates complex images that adhere to both user-provided object interactions and object details.
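Continuing the hedged compose_noise sketch above, the following usage example shows how multiple prompts and layer masks might be wired together at a single denoising step; the fake_unet stand-in, embedding shapes, timestep, and mask layout are all illustrative assumptions rather than the paper's actual pipeline.

```python
# Usage sketch for compose_noise (defined above). Everything here is a
# placeholder: a real setup would encode prompts with a text encoder and
# call a pre-trained diffusion U-Net instead of fake_unet.
import torch

def fake_unet(x_t, t, prompt_emb):
    # Stand-in for a pre-trained U-Net's noise prediction.
    return torch.randn_like(x_t)

B, C, H, W = 1, 4, 64, 64
x_t = torch.randn(B, C, H, W)            # current noisy latent
uncond = torch.zeros(1, 77, 768)         # hypothetical embedding shapes
global_emb = torch.randn(1, 77, 768)     # e.g., "a dog is chasing a cat"
local_embs = [torch.randn(1, 77, 768)]   # e.g., ["a black dog"]
mask = torch.zeros(B, 1, H, W)
mask[..., :, :32] = 1.0                  # left half of the image = the dog's layer

eps = compose_noise(
    fake_unet(x_t, 50, uncond),
    fake_unet(x_t, 50, global_emb),
    [fake_unet(x_t, 50, e) for e in local_embs],
    [mask],
)
# A sampler (e.g., DDIM) would then use `eps` to step x_t toward x_{t-1}.
```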