Compositional Text-to-Image Generation with Dense Blob Representations
Authors: Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat
What
This paper introduces BlobGEN, a text-to-image generation model that uses dense blob representations as grounding input to improve controllability and compositionality.
Why
This paper addresses the limitations of existing text-to-image models in following complex prompts and offers a modular, user-friendly approach to control image generation by decomposing scenes into semantically rich visual primitives.
How
The authors propose dense blob representations, consisting of blob parameters (specifying location, size, and orientation) and blob descriptions (free-form text describing appearance), both extracted with existing segmentation and captioning models. They develop a blob-grounded diffusion model with a novel masked cross-attention module that aligns each blob with its corresponding visual features. Additionally, they introduce an in-context learning approach that prompts LLMs to generate blob representations from text, enabling compositional generation.
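The masked cross-attention idea can be pictured as ordinary cross-attention between image features (queries) and per-blob embeddings (keys/values), restricted so that each spatial location only attends to the blobs whose region covers it. The sketch below is a minimal illustration, not the paper's module: the function name, tensor shapes, and the assumption that each blob's region has already been rasterized into a binary mask are all illustrative.

```python
import torch

def masked_cross_attention(visual_feats, blob_embs, blob_masks):
    """Minimal sketch of region-masked cross-attention (shapes and API are illustrative).

    visual_feats: (B, N, D) flattened image features, N = H * W spatial locations
    blob_embs:    (B, K, D) one embedding per blob (e.g., an encoded blob description)
    blob_masks:   (B, K, N) binary masks, 1 where blob k covers spatial location n
    """
    B, N, D = visual_feats.shape
    scale = D ** -0.5

    # Image features act as queries; blob embeddings act as keys and values.
    attn = torch.einsum("bnd,bkd->bnk", visual_feats, blob_embs) * scale  # (B, N, K)

    # Restrict each location to the blobs that cover it; everything else is masked out.
    keep = blob_masks.transpose(1, 2).bool()                              # (B, N, K)
    attn = attn.masked_fill(~keep, float("-inf")).softmax(dim=-1)

    # Locations covered by no blob become NaN after the softmax; zero their output.
    attn = torch.nan_to_num(attn, nan=0.0)

    return torch.einsum("bnk,bkd->bnd", attn, blob_embs)                  # (B, N, D)
```

In the paper this fusion sits inside the diffusion model's attention layers; the point of the masking is that editing one blob's description or parameters only perturbs the visual features inside that blob's region.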
Result
BlobGEN achieves superior zero-shot generation quality on MS-COCO, with lower FID than baseline models. It also shows strong layout-guided controllability, evidenced by higher region-level CLIP scores and by successful object editing and repositioning. When augmented with LLMs, BlobGEN excels at compositional generation, surpassing LayoutGPT in numerical and spatial accuracy on the NSR-1K benchmark.
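A region-level CLIP score of the kind cited above can be approximated by cropping each grounded region from the generated image and scoring the crop against its blob description with an off-the-shelf CLIP model. The helper below is a sketch under that assumption; the checkpoint choice, function name, and exact evaluation protocol used in the paper may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the paper's evaluation may use a different CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def region_clip_score(image: Image.Image, regions) -> float:
    """Average CLIP similarity between each cropped region and its text description.

    regions: list of ((left, top, right, bottom), description) pairs in pixel coordinates.
    """
    scores = []
    for bbox, text in regions:
        crop = image.crop(bbox)
        inputs = processor(text=[text], images=crop, return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        scores.append((img_emb * txt_emb).sum().item())
    return sum(scores) / len(scores)
```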
Limitations & Future Work
Limitations include the inability to perfectly reconstruct images from blobs alone, occasional failures in image editing, and limited robustness to LLM-generated blobs in compositional tasks. Future work could explore combining blob grounding with inversion methods for more faithful reconstruction, applying more advanced editing techniques to reduce editing failures, and tightening the integration between LLMs and blob-grounded generation.
Abstract
Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives - denoted as dense blob representations - that contain fine-grained details of the scene while being modular, human-interpretable, and easy-to-construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. Particularly, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page: https://blobgen-2d.github.io.
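The in-context learning step mentioned in the abstract can be pictured as a few-shot prompt that maps captions to lists of blobs, which the LLM then completes for a new caption. The sketch below is purely illustrative: the exemplar content, the normalized ellipse-style parameters (cx, cy, a, b, angle), and the pipe-separated output schema are assumptions, not the paper's actual prompt format.

```python
# Sketch of a few-shot prompt asking an LLM to decompose a caption into blobs.
# The exemplar schema below is an illustrative assumption, not the paper's exact prompt.

EXEMPLARS = [
    {
        "caption": "a cat sitting on a wooden table",
        "blobs": [
            ("a gray tabby cat sitting upright", [0.55, 0.40, 0.18, 0.25, 0.0]),
            ("a rustic wooden table surface", [0.50, 0.75, 0.45, 0.15, 0.0]),
        ],
    },
]

def build_blob_prompt(caption: str) -> str:
    lines = ["Decompose each caption into blobs, one line per object, formatted as",
             "'Blob: <description> | cx cy a b angle' with coordinates in [0, 1].", ""]
    for ex in EXEMPLARS:
        lines.append(f"Caption: {ex['caption']}")
        for desc, params in ex["blobs"]:
            lines.append(f"Blob: {desc} | " + " ".join(f"{p:.2f}" for p in params))
        lines.append("")
    lines.append(f"Caption: {caption}")
    lines.append("Blob:")
    return "\n".join(lines)

print(build_blob_prompt("two dogs playing on a sandy beach"))
```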