ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Authors: Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, Gang Yu
What
This paper introduces ELLA, a lightweight adapter that equips text-to-image diffusion models with Large Language Models (LLMs) to enhance text alignment without retraining either the U-Net or the LLM; only the adapter itself is trained. It achieves this with a Timestep-Aware Semantic Connector (TSC) that dynamically extracts timestep-dependent conditions from the LLM to guide the diffusion process.
Why
This paper addresses the limitation of existing text-to-image diffusion models that struggle with comprehending and following long, dense prompts containing multiple objects, attributes, and relationships. ELLA provides an efficient and effective solution by leveraging the power of LLMs while remaining compatible with existing diffusion models and tools.
How
The authors propose a novel architecture, ELLA, that connects a frozen pre-trained LLM (e.g., T5-XL, LLaMA-2) with a frozen pre-trained diffusion model (e.g., Stable Diffusion). The key component, TSC, takes text features from the LLM and the current timestep embedding as input, dynamically extracting semantic information relevant to different stages of the denoising process. To train TSC, the authors constructed a large dataset of image-text pairs with dense captions generated by MLLMs. They also introduce a new benchmark, Dense Prompt Graph Benchmark (DPG-Bench), to evaluate models’ ability to follow dense prompts.
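The TSC described above can be pictured as a resampler-style cross-attention block whose latent queries are modulated by the current timestep embedding before attending to the frozen LLM's text features. Below is a minimal numpy sketch of that idea; the function and weight names (`tsc_block`, `Wq`, `Ws`, etc.) are illustrative assumptions, not the paper's actual implementation, and the AdaLN-style scale/shift conditioning is one plausible reading of "timestep-aware".

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tsc_block(queries, text_feats, t_emb, Wq, Wk, Wv, Ws, Wb):
    """One hypothetical TSC layer: timestep-modulated cross-attention.

    queries:    (n_q, d) learnable latent queries (the fixed-length output tokens)
    text_feats: (n_t, d) token features from the frozen LLM
    t_emb:      (d,)     embedding of the current denoising timestep
    """
    # AdaLN-style conditioning: timestep embedding produces a scale and shift
    scale = t_emb @ Ws
    shift = t_emb @ Wb
    q_mod = queries * (1 + scale) + shift
    # Cross-attention: modulated queries attend to the LLM text features,
    # so different timesteps can extract different semantics from the prompt.
    q, k, v = q_mod @ Wq, text_feats @ Wk, text_feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return queries + attn @ v  # residual connection

# Toy usage with random weights (shapes only; no trained parameters).
rng = np.random.default_rng(0)
d, n_q, n_t = 8, 4, 6
W = lambda: rng.standard_normal((d, d)) / np.sqrt(d)
out = tsc_block(rng.standard_normal((n_q, d)),
                rng.standard_normal((n_t, d)),
                rng.standard_normal((d,)),
                W(), W(), W(), W(), W())
```

Because the block always emits `n_q` output tokens regardless of prompt length, the frozen U-Net sees a fixed-size condition, which is what lets ELLA plug into an unmodified diffusion backbone.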
Result
ELLA significantly improves the performance of existing diffusion models in following complex prompts. It outperforms state-of-the-art models on DPG-Bench and shows better text-image alignment than SDXL and PixArt-α in user studies while maintaining comparable aesthetic quality. ELLA's lightweight design allows easy integration with community models and downstream tools such as LoRA and ControlNet, enhancing their prompt-following capabilities. Ablation studies validate the choice of LLM, the TSC design, and the importance of incorporating timestep information.
Limitations & Future Work
The paper acknowledges limitations in the training captions, which are synthesized by MLLMs and may be unreliable for shapes and spatial relationships. The authors plan to address this by integrating MLLMs with diffusion models to exploit interleaved image-text input. Another limitation is that the frozen U-Net may constrain the aesthetic quality of generated images. Future work will focus on image editing capabilities and improving aesthetic quality.
Abstract
Diffusion models have demonstrated remarkable performance in the domain of text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts, encompassing multiple objects, detailed attributes, complex relationships, long-text alignment, etc. In this paper, we introduce an Efficient Large Language Model Adapter, termed ELLA, which equips text-to-image diffusion models with powerful Large Language Models (LLM) to enhance text alignment without training of either U-Net or LLM. To seamlessly bridge two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from LLM. Our approach adapts semantic features at different stages of the denoising process, assisting diffusion models in interpreting lengthy and intricate prompts over sampling timesteps. Additionally, ELLA can be readily incorporated with community models and tools to improve their prompt-following capabilities. To assess text-to-image models in dense prompt following, we introduce Dense Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K dense prompts. Extensive experiments demonstrate the superiority of ELLA in dense prompt following compared to state-of-the-art methods, particularly in multiple object compositions involving diverse attributes and relationships.