Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation
Authors: Shihao Zhao, Shaozhe Hao, Bojia Zi, Huaizhe Xu, Kwan-Yee K. Wong
What
This paper introduces LaVi-Bridge, a framework for text-to-image diffusion models that seamlessly integrates diverse pre-trained language models and generative vision models.
Why
Language and vision models are advancing rapidly in their respective domains, yet existing text-to-image diffusion models are built around fixed components, making it difficult to swap in newer ones. LaVi-Bridge addresses this challenge by offering a flexible and efficient way to combine diverse models, potentially leading to significant improvements in text-to-image generation.
How
LaVi-Bridge employs LoRA (Low-Rank Adaptation) to inject a small number of trainable parameters into the pre-trained language and vision models without altering their original weights. An adapter then bridges the two modules, aligning the language model's text features with the conditioning inputs of the vision model. The framework is trained on a dataset of text-image pairs, enabling the integrated models to generate coherent and contextually relevant images from textual prompts.
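Below is a minimal, hypothetical PyTorch sketch of the two mechanisms: (1) wrapping a frozen linear layer with a trainable LoRA update, and (2) an adapter that projects language-model features into the width the vision model's cross-attention expects. The class names, layer shapes, and dimensions (e.g., `FeatureAdapter`, 4096 for Llama-2, 768 for the vision model) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x)); only A and B are trained."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # original weights stay intact
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # update starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class FeatureAdapter(nn.Module):
    """Hypothetical adapter: maps language-model token features to the
    width the vision model's cross-attention layers expect."""
    def __init__(self, text_dim: int, vision_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, vision_dim),
            nn.GELU(),
            nn.Linear(vision_dim, vision_dim),
        )

    def forward(self, text_features: torch.Tensor) -> torch.Tensor:
        return self.proj(text_features)

# Usage sketch: bridge 4096-dim Llama-2 hidden states to a 768-dim vision model.
adapter = FeatureAdapter(text_dim=4096, vision_dim=768)
tokens = torch.randn(1, 77, 4096)              # stand-in for LM hidden states
cond = adapter(tokens)                         # -> (1, 77, 768) conditioning
```

Under this setup, only the LoRA matrices and the adapter would be optimized on text-image pairs; the language and vision backbones remain frozen, which is what makes the combination plug-and-play.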
Result
Experiments demonstrate LaVi-Bridge’s effectiveness in integrating various language models (CLIP, T5 series, Llama-2) and vision models (U-Net, Vision Transformer). Notably, incorporating superior models leads to enhanced performance, such as improved semantic understanding with advanced language models (e.g., Llama-2) and enhanced image quality and aesthetics with powerful vision models (e.g., PixArt’s Transformer).
Limitations and Future Work
The authors acknowledge that, while LaVi-Bridge shows promising results, applying it to the same models and weights used by an existing text-to-image diffusion model may not always yield significant improvements. They emphasize that LaVi-Bridge primarily aims to integrate diverse language and vision models, so that more advanced models can be plugged in for potential performance gains. Future work could explore larger and more diverse training datasets to improve LaVi-Bridge's versatility and address limitations tied to training-data diversity.
Abstract
Text-to-image generation has made significant advancements with the introduction of text-to-image diffusion models. These models typically consist of a language model that interprets user prompts and a vision model that generates corresponding images. As language and vision models continue to progress in their respective domains, there is great potential in exploring the replacement of components in text-to-image diffusion models with more advanced counterparts. A broader research objective would therefore be to investigate the integration of any two unrelated language and generative vision models for text-to-image generation. In this paper, we explore this objective and propose LaVi-Bridge, a pipeline that enables the integration of diverse pre-trained language models and generative vision models for text-to-image generation. By leveraging LoRA and adapters, LaVi-Bridge offers a flexible and plug-and-play approach without requiring modifications to the original weights of the language and vision models. Our pipeline is compatible with various language models and generative vision models, accommodating different structures. Within this framework, we demonstrate that incorporating superior modules, such as more advanced language models or generative vision models, results in notable improvements in capabilities like text alignment or image quality. Extensive evaluations have been conducted to verify the effectiveness of LaVi-Bridge. Code is available at https://github.com/ShihaoZhaoZSH/LaVi-Bridge.