PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation

Authors: Jian Ma, Chen Chen, Qingsong Xie, Haonan Lu

What

This paper introduces PEA-Diffusion, a novel method using a plug-and-play adapter and knowledge distillation to adapt English-based text-to-image diffusion models for non-English languages and culture-specific image generation.

Why

This paper is important because it addresses the limitations of current text-to-image models that primarily focus on English, making them accessible to non-English speakers and enabling the generation of culturally relevant images.

How

The authors propose PEA-Diffusion, which pairs a lightweight MLP adapter with knowledge distillation from a pre-trained English diffusion model (Stable Diffusion) to guide a non-English counterpart. The adapter maps the output of a non-English text encoder into the embedding space expected by the frozen UNet, and the English teacher's predictions on parallel prompts supervise the student's predictions on the corresponding non-English prompts. The original model's parameters stay frozen; only the adapter is trained on a small parallel corpus, using a hybrid training strategy that leverages both parallel and culture-specific image-text pairs (see the sketch below).
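
The following is a minimal PyTorch-style sketch of this setup, under stated assumptions: a frozen Stable Diffusion UNet is conditioned either on the English teacher's text embeddings or on non-English embeddings passed through the trainable MLP adapter, and an MSE distillation loss aligns the two noise predictions. Module names, hidden dimensions, the simplified noising step, and the output-only distillation loss are illustrative placeholders, not the authors' released code.

```python
# Sketch of one PEA distillation step. Assumes the UNet and both text
# encoders are frozen (requires_grad=False); only the adapter is optimized.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEAAdapter(nn.Module):
    """MLP-like adapter mapping non-English text features into the
    embedding space the frozen English UNet expects (~6M parameters)."""
    def __init__(self, in_dim=1024, out_dim=768, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def distill_step(unet, teacher_text_encoder, student_text_encoder, adapter,
                 latents, timesteps, en_tokens, xx_tokens, optimizer):
    """One knowledge-distillation step on a parallel (English, non-English) pair."""
    with torch.no_grad():
        teacher_emb = teacher_text_encoder(**en_tokens).last_hidden_state
        student_raw = student_text_encoder(**xx_tokens).last_hidden_state

    noise = torch.randn_like(latents)
    noisy_latents = latents + noise  # stand-in for the real noise scheduler

    with torch.no_grad():  # teacher path: English embeddings, frozen UNet
        teacher_pred = unet(noisy_latents, timesteps,
                            encoder_hidden_states=teacher_emb).sample

    student_emb = adapter(student_raw)  # only trainable path
    student_pred = unet(noisy_latents, timesteps,
                        encoder_hidden_states=student_emb).sample

    loss = F.mse_loss(student_pred, teacher_pred)  # output-level distillation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # optimizer holds adapter parameters only
    return loss.item()
```

In practice the optimizer would wrap only `adapter.parameters()`, which is what keeps the training cost low relative to fine-tuning or retraining the UNet.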

Result

PEA-Diffusion achieves significant improvements over baseline methods like translation, AltDiffusion, and GlueGen, particularly in generating culturally relevant images. It demonstrates superior performance on CLIPScore for culture-specific prompts, retains strong performance on general prompts, and exhibits low training costs and plug-and-play capabilities with other downstream tasks like LoRA, ControlNet, and Inpainting.
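
A hypothetical sketch of the plug-and-play usage is shown below: the trained adapter sits in front of an off-the-shelf Stable Diffusion pipeline and supplies conditioning through the `prompt_embeds` argument, so LoRA, ControlNet, or Inpainting attachments to the pipeline are unaffected. The multilingual text-encoder name, adapter checkpoint path, and base-model choice are placeholders, not the authors' released artifacts; `PEAAdapter` is the class from the training sketch above.

```python
# Illustrative inference with the adapter plugged into a stock SD pipeline.
import torch
from diffusers import StableDiffusionPipeline
from transformers import AutoTokenizer, AutoModel

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Placeholder multilingual text encoder and trained adapter weights.
tokenizer = AutoTokenizer.from_pretrained("path/to/multilingual-clip")
text_encoder = AutoModel.from_pretrained("path/to/multilingual-clip").to("cuda").eval()
adapter = PEAAdapter().to("cuda").eval()
adapter.load_state_dict(torch.load("pea_adapter.pt"))

def encode(text):
    """Non-English text -> adapter -> embeddings in the UNet's expected space."""
    tok = tokenizer(text, padding="max_length", max_length=77,
                    truncation=True, return_tensors="pt").to("cuda")
    with torch.no_grad():
        return adapter(text_encoder(**tok).last_hidden_state).half()

prompt = "春节期间挂在门口的红灯笼"  # "red lanterns hung at the door during Spring Festival"
image = pipe(prompt_embeds=encode(prompt),
             negative_prompt_embeds=encode(""),
             num_inference_steps=30).images[0]
image.save("lantern.png")
```

Because the UNet itself is untouched, any plugin that operates on the UNet or the latents continues to work once the conditioning comes from the adapted embeddings.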

Limitations & Future Work

The paper acknowledges limitations in the performance of language-specific CLIP encoders, potentially hindering the model’s generalizability. Additionally, the approach is limited by the capabilities of the base English model. Future work aims to address these limitations and explore further improvements in both general and culture-specific image generation.

Abstract

Text-to-image diffusion models are well-known for their ability to generate realistic images based on textual prompts. However, the existing works have predominantly focused on English, lacking support for non-English text-to-image models. The most commonly used translation methods cannot solve the generation problem related to language culture, while training from scratch on a specific language dataset is prohibitively expensive. In this paper, we are inspired to propose a simple plug-and-play language transfer method based on knowledge distillation. All we need to do is train a lightweight MLP-like parameter-efficient adapter (PEA) with only 6M parameters under teacher knowledge distillation along with a small parallel data corpus. We are surprised to find that freezing the parameters of UNet can still achieve remarkable performance on the language-specific prompt evaluation set, demonstrating that PEA can stimulate the potential generation ability of the original UNet. Additionally, it closely approaches the performance of the English text-to-image model on a general prompt evaluation set. Furthermore, our adapter can be used as a plugin to achieve significant results in downstream tasks in cross-lingual text-to-image generation. Code will be available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion