AltDiffusion: A Multilingual Text-to-Image Diffusion Model

Authors: Fulong Ye, Guang Liu, Xinya Wu, Ledell Wu

What

This paper introduces AltDiffusion, a novel multilingual text-to-image diffusion model capable of generating images from prompts in eighteen different languages.

Why

This paper matters because it addresses the language limitations of existing text-to-image models. Supporting more languages makes these models accessible to a wider global audience and improves their handling of prompts that contain culture-specific concepts.

How

The authors first train a multilingual text encoder using knowledge distillation from a pre-trained English CLIP model. They then plug this encoder into a pre-trained English-only diffusion model and fine-tune the model with a two-stage training schema. The first stage (concept alignment) aligns the embedding space of the new text encoder with the diffusion model, while the second stage (quality improvement) raises the quality of generated images using a high-quality multilingual dataset and classifier-free guidance.
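The two generic techniques named above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not state which distillation loss is used, so mean squared error between student and teacher embeddings is an assumption here, and the function names `distillation_loss` and `classifier_free_guidance` are hypothetical. The guidance function shows the commonly used sampling-time formula for classifier-free guidance, not the paper's training procedure.

```python
from typing import List, Sequence

Vector = List[float]


def distillation_loss(student: Sequence[Vector], teacher: Sequence[Vector]) -> float:
    """Mean squared error between student and teacher text embeddings.

    Assumption: MSE is the distillation objective. The summary only says
    "knowledge distillation" and does not name the loss.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher batches differ in size")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("embedding dimensions differ")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty batch")
    return total / count


def classifier_free_guidance(eps_uncond: Vector, eps_cond: Vector, scale: float) -> Vector:
    """Combine unconditional and text-conditioned noise predictions.

    Uses the common formulation
        eps = eps_uncond + scale * (eps_cond - eps_uncond)
    where scale = 1 returns the conditional prediction unchanged.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("noise predictions differ in length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    # Toy embeddings: one sentence, two dimensions.
    print(distillation_loss([[1.0, 2.0]], [[1.0, 4.0]]))
    # Toy noise predictions with a guidance scale of 2.
    print(classifier_free_guidance([0.0, 1.0], [1.0, 3.0], 2.0))
```

In a real system the vectors would be tensors produced by the text encoders and the denoising network. Plain lists are used here only to keep the sketch self-contained.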

Result

AltDiffusion outperforms existing multilingual text-to-image models in both image quality and multilingual understanding. Compared with the English Stable Diffusion model, it achieves comparable results on general prompts and performs better on prompts containing culture-specific concepts.

Limitations and Future Work

The paper does not explicitly mention limitations, but future work could explore expanding the model to support more languages, improving the generation quality for certain languages, and further evaluating the model’s capabilities in different downstream applications.

Abstract

Large Text-to-Image (T2I) diffusion models have shown a remarkable capability to produce photorealistic and diverse images based on text inputs. However, existing works only support limited language input, e.g., English, Chinese, and Japanese, leaving users beyond these languages underserved and blocking the global expansion of T2I models. Therefore, this paper presents AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages. Specifically, we first train a multilingual text encoder based on knowledge distillation. Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage schema to enhance the multilingual capability, including a concept alignment stage and a quality improvement stage on a large-scale multilingual dataset. Furthermore, we introduce a new benchmark, which includes the Multilingual-General-18 (MG-18) and Multilingual-Cultural-18 (MC-18) datasets, to evaluate the capabilities of T2I diffusion models for generating high-quality images and capturing culture-specific concepts in different languages. Experimental results on both MG-18 and MC-18 demonstrate that AltDiffusion outperforms current state-of-the-art T2I models, e.g., Stable Diffusion, in multilingual understanding, especially with respect to culture-specific concepts, while still having comparable capability for generating high-quality images. All source code and checkpoints can be found at https://github.com/superhero-7/AltDiffuson.