Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

Authors: Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov

What

This paper introduces Kandinsky, a text-to-image generation model that combines an image prior model with a latent diffusion architecture, and demonstrates its capabilities through multiple generation modes and the best measured image quality among open-source models.
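
At a high level, generation runs in two stages: the image prior maps a CLIP text embedding to a CLIP image embedding, and a latent diffusion decoder conditioned on that embedding produces latents that the MoVQ autoencoder decodes into pixels. The sketch below only illustrates this dataflow; the module classes, dimensions, and single denoising step are stand-ins, not the released implementation.

```python
import torch
import torch.nn as nn

# Stand-in dimensions (illustrative only; the real CLIP/MoVQ sizes differ).
TXT_DIM, IMG_DIM, LATENT_SHAPE = 768, 768, (4, 32, 32)

class StubPrior(nn.Module):
    """Placeholder for the image prior: text embedding -> image embedding."""
    def __init__(self):
        super().__init__()
        self.map = nn.Linear(TXT_DIM, IMG_DIM)
    def forward(self, text_emb):
        return self.map(text_emb)

class StubLatentDiffusion(nn.Module):
    """Placeholder for the UNet denoiser; here a single refinement step."""
    def __init__(self):
        super().__init__()
        self.cond = nn.Linear(IMG_DIM, LATENT_SHAPE[0])
    def forward(self, noisy_latent, image_emb):
        bias = self.cond(image_emb)[:, :, None, None]
        return noisy_latent + bias  # the real model runs many denoising steps

class StubMoVQDecoder(nn.Module):
    """Placeholder for the MoVQ decoder: latent -> RGB image."""
    def __init__(self):
        super().__init__()
        self.to_rgb = nn.Conv2d(LATENT_SHAPE[0], 3, kernel_size=1)
    def forward(self, latent):
        return self.to_rgb(latent)

def generate(text_emb):
    prior, unet, decoder = StubPrior(), StubLatentDiffusion(), StubMoVQDecoder()
    image_emb = prior(text_emb)                       # stage 1: image prior
    latent = torch.randn(text_emb.shape[0], *LATENT_SHAPE)
    latent = unet(latent, image_emb)                  # stage 2: latent diffusion
    return decoder(latent)                            # stage 3: MoVQ decoding

print(generate(torch.randn(1, TXT_DIM)).shape)  # torch.Size([1, 3, 32, 32])
```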

Why

This paper matters because it pairs an image prior with latent diffusion for text-to-image generation, achieves the best FID among open-source models, and ships a fully open-source implementation together with user-friendly tools such as a web application and a Telegram bot, making the model accessible for a wide range of applications.

How

The authors built Kandinsky by training an image prior model that maps CLIP text embeddings to CLIP image embeddings and by using a modified MoVQ implementation as the image autoencoder. They compared configurations, including several image prior setups and the effect of latent quantization, on the COCO-30K dataset using FID-CLIP curves and human evaluation.
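
Since the paper finds that a simple linear mapping works well as the image prior (see Result below), a minimal sketch of fitting such a prior on precomputed CLIP embedding pairs could look as follows. The embedding dimension, optimizer settings, and random stand-in data are assumptions, not the authors' exact training setup.

```python
import torch
import torch.nn as nn

EMB_DIM = 768  # assumed CLIP embedding size; the released model may differ

# Paired CLIP embeddings for (caption, image) pairs, normally precomputed
# offline; random stand-ins here so the sketch runs end to end.
text_embs = torch.randn(10_000, EMB_DIM)
image_embs = torch.randn(10_000, EMB_DIM)

# The "linear prior": a single affine map from text space to image space.
prior = nn.Linear(EMB_DIM, EMB_DIM)
optimizer = torch.optim.Adam(prior.parameters(), lr=1e-3)

for step in range(1_000):
    idx = torch.randint(0, text_embs.shape[0], (256,))
    pred = prior(text_embs[idx])
    loss = nn.functional.mse_loss(pred, image_embs[idx])  # regress image embeddings
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At inference time, the predicted image embedding conditions the diffusion decoder.
predicted_image_emb = prior(torch.randn(1, EMB_DIM))
```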

Result

Kandinsky achieved a FID score of 8.03 on the COCO-30K dataset, making it the best open-source model in terms of measurable image generation quality. Among the prior configurations tested, a simple linear mapping yielded the best FID, suggesting a possibly linear relationship between the visual and textual embedding spaces of CLIP. Quantization of latent codes in MoVQ also slightly improved image quality.
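
The FID-CLIP comparison in the paper plots FID against CLIP score on COCO-30K across guidance scales. A rough sketch of computing both metrics with torchmetrics (which needs the torch-fidelity and transformers extras installed) is shown below; the random tensors and placeholder captions stand in for real reference and generated images, and the CLIP checkpoint is the torchmetrics default rather than the paper's exact evaluation setup.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

# Stand-in batches: in a real evaluation these would be COCO-30K reference
# images and images generated from the corresponding captions.
real_images = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
captions = ["a photo of a cat"] * 16  # placeholder captions

fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())

# CLIP score measures text-image agreement; plotting it against FID over a
# sweep of guidance scales gives the FID-CLIP curves used in the paper.
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
print("CLIP score:", clip_score(fake_images, captions).item())
```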

Limitations and Future Work

The stated limitations concern semantic coherence between text prompts and generated images, which still needs improvement, as do FID scores and human-evaluated image quality. Future work will explore newer image encoders, more efficient UNet architectures, better text prompt understanding, higher-resolution generation, and new features such as local image editing, while also addressing the potential for generating harmful content.

Abstract

Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, there are diffusion-based models that have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky, a novel exploration of latent diffusion architecture, combining the principles of the image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.
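
One way to try the released Kandinsky 2.1 checkpoints is through the community integration in Hugging Face diffusers rather than the authors' own repository; the pipeline class names and checkpoint IDs below are assumptions that should be checked against the current model cards.

```python
import torch
from diffusers import KandinskyPriorPipeline, KandinskyPipeline

device = "cuda"  # assumes a CUDA GPU; drop float16 to run on CPU

# Stage 1: the image prior maps the text prompt to a CLIP image embedding.
prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
).to(device)

prompt = "a portrait of a red fox in the style of Kandinsky"
prior_out = prior(prompt)

# Stage 2: the latent diffusion decoder turns the image embedding into pixels
# via the MoVQ autoencoder.
decoder = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
).to(device)

image = decoder(
    prompt=prompt,
    image_embeds=prior_out.image_embeds,
    negative_image_embeds=prior_out.negative_image_embeds,
    height=768,
    width=768,
).images[0]
image.save("kandinsky_sample.png")
```

The two pipelines mirror the design described in the abstract: the prior produces a CLIP image embedding from the prompt, and the decoder runs latent diffusion followed by MoVQ decoding to obtain the final image.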