Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis

Authors: Jiapeng Zhu, Ceyuan Yang, Kecheng Zheng, Yinghao Xu, Zifan Shi, Yujun Shen

What

This paper introduces Aurora, a text-to-image GAN model that leverages Sparse Mixture of Experts (MoE) to enhance model capacity and generate high-quality images from text descriptions.

Why

The paper addresses a key limitation of GANs in text-to-image synthesis: they are difficult to scale up to complex datasets and open-vocabulary text prompts. By incorporating Sparse MoE, Aurora is reported to achieve performance comparable to diffusion models while generating images faster. The released code and checkpoints also give the research community a starting point for further work on text-to-image generation with GANs.

How

Aurora's generator uses CLIP to encode the input text and a mapping network to process both the text and a latent code. A series of generative blocks, each consisting of a convolution block and an attention block, progressively increases the resolution of the generated image. The attention block employs MoE: a sparse router selects the most suitable expert for each feature point based on both the input feature and the text information. The model is trained progressively on the LAION2B-en and COYO-700M datasets with a combination of adversarial loss, matching-aware loss, multi-level CLIP loss, and MoE loss. The authors use reference FID scores as the indicator for moving from one training stage to the next, where each stage corresponds to a different image resolution.
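The routing step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the top-1 selection, the additive combination of feature and latent logits, the linear experts, and every name in it (`sparse_moe_layer`, `router_w`, `latent_w`, `expert_ws`) are assumptions made for illustration, and the MoE loss is left out.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sparse_moe_layer(features, latent, router_w, latent_w, expert_ws):
    """Route each feature point to a single expert (top-1 routing).

    features:  (N, C) one row per feature point (e.g. a spatial location)
    latent:    (D,)   text-integrated global latent code (assumed given)
    router_w:  (C, E) hypothetical router weights applied to the features
    latent_w:  (D, E) hypothetical router weights applied to the latent code
    expert_ws: (E, C, C) one linear map per expert (a simplifying assumption)
    """
    # Router logits depend on both the local feature and the global latent.
    logits = features @ router_w + latent @ latent_w   # (N, E)
    probs = softmax(logits, axis=-1)
    choice = probs.argmax(axis=-1)                     # chosen expert per point

    out = np.zeros_like(features)
    for e in range(expert_ws.shape[0]):
        mask = choice == e
        if mask.any():
            # Only the selected expert runs for these points; its output is
            # scaled by the router probability of that expert.
            out[mask] = (features[mask] @ expert_ws[e]) * probs[mask, e:e + 1]
    return out, choice


rng = np.random.default_rng(0)
N, C, D, E = 16, 8, 4, 3
features = rng.normal(size=(N, C))
latent = rng.normal(size=D)
router_w = rng.normal(size=(C, E))
latent_w = rng.normal(size=(D, E))
expert_ws = rng.normal(size=(E, C, C))

out, choice = sparse_moe_layer(features, latent, router_w, latent_w, expert_ws)
print(out.shape, np.bincount(choice, minlength=E))
```

Because each feature point activates only one expert, the per-point compute stays close to that of a single expert while the total parameter count grows with the number of experts, which is the general appeal of sparse MoE.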

Result

Aurora achieves a zero-shot FID of 6.2 on MS COCO at 64x64 resolution, demonstrating its capability for open-vocabulary text-to-image synthesis. The authors also found that the sparse router effectively clusters pixels with similar visual concepts. They further observed unexpected behavior during latent space interpolation, which points to a potential research direction: disentangling the text condition from the sampling stochasticity.

Limitations and Future Work

The paper acknowledges limitations in latent space interpolation, attributing them to the absence of perceptual path length regularization and potential dominance of text tokens over the global latent code. Future work includes investigating these issues, exploring better text information injection methods, and improving the model’s performance and functionality using cleaner, higher-quality datasets.

Abstract

Due to the difficulty in scaling up, generative adversarial networks (GANs) seem to be falling from grace on the task of text-conditioned image synthesis. Sparsely-activated mixture-of-experts (MoE) has recently been demonstrated as a valid solution to training large-scale models with limited computational resources. Inspired by such a philosophy, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router to help select the most suitable expert for each feature point. To faithfully decode the sampling stochasticity and the text condition to the final synthesis, our router adaptively makes its decision by taking into account the text-integrated global latent code. At 64x64 image resolution, our model trained on LAION2B-en and COYO-700M achieves 6.2 zero-shot FID on MS COCO. We release the code and checkpoints to facilitate the community for further development.