Controllable Image Generation With Composed Parallel Token Prediction
Authors: Jamie Stirling, Noura Al-Moubayed
What
This paper presents a novel method for controllable compositional image generation using discrete generative models, achieving state-of-the-art accuracy by composing log-probability outputs over the discrete latent spaces of VQ-VAE and VQ-GAN.
Why
This paper is important because it enables the composition of discrete generative processes for image generation, whereas previous compositional methods have focused on continuous models such as diffusion and energy-based models. This brings benefits in efficiency, interpretability, and controllability, which the authors demonstrate through state-of-the-art results on multiple datasets.
How
The authors derive a formulation for composing discrete generation processes by taking the product of the conditional probabilities of the individual concepts, assuming their independence (see the sketch below). They apply this to parallel token prediction, generating images by iteratively unmasking discrete VQ-VAE/VQ-GAN latent representations conditioned on multiple input attributes. They further introduce concept weighting to control the relative importance of the different conditions.
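To make this concrete, the following is a sketch of the composition rule implied by the description above, using notation introduced here (x for the discrete latent representation, c_i for the individual concepts, w_i for the concept weights); the paper's exact derivation and weighting scheme may differ.

```latex
% Sketch of the composition rule consistent with the description above,
% assuming the concepts c_1, ..., c_n are (conditionally) independent given x.
\begin{align}
  p(x \mid c_1, \dots, c_n)
    &\propto p(x) \prod_{i=1}^{n} p(c_i \mid x)
     = p(x) \prod_{i=1}^{n} \frac{p(x \mid c_i)}{p(x)}, \\
  \log \tilde{p}(x \mid c_1, \dots, c_n)
    &= \log p(x) + \sum_{i=1}^{n} w_i \bigl(\log p(x \mid c_i) - \log p(x)\bigr) + \text{const},
\end{align}
% where each term is evaluated per token over the discrete codebook and the
% weights w_i implement concept weighting (w_i = 1 recovers the plain product).
```

With all w_i = 1 this reduces to the unweighted product of conditionals; increasing a particular w_i biases sampling towards that concept.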
Result
The proposed method achieves state-of-the-art generation accuracy on FFHQ, Positional CLEVR, and Relational CLEVR datasets, surpassing previous methods while maintaining competitive FID scores. It also demonstrates strong generalization ability, including out-of-distribution generation and concept negation, while being significantly faster than comparable continuous compositional methods.
Limitations & Future Work
The authors acknowledge the limitations of assuming independence between input conditions and the increased computational cost compared to non-compositional approaches. Future work could explore methods for handling condition dependencies and optimizing concept weighting.
Abstract
Compositional image generation requires models to generalise well in situations where two or more input concepts do not necessarily appear together in training (compositional generalisation). Despite recent progress in compositional image generation via composing continuous sampling processes such as diffusion and energy-based models, composing discrete generative processes has remained an open challenge, with the promise of providing improvements in efficiency, interpretability and simplicity. To this end, we propose a formulation for controllable conditional generation of images via composing the log-probability outputs of discrete generative models of the latent space. Our approach, when applied alongside VQ-VAE and VQ-GAN, achieves state-of-the-art generation accuracy in three distinct settings (FFHQ, Positional CLEVR and Relational CLEVR) while attaining competitive Fréchet Inception Distance (FID) scores. Our method attains an average generation accuracy of across the studied settings. Our method also outperforms the next-best approach (ranked by accuracy) in terms of FID in seven out of nine experiments, with an average FID of (an average improvement of ). Furthermore, our method offers a to speedup over comparable continuous compositional methods on our hardware. We find that our method can generalise to combinations of input conditions that lie outside the training data (e.g. more objects per image) in addition to offering an interpretable dimension of controllability via concept weighting. We further demonstrate that our approach can be readily applied to an open pre-trained discrete text-to-image model without any fine-tuning, allowing for fine-grained control of text-to-image generation.
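As an illustration of how such a composition can be realised in practice, the following is a minimal sketch (not the authors' implementation) of a single composed parallel-unmasking step over VQ token logits. The `model` interface, the function name `composed_unmask_step`, and the tensor shapes are assumptions made for this example.

```python
# Minimal sketch of one composed parallel-unmasking step (not the authors'
# code). Assumes a hypothetical `model(tokens, mask, condition)` that returns
# per-position log-probabilities over the VQ codebook, shape (B, N, V).
import torch
import torch.nn.functional as F

def composed_unmask_step(model, tokens, mask, conditions, weights, temperature=1.0):
    """Resample the masked positions of `tokens` from a weighted log-space
    composition of per-condition predictions.

    tokens:     (B, N) long tensor of codebook indices (masked slots arbitrary)
    mask:       (B, N) bool tensor, True where a token is still masked
    conditions: list of per-concept conditioning inputs
    weights:    list of floats giving each concept's relative importance
    """
    # Unconditional prediction acts as the base distribution log p(x).
    base_logp = model(tokens, mask, condition=None)               # (B, N, V)

    # Composed logits: log p(x) + sum_i w_i * (log p(x | c_i) - log p(x)).
    composed = base_logp.clone()
    for cond, w in zip(conditions, weights):
        cond_logp = model(tokens, mask, condition=cond)           # (B, N, V)
        composed = composed + w * (cond_logp - base_logp)

    # Sample candidate tokens for every position, then keep them only where
    # the position is still masked. (A full sampler would also choose which
    # masked positions to commit at this step, e.g. by prediction confidence.)
    probs = F.softmax(composed / temperature, dim=-1)
    sampled = torch.distributions.Categorical(probs=probs).sample()  # (B, N)
    return torch.where(mask, sampled, tokens)
```

In a full sampler this step would be repeated over a masking schedule until no masked positions remain, after which the token grid is decoded to pixels by the VQ-VAE/VQ-GAN decoder.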