GIVT: Generative Infinite-Vocabulary Transformers

Authors: Michael Tschannen, Cian Eastwood, Fabian Mentzer

What

This paper introduces GIVT (Generative Infinite-Vocabulary Transformer), a decoder-only transformer architecture that generates sequences of real-valued vectors, eliminating the quantization step required by previous methods such as VQ-GAN and MaskGIT.

Why

This work is significant because it presents the first successful use of decoder-only transformers to generate continuous, unquantized vector sequences, thereby avoiding the limitations of VQ-based methods. It paves the way for more efficient, higher-quality image generation and representation learning, and is directly applicable to multimodal interleaved modeling.

How

The authors modify the standard transformer decoder architecture in two ways: the input embedding lookup table is replaced with a linear projection of the real-valued input vectors, and the output layer predicts the parameters of a Gaussian mixture model (GMM) instead of categorical logits. They train GIVT on the latent space of a β-VAE, both with teacher forcing (causal modeling) and with MaskGIT-style masked modeling, and explore sampling techniques such as temperature sampling, beam search, and a novel distribution-based classifier-free guidance (DB-CFG).
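
To make the two modifications concrete, here is a minimal PyTorch sketch (not the authors' implementation; class names, dimensions, and the number of mixture components are illustrative assumptions). It shows a decoder-only transformer whose input embedding table is replaced by a linear projection and whose output head predicts the parameters of a mixture of diagonal-covariance Gaussians, one simple way to parameterize the multivariate GMM described in the paper, together with a teacher-forcing negative log-likelihood loss. The masked-modeling variant and DB-CFG are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GIVTDecoder(nn.Module):
    """Decoder-only transformer over real-valued latent vectors (sketch)."""

    def __init__(self, d_latent=16, d_model=512, n_layers=8, n_heads=8, n_mix=16):
        super().__init__()
        self.n_mix, self.d_latent = n_mix, d_latent
        # Modification 1: a linear projection of the real-valued input vectors
        # replaces the finite-vocabulary embedding lookup table.
        self.in_proj = nn.Linear(d_latent, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        # Modification 2: the output head predicts GMM parameters
        # (mixture logits, means, scales) instead of categorical logits.
        self.out_proj = nn.Linear(d_model, n_mix + 2 * n_mix * d_latent)

    def forward(self, x):
        # x: (batch, seq_len, d_latent) real-valued latent sequence.
        b, t, _ = x.shape
        # Causal mask so each position only attends to previous latents.
        mask = torch.triu(torch.full((t, t), float("-inf"), device=x.device), 1)
        h = self.blocks(self.in_proj(x), mask=mask)
        p = self.out_proj(h)
        logits = p[..., : self.n_mix]                       # (b, t, n_mix)
        mu, raw_sigma = p[..., self.n_mix:].chunk(2, dim=-1)
        mu = mu.reshape(b, t, self.n_mix, self.d_latent)
        sigma = F.softplus(raw_sigma).reshape(b, t, self.n_mix, self.d_latent)
        return logits, mu, sigma


def gmm_nll(logits, mu, sigma, target):
    """Teacher-forcing loss: NLL of the next latent vector under the GMM."""
    comp = torch.distributions.Independent(
        torch.distributions.Normal(mu, sigma), 1)
    mix = torch.distributions.MixtureSameFamily(
        torch.distributions.Categorical(logits=logits), comp)
    return -mix.log_prob(target).mean()
```

In this sketch the β-VAE is assumed to be trained beforehand and frozen; GIVT is then trained to model its latent sequences.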

Result

GIVT outperforms VQ-GAN, MaskGIT, and some diffusion models in class-conditional image generation on ImageNet, achieving comparable image quality with a smaller model and faster sampling. Notably, GIVT also shows competitive performance in representation learning and in dense prediction tasks such as panoptic segmentation and depth estimation within the UViM framework.

Limitations and Future Work

Limitations include the challenge of training the VAE and GIVT end-to-end, which is left for future work. The authors also suggest exploring applications of GIVT to other data modalities, such as audio and time-series modeling.

Abstract

We introduce generative infinite-vocabulary transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a β-VAE. In class-conditional image generation GIVT outperforms VQ-GAN (and improved variants thereof) as well as MaskGIT, and achieves performance competitive with recent latent diffusion models. Finally, we obtain strong results outside of image generation when applying GIVT to panoptic segmentation and depth estimation with a VAE variant of the UViM framework.
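
For completeness, below is a minimal sketch of autoregressive sampling with such a model, assuming the hypothetical GIVTDecoder from the earlier snippet. The zero begin-of-sequence vector and the variance-scaling temperature are illustrative assumptions rather than the paper's exact sampling recipe; class conditioning and DB-CFG are omitted.

```python
import torch


@torch.no_grad()
def generate(model, seq_len=256, d_latent=16, temperature=0.95, device="cpu"):
    # Start from a single zero vector as a stand-in "begin-of-sequence" input
    # (an assumption; the actual model may use a learned BOS or class token).
    seq = torch.zeros(1, 1, d_latent, device=device)
    for _ in range(seq_len):
        logits, mu, sigma = model(seq)
        # Keep only the distribution predicted at the last position.
        logits, mu, sigma = logits[:, -1], mu[:, -1], sigma[:, -1]
        # Variance scaling as a simple stand-in for temperature sampling.
        comp = torch.distributions.Independent(
            torch.distributions.Normal(mu, sigma * temperature), 1)
        mix = torch.distributions.MixtureSameFamily(
            torch.distributions.Categorical(logits=logits), comp)
        nxt = mix.sample()                         # (1, d_latent)
        seq = torch.cat([seq, nxt[:, None]], dim=1)
    # Drop the BOS vector; the result would be reshaped to the VAE latent grid
    # and decoded to an image with the frozen VAE decoder.
    return seq[:, 1:]
```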