Return of Unconditional Generation: A Self-supervised Representation Generation Method

Authors: Tianhong Li, Dina Katabi, Kaiming He

What

This paper introduces Representation-Conditioned Generation (RCG), a novel framework for unconditional image generation that leverages self-supervised representations to guide the generation process, effectively closing the quality gap between unconditional and conditional generation.

Why

This paper matters because it addresses the long-standing quality gap between unconditional and conditional image generation. By exploiting self-supervised representations, it shows how high-quality generative models can be trained directly on large-scale unlabeled datasets.

How

The authors propose a three-stage approach: 1) a pre-trained self-supervised encoder (e.g., MoCo v3) maps images into a compact representation space; 2) a lightweight diffusion model learns to generate representations in this space; 3) a conditional image generator (e.g., ADM, DiT, or MAGE) produces images conditioned on these representations. At sampling time, RCG first draws a representation from the diffusion model and then synthesizes an image conditioned on it, so no labels are needed at any stage.
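The sketch below illustrates this pipeline with toy PyTorch modules. All class names, the representation dimensionality, and the single-call sample stubs are assumptions made for illustration; the actual architectures and diffusion sampling loops live in the official repository at https://github.com/LTH14/rcg.

```python
# Toy sketch of RCG's three stages (hypothetical names and shapes; the real
# models are a MoCo v3 encoder, an MLP-based representation diffusion model,
# and a pixel generator such as ADM/DiT/MAGE).
import torch
import torch.nn as nn

REP_DIM = 256  # assumed dimensionality of the self-supervised representation


class FrozenSSLEncoder(nn.Module):
    """Stage 1: frozen self-supervised encoder; used during training to produce
    target representations for stage 2 and conditioning inputs for stage 3."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(REP_DIM))

    @torch.no_grad()
    def forward(self, images):
        return self.backbone(images)


class RepresentationDiffusion(nn.Module):
    """Stage 2: lightweight diffusion model over the representation space.
    The iterative denoising loop is collapsed into a single stub call here."""

    def __init__(self):
        super().__init__()
        self.denoiser = nn.Sequential(
            nn.Linear(REP_DIM, 512), nn.SiLU(), nn.Linear(512, REP_DIM)
        )

    def sample(self, n):
        return self.denoiser(torch.randn(n, REP_DIM))


class ConditionalImageGenerator(nn.Module):
    """Stage 3: any pixel generator that accepts a representation as its condition."""

    def __init__(self):
        super().__init__()
        self.decoder = nn.Linear(REP_DIM, 3 * 32 * 32)

    def sample(self, reps):
        return self.decoder(reps).view(-1, 3, 32, 32)


# During training, stage 1 supplies ground-truth representations:
encoder = FrozenSSLEncoder()
target_reps = encoder(torch.randn(4, 3, 32, 32))

# Unconditional sampling: the generated representation plays the role that a
# class label plays in conditional generation, but no labels are involved.
rdm, pixel_gen = RepresentationDiffusion(), ConditionalImageGenerator()
reps = rdm.sample(4)
images = pixel_gen.sample(reps)
print(images.shape)  # torch.Size([4, 3, 32, 32])
```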

Result

RCG significantly improves unconditional generation quality across various image generators and datasets. It achieves state-of-the-art FID scores on ImageNet 256x256, surpassing previous unconditional methods and rivaling leading class-conditional methods. RCG also enables guidance in unconditional generation, further boosting performance. The method allows semantic interpolation by manipulating representations and can be easily extended to class-conditional generation.
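As a concrete illustration of the interpolation claim, the hedged sketch below blends two representations and decodes every intermediate point, reusing the hypothetical pixel_gen and reps objects from the earlier snippet. Plain linear blending is an assumption here, not necessarily the scheme used in the paper.

```python
# Semantic interpolation in representation space (assumed linear blending),
# reusing the hypothetical pixel_gen / reps objects from the previous sketch.
import torch


def interpolate_images(generator, rep_a, rep_b, steps=5):
    """Decode images whose conditioning moves gradually from rep_a to rep_b."""
    weights = torch.linspace(0.0, 1.0, steps)
    blended = torch.stack([(1 - w) * rep_a + w * rep_b for w in weights])
    return generator.sample(blended)


frames = interpolate_images(pixel_gen, reps[0], reps[1])
print(frames.shape)  # torch.Size([5, 3, 32, 32])
```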

Limitations & Future Work

The paper notes that while RCG generates diverse, high-quality images, it still struggles with text, regular shapes, and realistic humans, as do other ImageNet generative models. Future work could explore pre-training RCG on larger unlabeled datasets and adapting it to downstream generative tasks at minimal cost by training only the lightweight representation generator on small labeled datasets while keeping the image generator fixed.

Abstract

Unconditional generation — the problem of modeling data distribution without relying on human-annotated labels — is a long-standing and fundamental challenge in generative models, creating a potential of learning from large-scale unlabeled data. In the literature, the generation quality of an unconditional method has been much worse than that of its conditional counterpart. This gap can be attributed to the lack of semantic information provided by labels. In this work, we show that one can close this gap by generating semantic representations in the representation space produced by a self-supervised encoder. These representations can be used to condition the image generator. This framework, called Representation-Conditioned Generation (RCG), provides an effective solution to the unconditional generation problem without using labels. Through comprehensive experiments, we observe that RCG significantly improves unconditional generation quality: e.g., it achieves a new state-of-the-art FID of 2.15 on ImageNet 256x256, largely reducing the previous best of 5.91 by a relative 64%. Our unconditional results are situated in the same tier as the leading class-conditional ones. We hope these encouraging observations will attract the community’s attention to the fundamental problem of unconditional generation. Code is available at https://github.com/LTH14/rcg.