Diffusion Model as Representation Learner

Authors: Xingyi Yang, Xinchao Wang

What

This paper investigates the potential of Diffusion Probabilistic Models (DPMs) for representation learning and proposes RepFusion, a novel knowledge transfer method that leverages pre-trained DPMs to improve performance on recognition tasks such as image classification, semantic segmentation, and landmark detection.

Why

This paper is important because it explores the under-utilized representation learning capability of DPMs, going beyond their traditional generative applications. It offers a new perspective on leveraging pre-trained generative models for improved performance in discriminative tasks.

How

The authors first establish a theoretical connection between DPMs and denoising autoencoders, showing that the latent space of a DPM is inherently time-dependent. They then introduce RepFusion, which uses reinforcement learning to dynamically select the time step at which knowledge is distilled from a pre-trained DPM into a student network; the student is subsequently fine-tuned on the target recognition task.
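To make the denoising-autoencoder view concrete, below is a minimal, self-contained PyTorch sketch of probing a DPM's time-dependent features: an image is forward-diffused to a chosen step t, and the denoiser's intermediate activations at that step are read out as the representation. This is an illustration, not the paper's code; `TinyUNet`, `dpm_features`, and all hyperparameters are hypothetical placeholders, and in practice the denoiser would be an off-the-shelf pre-trained DPM.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Stand-in denoiser: encoder -> bottleneck -> decoder.

    The bottleneck activations serve as the 'representation' we probe.
    """
    def __init__(self, ch=3, width=64, n_steps=1000):
        super().__init__()
        self.t_embed = nn.Embedding(n_steps, width)   # time-step conditioning
        self.enc = nn.Conv2d(ch, width, 3, padding=1)
        self.mid = nn.Conv2d(width, width, 3, padding=1)
        self.dec = nn.Conv2d(width, ch, 3, padding=1)

    def forward(self, x_t, t, return_features=False):
        h = torch.relu(self.enc(x_t))
        h = h + self.t_embed(t)[:, :, None, None]     # inject the time step
        feat = torch.relu(self.mid(h))                # time-dependent features
        eps_pred = self.dec(feat)                     # predicted noise
        return (eps_pred, feat) if return_features else eps_pred

@torch.no_grad()
def dpm_features(unet, x0, t, betas):
    """Forward-diffuse x0 to x_t, then read the denoiser's features at step t."""
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    a = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * torch.randn_like(x0)
    _, feat = unet(x_t, t, return_features=True)
    return feat

unet = TinyUNet()                                     # would be a pre-trained DPM
betas = torch.linspace(1e-4, 2e-2, 1000)              # standard linear schedule
x0 = torch.randn(8, 3, 32, 32)                        # a batch of 'images'
t = torch.full((8,), 100, dtype=torch.long)           # probe at step t=100
feats = dpm_features(unet, x0, t, betas)
print(feats.shape)                                    # torch.Size([8, 64, 32, 32])
```

Because the extracted features change with t, which step to probe becomes a per-input design choice; this is precisely the choice RepFusion hands over to a learned, reinforcement-trained policy.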

Result

RepFusion consistently outperforms baseline models and other self-supervised learning methods on benchmarks including CIFAR-10, Tiny-ImageNet, CelebAMask-HQ, and WFLW. Notably, it yields clear gains on dense prediction tasks such as semantic segmentation and landmark detection, with the WFLW improvements most pronounced in challenging scenarios involving large pose variations and occlusions.

Limitations & Future Work

The paper notes that existing approaches to using DPMs for representation learning suffer from limitations such as requiring complex model modifications. As future work, the authors suggest exploring the time-step selection strategy further, and they highlight the need for a deeper understanding of the relationship between the chosen time step and the specific downstream task.

Abstract

Diffusion Probabilistic Models (DPMs) have recently demonstrated impressive results on various generative tasks. Despite this promise, the learned representations of pre-trained DPMs have not been fully understood. In this paper, we conduct an in-depth investigation of the representation power of DPMs and propose a novel knowledge transfer method that leverages the knowledge acquired by generative DPMs for recognition tasks. Our study begins by examining the feature space of DPMs, revealing that DPMs are inherently denoising autoencoders that balance representation learning with regularizing model capacity. Building on this insight, we introduce a novel knowledge transfer paradigm named RepFusion. Our paradigm extracts representations at different time steps from off-the-shelf DPMs and dynamically employs them as supervision for student networks, with the optimal time step determined through reinforcement learning. We evaluate our approach on several image classification, semantic segmentation, and landmark detection benchmarks, and demonstrate that it outperforms state-of-the-art methods. Our results uncover the potential of DPMs as a powerful tool for representation learning and provide insights into the usefulness of generative models beyond sample generation. The code is available at https://github.com/Adamdad/Repfusion.
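As a rough illustration of the training loop the abstract describes, the sketch below (reusing `TinyUNet` and `dpm_features` from the earlier snippet) couples a student network with a small policy that samples a time step per image; the teacher DPM's features at that step supervise the student, and the policy is updated with REINFORCE. The `Student` and `StepPolicy` modules, the candidate-step grid, and the reward (negative distillation error with a mean baseline) are illustrative assumptions that stand in for, rather than reproduce, the paper's actual architectures and reward design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CANDIDATE_T = torch.arange(0, 1000, 100)           # candidate steps {0, 100, ..., 900}

class Student(nn.Module):
    """Recognition backbone being distilled into (placeholder)."""
    def __init__(self, ch=3, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1))

    def forward(self, x):
        return self.net(x)

class StepPolicy(nn.Module):
    """Scores candidate time steps given the clean image: pi(t | x)."""
    def __init__(self, ch=3, n=len(CANDIDATE_T)):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, n))

    def forward(self, x):
        return self.head(x)                        # logits over candidate steps

teacher, student, policy = TinyUNet(), Student(), StepPolicy()
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
opt_p = torch.optim.Adam(policy.parameters(), lr=1e-3)
betas = torch.linspace(1e-4, 2e-2, 1000)
x0 = torch.randn(8, 3, 32, 32)                     # one training batch

# 1) Sample a time step per image from the policy.
dist = torch.distributions.Categorical(logits=policy(x0))
idx = dist.sample()
t = CANDIDATE_T[idx]

# 2) Distill the teacher's step-t features into the student
#    (gradients through the teacher are disabled inside dpm_features).
feat_t = dpm_features(teacher, x0, t, betas)
feat_s = student(x0)                               # student sees the clean image
loss_kd = F.mse_loss(feat_s, feat_t)
opt_s.zero_grad()
loss_kd.backward()
opt_s.step()

# 3) REINFORCE update for the policy; this reward is a simplified stand-in.
with torch.no_grad():
    err = F.mse_loss(student(x0), feat_t, reduction='none').mean(dim=(1, 2, 3))
    reward = -err
    reward = reward - reward.mean()                # mean baseline for variance reduction
loss_p = -(dist.log_prob(idx) * reward).mean()
opt_p.zero_grad()
loss_p.backward()
opt_p.step()
```

After distillation, the student would be fine-tuned on the downstream recognition task as described in the "How" section above.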