Diffusion Model with Perceptual Loss

Authors: Shanchuan Lin, Xiao Yang

What

This paper proposes a novel “self-perceptual” training objective for diffusion models that leverages the model itself as a perceptual network to improve the realism of generated images.

Why

This paper addresses the limitations of relying on classifier-free guidance to improve sample quality in diffusion models: guidance entangles quality with the conditional input, trading diversity for fidelity, and is unavailable for unconditional generation. The proposed method enhances realism without sacrificing diversity, works for both conditional and unconditional generation, and is integrated directly into the training process.
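
For context, classifier-free guidance sharpens samples at inference time by extrapolating from the unconditional toward the conditional prediction. This is the standard formulation, not notation taken from this paper:

```latex
\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)
```

A guidance scale w > 1 pushes samples toward high-likelihood regions of the condition c, which improves fidelity but reduces diversity and requires a conditional input, the two drawbacks the proposed objective avoids.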

How

The authors propose a “self-perceptual” objective in which a frozen copy of the diffusion model, pre-trained with the standard MSE loss, acts as the perceptual network. During training, the online model predicts the clean image from a noised input; the prediction and the ground-truth image are then re-noised to a randomly sampled timestep, both are passed through the frozen network, and the MSE between their hidden features is backpropagated to the online model.
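
A minimal PyTorch sketch of this training step follows, under stated assumptions: the model is called as model(x_t, t) and predicts x_0 directly, features are read from a hypothetical frozen.features(...) hook on the frozen copy, and the prediction and target are re-noised with shared noise. None of these names come from the paper’s code.

```python
# Sketch of the self-perceptual training step (assumptions noted above).
import torch
import torch.nn.functional as F

def add_noise(x0, noise, t, alphas_cumprod):
    # Standard forward diffusion: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

def self_perceptual_loss(online, frozen, x0, alphas_cumprod, num_steps=1000):
    b = x0.shape[0]

    # 1) Noise the clean image and let the online model predict x_0.
    t = torch.randint(0, num_steps, (b,), device=x0.device)
    xt = add_noise(x0, torch.randn_like(x0), t, alphas_cumprod)
    x0_pred = online(xt, t)  # assumed x0-prediction parameterization

    # 2) Re-noise prediction and target with the same fresh noise at a
    #    newly sampled timestep (shared noise is an assumption here).
    t2 = torch.randint(0, num_steps, (b,), device=x0.device)
    eps2 = torch.randn_like(x0)
    xt2_pred = add_noise(x0_pred, eps2, t2, alphas_cumprod)
    xt2_real = add_noise(x0, eps2, t2, alphas_cumprod)

    # 3) Measure MSE between hidden features of the frozen copy rather than
    #    between raw pixels; gradients flow through the frozen network's
    #    forward pass into `online` only.
    with torch.no_grad():
        h_real = frozen.features(xt2_real, t2)  # hypothetical feature hook
    h_pred = frozen.features(xt2_pred, t2)
    return F.mse_loss(h_pred, h_real)

# Usage sketch: freeze a copy of the MSE-pretrained model once, e.g.
#   frozen = copy.deepcopy(online).eval().requires_grad_(False)
# then call self_perceptual_loss(...) inside the usual training loop.
```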

Result

The self-perceptual objective demonstrably improves the realism of generated images, both qualitatively and quantitatively (FID and Inception Score), compared to models trained solely with MSE loss, particularly for unconditional image generation. However, it does not yet surpass classifier-free guidance combined with MSE loss for text-to-image generation.

Limitations and Future Work

The authors acknowledge that the self-perceptual objective does not yet outperform classifier-free guidance in text-to-image generation. They also identify grid-like artifacts in the generated images. Future work could focus on refining the perceptual loss mechanism, exploring alternative distance functions, and eliminating these artifacts.

Abstract

Diffusion models trained with mean squared error loss tend to generate unrealistic samples. Current state-of-the-art models rely on classifier-free guidance to improve sample quality, yet its surprising effectiveness is not fully understood. In this paper, we show that the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance. As a result, we can directly incorporate perceptual loss in diffusion training to improve sample quality. Since the score matching objective used in diffusion training strongly resembles the denoising autoencoder objective used in unsupervised training of perceptual networks, the diffusion model itself is a perceptual network and can be used to generate meaningful perceptual loss. We propose a novel self-perceptual objective that results in diffusion models capable of generating more realistic samples. For conditional generation, our method only improves sample quality without entanglement with the conditional input and therefore does not sacrifice sample diversity. Our method can also improve sample quality for unconditional generation, which was not possible with classifier-free guidance before.
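
To make the resemblance claimed in the abstract concrete, here are the two objectives in their textbook forms (standard notation, not reproduced from the paper): the diffusion model’s noise-prediction loss and the denoising autoencoder loss both train a network to recover clean data from a corrupted input.

```latex
% Diffusion training objective (noise prediction), with
% x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon :
\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{x_0,\, \epsilon,\, t}
  \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2

% Denoising autoencoder objective over corrupted inputs \tilde{x}:
\mathcal{L}_{\mathrm{DAE}} = \mathbb{E}_{x,\, \tilde{x}}
  \left\| x - g_\phi\bigl(f_\phi(\tilde{x})\bigr) \right\|^2
```

Because both losses reduce to reconstructing clean data from corruption, the intermediate features of a trained diffusion model are perceptually meaningful, which is what lets the frozen copy serve as the perceptual network.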