Diffusion Model Alignment Using Direct Preference Optimization

Authors: Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik

What

This paper introduces Diffusion-DPO, a method for aligning text-to-image diffusion models with human preferences by optimizing them directly on pairwise comparison data, adapting the Direct Preference Optimization (DPO) technique from language models.
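For context, the original DPO objective for language models is a logistic loss on the log-likelihood ratios of a preferred response y^w and a rejected response y^l against a frozen reference policy; this is the classification objective that Diffusion-DPO adapts:

```latex
% DPO objective for language models (Rafailov et al.): \sigma is the logistic
% function, \beta a regularization strength, \pi_ref the frozen reference policy.
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(c,\,y^w,\,y^l)\sim\mathcal{D}}\Bigl[
    \log\sigma\Bigl(
        \beta\log\frac{\pi_\theta(y^w\mid c)}{\pi_{\mathrm{ref}}(y^w\mid c)}
      - \beta\log\frac{\pi_\theta(y^l\mid c)}{\pi_{\mathrm{ref}}(y^l\mid c)}
    \Bigr)\Bigr]
```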

Why

This paper is significant because it brings human-preference alignment, which has driven major gains in Large Language Models (LLMs), to text-to-image diffusion models, improving both the visual appeal and the text alignment of generated images.

How

The authors adapt DPO to diffusion models by defining a notion of data likelihood under the model and using the evidence lower bound (ELBO) to derive a differentiable objective. They demonstrate Diffusion-DPO by fine-tuning state-of-the-art text-to-image diffusion models such as Stable Diffusion XL (SDXL) on the Pick-a-Pic dataset and evaluating performance through human evaluation and automated metrics.
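Substituting the ELBO for the intractable exact likelihood turns the DPO objective into a comparison of denoising errors: relative to the frozen reference model ε_ref, the fine-tuned model ε_θ should denoise the preferred image better than the rejected one. Up to notational details, the derived objective takes roughly the following form, where ε^w and ε^l are the noise samples added to the preferred latent x_0^w and rejected latent x_0^l, and ω(λ_t) is an SNR-dependent weight:

```latex
% Diffusion-DPO objective (roughly as derived in the paper): a logistic loss on
% differences of denoising errors measured against the frozen reference model.
\mathcal{L}(\theta) = -\,\mathbb{E}\,\log\sigma\Bigl(-\beta T\,\omega(\lambda_t)\bigl(
      \|\epsilon^w-\epsilon_\theta(x_t^w,c,t)\|_2^2-\|\epsilon^w-\epsilon_{\mathrm{ref}}(x_t^w,c,t)\|_2^2
    -\bigl(\|\epsilon^l-\epsilon_\theta(x_t^l,c,t)\|_2^2-\|\epsilon^l-\epsilon_{\mathrm{ref}}(x_t^l,c,t)\|_2^2\bigr)\bigr)\Bigr)
```

The expectation runs over preference pairs (x_0^w, x_0^l) with shared prompt c, a timestep t ~ U(0, T), and the corresponding noised latents x_t^w and x_t^l.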

Result

Diffusion-DPO significantly improves both visual appeal and prompt alignment in generated images, outperforming even the larger SDXL pipeline that adds a refinement model. The authors also show that Diffusion-DPO can learn effectively from AI feedback, offering a path toward scaling this alignment method.

Limitations & Future Work

Limitations include ethical concerns about potential biases in the web-collected images and in crowdsourced user preferences. Future work includes dataset cleaning and scaling, online learning methods for DPO, and personalized tuning for individual or group preferences.

Abstract

Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users’ preferences. In contrast to LLMs, human preference learning has not been widely explored in text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model using carefully curated high quality images and captions to improve visual appeal and text alignment. We propose Diffusion-DPO, a method to align diffusion models to human preferences by directly optimizing on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO), a simpler alternative to RLHF which directly optimizes a policy that best satisfies human preferences under a classification objective. We re-formulate DPO to account for a diffusion model notion of likelihood, utilizing the evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. Our fine-tuned base model significantly outperforms both base SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement model in human evaluation, improving visual appeal and prompt alignment. We also develop a variant that uses AI feedback and has comparable performance to training on human preferences, opening the door for scaling of diffusion model alignment methods.
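To make the objective concrete, below is a minimal PyTorch sketch of how such a per-pair loss could be computed, assuming an ε-prediction latent diffusion model whose denoiser is called as model(x_t, t, cond). The names model, ref_model, cond, and alphas_cumprod are illustrative placeholders rather than the authors' code, and the timestep weighting T·ω(λ_t) is folded into the single coefficient beta.

```python
# Illustrative sketch of a Diffusion-DPO style loss on one batch of preference
# pairs. Assumes epsilon-prediction denoisers; `model`, `ref_model`, `cond`,
# and `alphas_cumprod` are placeholder names, not the paper's implementation.
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(model, ref_model, x0_w, x0_l, cond, alphas_cumprod, beta=5000.0):
    """Preference loss for latents x0_w (preferred) and x0_l (rejected)."""
    b = x0_w.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (b,), device=x0_w.device)   # one shared timestep per pair
    noise_w, noise_l = torch.randn_like(x0_w), torch.randn_like(x0_l)

    # Forward-noise both images to the same timestep (standard DDPM noising).
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    xt_w = a.sqrt() * x0_w + (1 - a).sqrt() * noise_w
    xt_l = a.sqrt() * x0_l + (1 - a).sqrt() * noise_l

    def err(eps_pred, eps):                              # per-sample denoising error
        return (eps_pred - eps).pow(2).mean(dim=(1, 2, 3))

    # Denoising errors under the trainable policy and the frozen reference model.
    err_w = err(model(xt_w, t, cond), noise_w)
    err_l = err(model(xt_l, t, cond), noise_l)
    with torch.no_grad():
        ref_err_w = err(ref_model(xt_w, t, cond), noise_w)
        ref_err_l = err(ref_model(xt_l, t, cond), noise_l)

    # How much better the policy fits the winner vs. the loser, each measured
    # relative to the reference; training pushes this difference negative.
    diff = (err_w - ref_err_w) - (err_l - ref_err_l)
    return -F.logsigmoid(-beta * diff).mean()
```

Both images in a pair share the prompt and the sampled timestep, so the gradient depends only on which image the policy denoises better than the reference does, which is what encodes the human preference.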