ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models

Authors: Fei Kong, Jinhao Duan, Lichao Sun, Hao Cheng, Renjing Xu, Hengtao Shen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu

What

This paper introduces Adversarial Consistency Training (ACT), a method that improves single-step image generation in consistency training by incorporating a discriminator. The result is faster sampling, lower resource requirements, and better generation quality than standard consistency training.

Why

The paper addresses the limitations of diffusion models, particularly slow generation speeds due to iterative denoising. While consistency training offers faster sampling, it often compromises generation quality. This research is important because it presents a more efficient and effective approach to single-step image generation using adversarial training within the consistency model framework.

How

The authors first show theoretically that optimizing the consistency training loss minimizes the Wasserstein distance between the generated and target distributions, and that the upper bound on this distance accumulates consistency losses from earlier timesteps, which is why large batch sizes are needed to keep the accumulated error small. To overcome this, they incorporate a discriminator that directly minimizes the Jensen-Shannon divergence between the two distributions at each timestep, as in GANs. This is intended to improve both training efficiency and generation quality. The authors run experiments on the CIFAR10, ImageNet 64x64, and LSUN Cat 256x256 datasets, comparing ACT with existing methods, and perform ablation studies to analyze how individual components and hyperparameters affect performance.
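The combined objective can be sketched on scalar toy "networks". All names below (`consistency_loss`, `adversarial_loss`, `act_loss`, `lam`) are illustrative, not the authors' implementation: the student is pulled toward the EMA teacher's output at the adjacent timestep (standard consistency training), while a non-saturating GAN generator term pushes the one-step generation toward the data distribution as judged by the discriminator.

```python
import math

def consistency_loss(f_student, f_teacher, x_next, x_curr, t_next, t_curr):
    # Squared distance between the student's prediction at t_{n+1} and the
    # EMA teacher's prediction at t_n; the teacher output acts as a fixed
    # (stop-gradient) target, as in standard consistency training.
    return (f_student(x_next, t_next) - f_teacher(x_curr, t_curr)) ** 2

def adversarial_loss(discriminator, x_generated):
    # Non-saturating GAN generator loss: -log D(x_gen), where D returns the
    # probability that its input is a real sample.
    return -math.log(max(discriminator(x_generated), 1e-12))

def act_loss(f_student, f_teacher, discriminator,
             x_next, x_curr, t_next, t_curr, lam=1.0):
    # ACT-style combined objective on a single sample:
    # consistency term + lambda-weighted adversarial term.
    generated = f_student(x_next, t_next)
    return (consistency_loss(f_student, f_teacher,
                             x_next, x_curr, t_next, t_curr)
            + lam * adversarial_loss(discriminator, generated))
```

For instance, with toy maps `f_student(x, t) = 0.9 * x`, `f_teacher(x, t) = x`, and a constant discriminator output of 0.5, the total loss at `x = 1.0` is `(0.9 - 1.0)**2 + (-log 0.5)`, i.e. the consistency term plus log 2 from the adversarial term. In practice both terms operate on image tensors and the discriminator is trained jointly, which this scalar sketch omits.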

Result

The proposed ACT method achieves better (lower) FID scores than standard consistency training on all tested datasets while substantially reducing batch size, model parameters, and training steps. It reaches an FID of 6.0 on CIFAR10 with a batch size of 80, outperforming consistency training with a batch size of 512 (FID 8.7). Similar improvements are observed on ImageNet and LSUN Cat, highlighting ACT's effectiveness and efficiency.

Limitations & Future Work

The authors acknowledge the need for further exploration of the interaction between consistency training loss and adversarial loss for optimizing ACT. They also suggest exploring alternative distance metrics beyond Jensen-Shannon divergence to minimize the gap between distributions. Future research could focus on these aspects to further enhance the performance and stability of ACT.

Abstract

Though diffusion models excel at image generation, their step-by-step denoising makes sampling slow. Consistency training addresses this with single-step sampling but often produces lower-quality generations and incurs high training costs. In this paper, we show that optimizing the consistency training loss minimizes the Wasserstein distance between the target and generated distributions. As the timestep increases, the upper bound accumulates previous consistency training losses, so larger batch sizes are needed to reduce both the current and accumulated losses. We propose Adversarial Consistency Training (ACT), which directly minimizes the Jensen-Shannon (JS) divergence between the distributions at each timestep using a discriminator. Theoretically, ACT improves both generation quality and convergence. By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on the CIFAR10, ImageNet 64x64, and LSUN Cat 256x256 datasets, retains zero-shot image inpainting capabilities, and uses less than 1/6 of the original batch size and fewer than 1/2 of the model parameters and training steps compared to the baseline method, leading to a substantial reduction in resource consumption. Our code is available at https://github.com/kong13661/ACT
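The claim that a discriminator "directly minimizes the JS divergence" rests on the classical GAN result (Goodfellow et al., 2014), restated here for reference:

```latex
% For a fixed generator with density p_g, the optimal discriminator is
D^*(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)},
% and substituting D^* into the GAN value function gives
V(D^*, G) = -\log 4 + 2\,\mathrm{JSD}\!\left(p_{\mathrm{data}} \,\|\, p_g\right),
% so minimizing the generator objective under an (approximately) optimal
% discriminator minimizes the JS divergence between the two distributions.
```

In ACT this argument is applied per timestep, with the discriminator comparing the model's one-step generations against noised data at that timestep.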