Reward Guided Latent Consistency Distillation
Authors: Jiachen Li, Weixi Feng, Wenhu Chen, William Yang Wang
What
This paper introduces Reward Guided Latent Consistency Distillation (RG-LCD), a method for enhancing the efficiency and quality of text-to-image synthesis by incorporating feedback from a reward model (RM) into the Latent Consistency Distillation (LCD) process.
Why
This paper is important because it addresses a key limitation of current Latent Consistency Models (LCMs) for text-to-image synthesis: their fast inference comes at the cost of sample quality. By integrating human preference feedback through RMs, RG-LCD improves the quality of LCM-generated images without sacrificing inference speed.
How
The authors propose RG-LCD, which augments the original LCD loss with a reward-maximization objective, feeding the gradient of a differentiable RM back into the LCD process. To avoid reward over-optimization, they introduce a latent proxy RM (LRM) that sits between the LCM and the expert RM: the LCM maximizes the LRM's reward while the LRM is trained to match the expert RM, so the expert reward is optimized only indirectly, and even non-differentiable RMs can be used. They conduct experiments with several RMs (CLIPScore, HPSv2.1, PickScore, ImageReward) and evaluate the generated images through human evaluation and automatic metrics such as the HPSv2.1 score and FID.
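To make the objective concrete, here is a minimal PyTorch-style sketch of both variants: the directly reward-augmented LCD loss and the version mediated by a latent proxy RM. This is a sketch under assumed interfaces; the function and argument names (student_lcm, ema_lcm, teacher_solver, decode, reward_model, expert_rm, latent_rm, beta_reward) are illustrative, not the paper's implementation, and the LRM fitting term is written as a simple regression to the expert RM's scores for brevity.

```python
# Illustrative sketch only; names are placeholders, not the authors' code.
import torch
import torch.nn.functional as F

def rg_lcd_loss(student_lcm, ema_lcm, teacher_solver, decode, reward_model,
                z_t, t, t_prev, prompt_emb, beta_reward=1.0):
    """LCD consistency loss augmented with a reward-maximization term."""
    # Standard LCD: the student's prediction at (z_t, t) should match the EMA
    # target evaluated at the teacher solver's estimate one step earlier.
    pred = student_lcm(z_t, t, prompt_emb)
    with torch.no_grad():
        z_prev = teacher_solver(z_t, t, t_prev, prompt_emb)  # one solver step
        target = ema_lcm(z_prev, t_prev, prompt_emb)
    lcd_loss = F.mse_loss(pred, target)

    # Direct reward guidance: decode the single-step generation and score it
    # with a frozen, differentiable RM; maximizing the reward means minimizing
    # its negative. Optimizing this term directly can over-optimize the RM.
    reward = reward_model(decode(pred), prompt_emb).mean()
    return lcd_loss - beta_reward * reward

def rg_lcd_lrm_loss(student_lcm, ema_lcm, teacher_solver, decode,
                    expert_rm, latent_rm,
                    z_t, t, t_prev, prompt_emb, beta_reward=1.0):
    """Variant with a latent proxy RM (LRM) mediating the expert RM."""
    pred = student_lcm(z_t, t, prompt_emb)
    with torch.no_grad():
        z_prev = teacher_solver(z_t, t, t_prev, prompt_emb)
        target = ema_lcm(z_prev, t_prev, prompt_emb)
    lcd_loss = F.mse_loss(pred, target)

    # The LCM only sees the LRM, which scores latents directly.
    lrm_reward = latent_rm(pred, prompt_emb).mean()

    # The LRM is fit to the expert RM's scores of the decoded image; the
    # expert RM sits inside no_grad, so it need not be differentiable.
    # In practice the reward term and the fit term would update disjoint
    # parameter sets (the LCM and the LRM, respectively).
    with torch.no_grad():
        expert_score = expert_rm(decode(pred), prompt_emb)
    lrm_fit_loss = F.mse_loss(latent_rm(pred.detach(), prompt_emb), expert_score)

    return lcd_loss - beta_reward * lrm_reward + lrm_fit_loss
```

Because the expert RM appears only inside no_grad, the LCM's gradient flows entirely through the latent proxy, which is what allows RG-LCD to learn from non-differentiable RMs.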
Result
Human evaluation shows that the 2-step generations from RG-LCM (HPS) are preferred over the 50-step DDIM generations from the teacher LDM, a 25x inference speedup with no perceived loss in quality. RG-LCM (CLIP), despite using an RM not trained on human preferences, also outperforms the teacher LDM at 4-step generation. Using an LRM effectively mitigates reward over-optimization, yielding more visually appealing images and removing the high-frequency noise that appears when directly optimizing certain RMs such as ImageReward. Interestingly, the results also reveal discrepancies between human preferences and automatic metrics: scores like HPSv2.1 may not fully capture human preferences, in particular because images are resized before scoring, which hides high-frequency noise.
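As a toy illustration of the resizing point (not an experiment from the paper), downsampling to a typical RM input resolution strongly attenuates independent per-pixel noise, so a noisy image and its clean counterpart look almost identical to a resize-based metric. All sizes and the noise level below are arbitrary choices for illustration.

```python
# Toy demo: high-frequency noise largely disappears after bicubic resizing.
import torch
import torch.nn.functional as F

clean = torch.rand(1, 3, 1024, 1024)
noisy = (clean + 0.1 * torch.randn_like(clean)).clamp(0, 1)  # add per-pixel noise

def resize(x, size=224):
    # Downsample to the kind of resolution an RM typically scores at.
    return F.interpolate(x, size=(size, size), mode="bicubic", align_corners=False)

print("full-res RMS gap:", (noisy - clean).pow(2).mean().sqrt().item())
print("resized  RMS gap:", (resize(noisy) - resize(clean)).pow(2).mean().sqrt().item())
```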
Limitations and Future Work
The authors acknowledge the limitations of existing automatic metrics for evaluating image quality and call for more robust metrics that do not rely on image resizing during evaluation. They also suggest learning human preferences directly in the latent space via an LRM as a potential solution. Future work could investigate alternative LRM architectures, explore other reward models and datasets, and apply RG-LCD to generative modeling tasks beyond text-to-image synthesis.
Abstract
Latent Consistency Distillation (LCD) has emerged as a promising paradigm for efficient text-to-image synthesis. By distilling a latent consistency model (LCM) from a pre-trained teacher latent diffusion model (LDM), LCD facilitates the generation of high-fidelity images within merely 2 to 4 inference steps. However, the LCM’s efficient inference is obtained at the cost of the sample quality. In this paper, we propose compensating the quality loss by aligning LCM’s output with human preference during training. Specifically, we introduce Reward Guided LCD (RG-LCD), which integrates feedback from a reward model (RM) into the LCD process by augmenting the original LCD loss with the objective of maximizing the reward associated with LCM’s single-step generation. As validated through human evaluation, when trained with the feedback of a good RM, the 2-step generations from our RG-LCM are favored by humans over the 50-step DDIM samples from the teacher LDM, representing a 25 times inference acceleration without quality loss. As directly optimizing towards differentiable RMs can suffer from over-optimization, we overcome this difficulty by proposing the use of a latent proxy RM (LRM). This novel component serves as an intermediary, connecting our LCM with the RM. Empirically, we demonstrate that incorporating the LRM into our RG-LCD successfully avoids high-frequency noise in the generated images, contributing to both improved FID on MS-COCO and a higher HPSv2.1 score on HPSv2’s test set, surpassing those achieved by the baseline LCM.