Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models

Authors: Nikita Starodubcev, Artem Fedorov, Artem Babenko, Dmitry Baranchuk

What

This paper proposes an adaptive collaboration framework for text-to-image synthesis in which a distilled student diffusion model produces samples and its teacher diffusion model selectively improves them.

Why

The paper addresses a key limitation of existing distillation methods for diffusion models: faster inference usually comes at the cost of image quality. The proposed approach combines the efficiency of distilled models with the high fidelity of teacher models by spending the teacher's compute only where it is actually needed.

How

The authors first analyze distilled text-to-image models and observe that a significant portion of student samples are actually superior to the corresponding teacher samples. Based on this, they propose an adaptive pipeline: the student model generates an initial sample, and an oracle, implemented with an image quality estimator (ImageReward), decides whether the sample should be improved further by the teacher model. The decision compares the oracle score against a learned threshold. If the sample is rejected, the improvement step either regenerates it from scratch with the teacher or refines the student's output; a code sketch of this pipeline follows below.
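The following is a minimal sketch of the control flow, not the authors' implementation. It makes several assumptions: a public LCM checkpoint (SimianLuo/LCM_Dreamshaper_v7) stands in for the distilled student, Stable Diffusion v1.5 img2img plays the teacher refiner, and the threshold `tau` is an illustrative constant rather than the learned value used in the paper.

```python
# A minimal sketch of the adaptive student-teacher pipeline (assumptions
# noted above; tau and strength are illustrative, not the paper's values).
import torch
import ImageReward as RM
from diffusers import DiffusionPipeline, StableDiffusionImg2ImgPipeline

device = "cuda"

# Few-step distilled student (latent consistency model as a stand-in).
student = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float16
).to(device)

# Teacher used to refine rejected student samples (SDEdit-style img2img).
teacher = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)

# Oracle: a learned image-quality estimator.
oracle = RM.load("ImageReward-v1.0")

def adaptive_generate(prompt: str, tau: float = 0.3, strength: float = 0.5):
    # 1) Cheap initial sample from the distilled student (a few steps).
    image = student(
        prompt, num_inference_steps=4, guidance_scale=8.0,
        height=512, width=512,  # match the teacher's native resolution
    ).images[0]

    # 2) Oracle decision: accept the student sample if its score clears
    #    the threshold (learned in the paper; 0.3 is a placeholder here).
    if oracle.score(prompt, image) >= tau:
        return image

    # 3) Otherwise refine with the teacher: re-noise the student sample to
    #    an intermediate timestep (controlled by `strength`) and denoise
    #    with the full teacher model.
    return teacher(prompt=prompt, image=image, strength=strength).images[0]

image = adaptive_generate("a photo of an astronaut riding a horse on Mars")
```

Setting `strength` close to 1.0 re-noises the sample almost completely, which recovers the paper's alternative strategy of regenerating from scratch with the teacher.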

Result

The proposed adaptive collaboration framework outperforms existing text-to-image baselines in terms of both human preference and automated metrics (FID, CLIP score, ImageReward) under various inference budgets. The method achieves a 2.5x to 5x speedup compared to standard diffusion models while maintaining or even surpassing their quality. Furthermore, the approach is successfully applied to text-guided image editing and controllable generation tasks, demonstrating its versatility and potential for broader applications.
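Because every rejected sample invokes the slow teacher, the threshold directly controls the average inference cost. As an illustrative calibration rule (an assumption, not necessarily the paper's exact procedure), `tau` can be chosen as a quantile of oracle scores on held-out prompts so that a target fraction of samples is routed to the teacher:

```python
import numpy as np

def calibrate_threshold(val_scores, teacher_fraction):
    """Pick tau so that roughly `teacher_fraction` of held-out student
    samples score below it and are routed to the teacher.

    `val_scores`: oracle (e.g. ImageReward) scores of student samples on
    a validation prompt set. This quantile rule is an illustrative
    assumption, not the paper's exact calibration procedure."""
    return float(np.quantile(val_scores, teacher_fraction))

# Example: target a budget where roughly 30% of prompts invoke the teacher.
tau = calibrate_threshold([0.1, 0.4, -0.2, 0.8, 0.5], teacher_fraction=0.3)
```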

Limitations and Future Work

The authors acknowledge that current automated image quality estimators are a potential bottleneck for their approach; future work could develop estimators that correlate better with human preferences. They also suggest investigating whether other fast text-to-image generators, such as GANs, could play the student's role within the adaptive framework.

Abstract

Knowledge distillation methods have recently been shown to be a promising direction for speeding up the synthesis of large-scale diffusion models by requiring only a few inference steps. While several powerful distillation methods were recently proposed, the overall quality of student samples is typically lower than that of the teacher, which hinders their practical usage. In this work, we investigate the relative quality of samples produced by the teacher text-to-image diffusion model and its distilled student version. As our main empirical finding, we discover that a noticeable portion of student samples exhibit superior fidelity compared to the teacher ones, despite the "approximate" nature of the student. Based on this finding, we propose an adaptive collaboration between student and teacher diffusion models for effective text-to-image synthesis. Specifically, the distilled model produces the initial sample, and then an oracle decides whether it needs further improvement with the slow teacher model. Extensive experiments demonstrate that the designed pipeline surpasses state-of-the-art text-to-image alternatives for various inference budgets in terms of human preference. Furthermore, the proposed approach can be naturally used in popular applications such as text-guided image editing and controllable generation.