Stylus: Automatic Adapter Selection for Diffusion Models

Authors: Michael Luo, Justin Wong, Brandon Trabucco, Yanping Huang, Joseph E. Gonzalez, Zhifeng Chen, Ruslan Salakhutdinov, Ion Stoica

What

This paper introduces Stylus, a system that automatically selects and composes fine-tuned adapters for Stable Diffusion models based on the user's prompt, improving image quality and diversity.

Why

This paper addresses the challenge of leveraging the vast and growing number of publicly available adapters for Stable Diffusion, which are often poorly documented and require manual selection. Stylus automates this process: it identifies and combines relevant adapters from the prompt alone, yielding improvements in visual fidelity, textual alignment, and image diversity.

How

Stylus uses a three-stage framework:

1. Refiner: employs a vision-language model (VLM) to process each adapter's model card and generate an improved textual description and embedding for it.
2. Retriever: retrieves candidate adapters relevant to the user prompt by computing cosine-similarity scores between the prompt embedding and the adapter embeddings.
3. Composer: segments the prompt into keywords representing distinct tasks and assigns relevant adapters to each task using a long-context LLM, filtering out irrelevant adapters.

Finally, a masking strategy preserves diversity by applying different adapter combinations to a single prompt; a sketch of the retrieval and masking steps follows.
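Two of these mechanisms are easy to make concrete: the Retriever is an embedding similarity search, and the masking strategy samples different adapter subsets per image. The Python sketch below illustrates both under stated assumptions: toy_embed is a hypothetical stand-in for a real text encoder (Stylus precomputes adapter embeddings in StylusDocs), the adapter names and descriptions are invented, and the random per-image masking is one plausible reading of the paper's masking strategy, not its exact scheme.

    import hashlib
    import numpy as np

    def toy_embed(text: str, dim: int = 64) -> np.ndarray:
        # Deterministic stand-in for a real text encoder: hash the text to
        # seed a random unit vector. Replace with an actual embedding model.
        seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
        v = np.random.default_rng(seed).standard_normal(dim)
        return v / np.linalg.norm(v)

    def retrieve(prompt: str, adapters: dict[str, np.ndarray], top_k: int = 2) -> list[str]:
        # Retriever stage: rank adapters by cosine similarity between the
        # prompt embedding and each (unit-norm) adapter embedding.
        q = toy_embed(prompt)
        scores = {name: float(emb @ q) for name, emb in adapters.items()}
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

    def mask_combinations(selected: list[str], n_images: int,
                          keep_prob: float = 0.7, seed: int = 0) -> list[list[str]]:
        # One plausible masking scheme (an assumption, not the paper's exact
        # algorithm): per image, keep each adapter with probability keep_prob,
        # so different images in a batch use different adapter combinations.
        rng = np.random.default_rng(seed)
        return [[a for a in selected if rng.random() < keep_prob]
                for _ in range(n_images)]

    if __name__ == "__main__":
        descriptions = {  # invented adapter names/descriptions for illustration
            "lora-neon-city": "cyberpunk neon cityscapes at night",
            "lora-rainy-street": "rain-soaked streets with wet reflections",
            "lora-watercolor": "soft watercolor painting style",
        }
        adapters = {name: toy_embed(desc) for name, desc in descriptions.items()}
        selected = retrieve("a neon-lit street in the rain", adapters)
        print("retrieved:", selected)
        print("per-image adapter combinations:", mask_combinations(selected, 3))

The Composer is deliberately not sketched: per the paper it is a long-context LLM call that segments the prompt into tasks and assigns adapters per task, which does not reduce to a few lines of deterministic code.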

Result

Stylus demonstrates significant improvements over baseline Stable Diffusion models and alternative retrieval methods. Key results:

- Achieves a 2:1 preference score over baseline models in human evaluations.
- Demonstrates better CLIP/FID Pareto efficiency, i.e., higher textual alignment (CLIP) at equal or lower FID (visual fidelity).
- Generates more diverse images per prompt, as measured quantitatively (dFID) and by VLM-based assessments.
- Remains effective for image-to-image tasks, including image translation and inpainting.

Limitations & Future Work

The paper acknowledges several limitations:

- Task blocking: the Composer may not fully prevent adapters from overriding existing concepts in the prompt.
- Task diversity: merging adapters may reduce diversity when generating instances of a single task.
- Low-quality adapters: blacklisting low-quality adapters is challenging, and some may still be selected.
- Retrieval errors: the Refiner and Composer may introduce errors, leading to suboptimal adapter choices.

Suggested future work includes:

- Developing more robust solutions to task blocking and reduced task diversity.
- Improving the accuracy and efficiency of the Refiner and Composer components.
- Investigating alternative masking schemes for enhanced diversity.

Abstract

Beyond scaling base models with more data or parameters, fine-tuned adapters provide an alternative way to generate high-fidelity, custom images at reduced costs. As such, adapters have been widely adopted by open-source communities, accumulating a database of over 100K adapters, most of which are highly customized with insufficient descriptions. This paper explores the problem of matching the prompt to a set of relevant adapters, built on recent work that highlights the performance gains of composing adapters. We introduce Stylus, which efficiently selects and automatically composes task-specific adapters based on a prompt's keywords. Stylus outlines a three-stage approach that first summarizes adapters with improved descriptions and embeddings, retrieves relevant adapters, and then further assembles adapters based on the prompt's keywords by checking how well they fit the prompt. To evaluate Stylus, we developed StylusDocs, a curated dataset featuring 75K adapters with pre-computed adapter embeddings. In our evaluation on popular Stable Diffusion checkpoints, Stylus achieves greater CLIP-FID Pareto efficiency and is twice as preferred, with humans and multimodal models as evaluators, over the base model. See stylus-diffusion.github.io for more.