Direct Consistency Optimization for Compositional Text-to-Image Personalization

Authors: Kyungmin Lee, Sangkyung Kwak, Kihyuk Sohn, Jinwoo Shin

What

This paper introduces Direct Consistency Optimization (DCO), a novel fine-tuning objective for text-to-image (T2I) diffusion models that improves personalized image generation by maximizing consistency with the reference images while minimizing deviation from the pretrained model.
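
In symbols, this objective can be paraphrased as a regularized maximum-likelihood problem (our notation, not lifted from the paper: p_θ is the fine-tuned model, p_φ the frozen pretrained model, D_ref the reference set, and β a trade-off weight):

```latex
% Paraphrase of the DCO fine-tuning goal: fit the reference images
% while staying close to the pretrained model.
\max_{\theta}\;
  \mathbb{E}_{(x, c) \sim \mathcal{D}_{\mathrm{ref}}}
    \big[\log p_{\theta}(x \mid c)\big]
  \;-\; \beta\, D_{\mathrm{KL}}\!\big(p_{\theta}(\cdot \mid c)\,\big\|\, p_{\phi}(\cdot \mid c)\big)
```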

Why

This paper is important because it addresses the limitations of current personalized T2I models, which often struggle to balance subject consistency with the ability to generate diverse images in different scenarios or styles. DCO offers a more principled approach to fine-tuning, resulting in more compositional and controllable image generation.

How

The authors formulate fine-tuning as a constrained policy optimization problem, encouraging the model to learn minimal information from the reference images while retaining knowledge from the pretrained model. They derive an upper bound on this objective, which yields the DCO loss, a loss as easy to implement as the standard diffusion loss. They also introduce a 'reward guidance' sampling method to control the trade-off between subject fidelity and text-prompt fidelity, and emphasize the importance of using comprehensive captions for the reference images.
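
Below is a minimal PyTorch sketch of what the DCO loss and the reward-guidance step could look like. The function names (`dco_loss`, `reward_guided_eps`) and arguments (`beta_t`, `w_cfg`, `w_rg`) are illustrative rather than the authors' code, and the exact way the paper combines classifier-free guidance with reward guidance may differ from this assumed form.

```python
import torch
import torch.nn.functional as F

def dco_loss(eps_theta, eps_ref, eps, beta_t):
    """Sketch of the DCO training loss.

    eps_theta: noise predicted by the fine-tuned model, shape (B, C, H, W)
    eps_ref:   noise predicted by the frozen pretrained model (no grad)
    eps:       ground-truth noise added to the reference latent
    beta_t:    timestep-dependent temperature, scalar or shape (B,)
    """
    # Per-sample denoising errors of the fine-tuned and pretrained models.
    err_theta = (eps_theta - eps).pow(2).mean(dim=(1, 2, 3))
    err_ref = (eps_ref - eps).pow(2).mean(dim=(1, 2, 3))
    # Encourage the fine-tuned model to fit the reference image better than
    # the pretrained model does, without rewarding deviation beyond that.
    return -F.logsigmoid(-beta_t * (err_theta - err_ref)).mean()


@torch.no_grad()
def reward_guided_eps(eps_theta_c, eps_ref_c, eps_ref_uncond, w_cfg, w_rg):
    """Assumed form of reward-guidance sampling: classifier-free guidance on
    the pretrained model plus an extra term toward the fine-tuned model."""
    # Text-fidelity term from the pretrained model (standard CFG) ...
    eps = eps_ref_uncond + w_cfg * (eps_ref_c - eps_ref_uncond)
    # ... plus a subject-fidelity term toward the fine-tuned model,
    # with w_rg trading off image fidelity against prompt fidelity.
    return eps + w_rg * (eps_theta_c - eps_ref_c)
```

In practice, `eps_ref` would come from an extra forward pass through the frozen pretrained model on the same noised latent, which is the additional training and inference cost mentioned in the limitations below.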

Result

DCO outperforms DreamBooth and its variants on subject and style personalization tasks. Notably, DCO generates images with higher fidelity to both the subject and the input text prompt, as evidenced by quantitative metrics and qualitative examples. It also enables the seamless composition of independently fine-tuned subject and style models without post-hoc merging methods such as ZipLoRA.
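
As a rough illustration of what such a composition could look like (an assumption on our part: treating it as a plain sum of the subject and style LoRA residuals applied to the base weights, i.e., the naive merge that ZipLoRA instead re-optimizes with learned coefficients):

```python
import torch

def merge_loras(w_base, subject_lora, style_lora):
    """Naive LoRA composition sketch: apply both low-rank residuals to the
    base weight. Each LoRA is an (A, B) factor pair whose residual is B @ A."""
    a_s, b_s = subject_lora
    a_t, b_t = style_lora
    return w_base + b_s @ a_s + b_t @ a_t

# Toy usage with random rank-4 factors on a 640x640 weight matrix.
w0 = torch.randn(640, 640)
subject = (torch.randn(4, 640), torch.randn(640, 4))
style = (torch.randn(4, 640), torch.randn(640, 4))
w_merged = merge_loras(w0, subject, style)
```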

Limitations & Future Work

The authors acknowledge the increased computational burden of DCO during both training and inference due to the additional forward passes through the pretrained model. They suggest exploring efficient fine-tuning methods to enhance scalability. Additionally, while cosine similarity was used to assess LoRA compatibility, the authors acknowledge the need for further investigation into metrics that accurately capture interference between LoRA models.

Abstract

Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, can generate visuals with a high degree of consistency. However, they still fall short in synthesizing images of different scenarios or styles that are possible in the original pretrained model. To address this, we propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model. We devise a novel training objective for T2I diffusion models that minimally fine-tunes the pretrained model to achieve consistency. Our method, dubbed Direct Consistency Optimization, is as simple as the regular diffusion loss, while significantly enhancing the compositionality of personalized T2I models. Our approach also induces a new sampling method that controls the trade-off between image fidelity and prompt fidelity. Lastly, we emphasize the necessity of using a comprehensive caption for reference images to further enhance image-text alignment. We show the efficacy of the proposed method on T2I personalization for subject, style, or both. In particular, our method results in a superior Pareto frontier to the baselines. Generated examples and code are available on our project page (https://dco-t2i.github.io/).