LCM-Lookahead for Encoder-based Text-to-Image Personalization
Authors: Rinon Gal, Or Lichter, Elad Richardson, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or
What
This paper introduces LCM-Lookahead, a novel approach for enhancing encoder-based text-to-image personalization, with a focus on improving identity preservation and prompt alignment in generated facial images.
Why
Existing encoder-based personalization methods often struggle to maintain identity fidelity and to follow prompts, particularly for stylized images. The paper addresses these limitations with a novel training scheme and a shortcut mechanism that incorporates image-space losses during training.
How
The authors leverage a fast-sampling Latent Consistency Model (LCM) as a ‘shortcut’ to preview the final denoised image during training. This preview is used to calculate an identity loss, providing a better training signal for identity preservation. They also introduce an attention-sharing mechanism to transfer visual features from the conditioning image and generate a consistent synthetic dataset using SDXL-Turbo to improve prompt alignment.
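To make the shortcut mechanism concrete, below is a minimal PyTorch-style sketch of the lookahead identity loss. The callables `lcm_preview`, `decode`, and `face_embed`, and all parameter names, are illustrative placeholders rather than the paper's actual implementation.

```python
import torch.nn.functional as F


def lookahead_identity_loss(
    noisy_latent,    # x_t, the noisy latent at the current training step
    timestep,        # diffusion timestep t
    cond,            # identity conditioning produced by the encoder (carries gradients)
    ref_embedding,   # face-recognition embedding of the input identity (precomputed)
    lcm_preview,     # callable (x_t, t, cond) -> approximate clean latent, via the fast-sampling model
    decode,          # callable latent -> image in pixel space (VAE decoder)
    face_embed,      # frozen face-recognition network: image -> identity embedding
):
    # Shortcut: jump from the current noisy latent to a preview of the final
    # denoised result in a single fast-sampling step, keeping the computation
    # graph so the loss can backpropagate into the encoder's conditioning.
    x0_preview = lcm_preview(noisy_latent, timestep, cond)
    image_preview = decode(x0_preview)

    # Image-space identity loss: cosine distance between the previewed face's
    # embedding and the reference identity embedding.
    id_embedding = face_embed(image_preview)
    return 1.0 - F.cosine_similarity(id_embedding, ref_embedding, dim=-1).mean()
```

This term is added alongside the standard denoising objective; because the preview is differentiable, gradients from the image-space identity loss flow back into the personalization encoder.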
Result
The proposed method outperforms existing state-of-the-art encoder-based methods in preserving facial identity and aligning with textual prompts, even in stylized images. Quantitative and qualitative evaluations, including a user study, confirm the effectiveness of the approach.
Limitations & Future Work
The authors acknowledge limitations in handling out-of-domain images and potential biases inherited from the backbone model. Future work includes applying optimization-based methods on top of their approach to further improve quality, as well as addressing the ethical concerns associated with facial editing technology.
Abstract
Recent advancements in diffusion models have introduced fast sampling methods that can effectively produce high-quality images in just one or a few denoising steps. Interestingly, when these are distilled from existing diffusion models, they often maintain alignment with the original model, retaining similar outputs for similar prompts and seeds. These properties present opportunities to leverage fast sampling methods as a shortcut-mechanism, using them to create a preview of denoised outputs through which we can backpropagate image-space losses. In this work, we explore the potential of using such shortcut-mechanisms to guide the personalization of text-to-image models to specific facial identities. We focus on encoder-based personalization approaches, and demonstrate that by tuning them with a lookahead identity loss, we can achieve higher identity fidelity, without sacrificing layout diversity or prompt alignment. We further explore the use of attention sharing mechanisms and consistent data generation for the task of personalization, and find that encoder training can benefit from both.
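As a rough illustration of the attention sharing mentioned in the abstract and the method summary, the single-head sketch below extends a self-attention block so that the generated image's queries also attend to keys and values computed from the conditioning image's features. The shapes, projection names, and single-head simplification are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F


def shared_self_attention(x_gen, x_ref, to_q, to_k, to_v):
    """Self-attention over the generated stream, extended with reference features.

    x_gen: (B, N_gen, C) tokens of the image being generated
    x_ref: (B, N_ref, C) tokens of the conditioning (reference) image
    to_q / to_k / to_v: the attention block's existing linear projections
    """
    q = to_q(x_gen)
    # Keys and values are drawn from both streams, so visual features of the
    # reference face can be transferred directly into the generated image.
    kv = torch.cat([x_gen, x_ref], dim=1)
    k, v = to_k(kv), to_v(kv)
    return F.scaled_dot_product_attention(q, k, v)


# Toy usage with random features and fresh linear projections.
if __name__ == "__main__":
    B, N, C = 1, 64, 320
    to_q, to_k, to_v = (torch.nn.Linear(C, C) for _ in range(3))
    out = shared_self_attention(torch.randn(B, N, C), torch.randn(B, N, C), to_q, to_k, to_v)
    print(out.shape)  # torch.Size([1, 64, 320])
```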