Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding

Authors: Jianxiang Lu, Cong Xie, Hui Guo

What

This paper presents a novel object-driven one-shot fine-tuning method for text-to-image diffusion models, enabling the generation of diverse images with specific objects from a single input image and region of interest.

Why

This paper is significant because it addresses the challenges of limited data and object fidelity in personalized text-to-image generation. It allows for efficient object implantation and diverse image synthesis with high fidelity using only one reference image, advancing the field of content creation.

How

The authors leverage a prototypical embedding for initialization, class-characterizing regularization to preserve class diversity, and an object-specific loss function to enhance fidelity. They fine-tune a pre-trained Stable Diffusion model using a single image and its object mask, and compare their method with existing techniques through qualitative and quantitative evaluations.
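
As a rough illustration of what an object-specific loss could look like in practice, the sketch below weights the standard latent-diffusion denoising loss by the object mask (downsampled to the latent resolution) so that reconstruction focuses on the region of interest. The function name, the background weight, and the exact weighting scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def object_specific_loss(noise_pred, noise_target, object_mask, bg_weight=0.0):
    """Mask-weighted denoising loss (illustrative sketch, not the paper's formulation).

    noise_pred, noise_target: (B, C, h, w) predicted / ground-truth noise in latent space.
    object_mask:              (B, 1, H, W) binary object mask in pixel space.
    bg_weight:                contribution of the background region (0.0 = object-only).
    """
    # Bring the pixel-space mask down to the latent resolution of the U-Net output.
    mask = F.interpolate(object_mask.float(), size=noise_pred.shape[-2:], mode="nearest")

    # Per-location squared error, averaged over channels.
    per_pixel = (noise_pred - noise_target).pow(2).mean(dim=1, keepdim=True)

    # Emphasize the object region; optionally keep a small background term.
    weights = mask + bg_weight * (1.0 - mask)
    return (weights * per_pixel).sum() / weights.sum().clamp(min=1e-6)
```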

Result

The proposed method outperforms existing one-shot fine-tuning methods in terms of both object fidelity and generalization ability. It effectively mitigates overfitting and allows for the generation of diverse images with the target object while maintaining consistency with text prompts. The method also demonstrates success in multi-object implantation, enabling the creation of compositions with user-specified objects.

Limitations and Future Work

The authors acknowledge limitations in handling objects with complex edges, which can lead to degraded image quality. They also point out that smaller objects may have reduced fidelity in the generated images. Future work will focus on improving mask acquisition methods and incorporating multi-scale perception mechanisms for objects to address these limitations.

Abstract

As large-scale text-to-image generation models have made remarkable progress, many fine-tuning methods have been proposed for them. However, these models often struggle with novel objects, especially in one-shot scenarios. Our proposed method aims to address the challenges of generalizability and fidelity in an object-driven way, using only a single input image and the object-specific regions of interest. To improve generalizability and mitigate overfitting, our paradigm initializes a prototypical embedding based on the object's appearance and its class before fine-tuning the diffusion model. During fine-tuning, we propose a class-characterizing regularization to preserve prior knowledge of the object class. To further improve fidelity, we introduce an object-specific loss, which can also be used to implant multiple objects. Overall, our object-driven method implants new objects that integrate seamlessly with existing concepts, with high fidelity and generalization. Our method outperforms several existing works. The code will be released.
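
For intuition on the initialization step, one plausible reading (an assumption here, not the paper's exact recipe) is that the new token's embedding is seeded with a prototype combining the text-encoder embedding of the object's class word with image features of the masked object, so fine-tuning starts near the class prior instead of from a random vector:

```python
import torch

def prototypical_init(class_token_emb, object_image_feats, alpha=0.5):
    """Illustrative prototypical embedding: a blend of the class word embedding and
    averaged image features of the masked object. The blend ratio and the assumption
    that the image features are already projected into the text-embedding space are
    hypothetical choices made only for this sketch.

    class_token_emb:    (D,)   text-encoder embedding of the class word, e.g. "dog".
    object_image_feats: (N, D) image features extracted from the object region.
    """
    image_prototype = object_image_feats.mean(dim=0)              # average object feature
    proto = alpha * class_token_emb + (1.0 - alpha) * image_prototype
    return proto / proto.norm() * class_token_emb.norm()          # keep the class embedding's scale
```

Under this reading, the class-characterizing regularization could be as simple as a penalty on the distance (or cosine dissimilarity) between the learned embedding and class_token_emb during fine-tuning, consistent with the stated goal of preserving class-level diversity; the paper defines its own form.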