StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models

Authors: Zhizhong Wang, Lei Zhao, Wei Xing

What

This paper presents StyleDiffusion, a novel content-style disentangled framework for artistic style transfer that leverages diffusion models for explicit content extraction and implicit style learning, enabling interpretable and controllable style transfer.

Why

This paper is significant because it addresses the limitations of existing style transfer methods, which rely on either explicit style definitions (e.g., the Gram matrix) or implicit learning (e.g., GANs) and therefore often produce entangled representations. The proposed method achieves superior style transfer with better content preservation, finer style details, and flexible control over the disentanglement.

How

The authors introduce a diffusion-based style removal module to extract domain-aligned content information and a diffusion-based style transfer module to learn disentangled style from a single style image. A CLIP-based style disentanglement loss, combined with a style reconstruction prior, is used to guide the learning process in the CLIP image space.
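A minimal sketch of this pipeline is given below, assuming a pretrained diffusion model exposed through hypothetical add_noise/denoise methods and OpenAI's clip package; the loss terms follow the general idea described above (CLIP-space style disentanglement plus a style reconstruction prior), not the authors' exact formulation or API.

```python
# Sketch of StyleDiffusion's two stages and losses (hypothetical interfaces).
# Assumes a pretrained diffusion model `ddim` exposing `add_noise(x, t)`
# (forward diffusion) and `denoise(x_t, t)` (deterministic DDIM reversal),
# and OpenAI's `clip` package; images are assumed already CLIP-preprocessed.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def clip_embed(img):
    """L2-normalized embedding of an image batch in CLIP image space."""
    feat = clip_model.encode_image(img)
    return feat / feat.norm(dim=-1, keepdim=True)

def style_removal(img, ddim, t_remove):
    """Stage 1 (style removal): diffuse the image for `t_remove` steps and
    deterministically reverse the process, keeping coarse content while
    washing out style details (larger t_remove removes more style)."""
    x_t = ddim.add_noise(img, t_remove)
    return ddim.denoise(x_t, t_remove)

def style_disentanglement_loss(output, content_domain, style_img, style_domain):
    """CLIP-space style disentanglement loss (sketch): the direction from the
    style-removed content image to the stylized output should align with the
    direction from the style-removed style image to the style image itself."""
    d_out = clip_embed(output) - clip_embed(content_domain)
    d_style = clip_embed(style_img) - clip_embed(style_domain)
    return 1.0 - F.cosine_similarity(d_out, d_style, dim=-1).mean()

def style_reconstruction_prior(output_from_style_domain, style_img):
    """Reconstruction prior (sketch): transferring the style image's own
    style-removed version should reproduce the style image."""
    return F.l1_loss(output_from_style_domain, style_img)
```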

Result

StyleDiffusion demonstrates impressive qualitative and quantitative results, outperforming state-of-the-art methods in content preservation (SSIM), style similarity (CLIP score), and user preference. The framework offers flexible control over the content-style disentanglement and trade-off at both training and testing time by adjusting the diffusion model's parameters. It also shows potential for extensions such as photo-realistic style transfer, multi-modal style manipulation, and diversified style transfer.
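One way to picture the test-time trade-off control is to sweep the style-removal strength. The sketch below reuses the hypothetical style_removal helper from above, together with an assumed per-style fine-tuned transfer module and illustrative step counts; it is not the authors' interface.

```python
from torchvision.utils import save_image

def sweep_tradeoff(content_img, ddim, transfer, t_values=(200, 400, 600)):
    """Vary the style-removal strength at test time (step counts illustrative):
    smaller t keeps more content, larger t yields stronger stylization."""
    for t in t_values:
        content_domain = style_removal(content_img, ddim, t)   # from the sketch above
        stylized = transfer(content_domain)                    # diffusion transfer module, fine-tuned per style
        save_image(stylized.clamp(0, 1), f"stylized_t{t}.png")
```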

Limitations & Future Work

Limitations include the need to fine-tune the model for each style, relatively slow inference due to iterative diffusion sampling, and failure cases such as vanishing salient content or biased color distributions. Future work includes exploring arbitrary style transfer, accelerating diffusion sampling, and addressing the identified failure cases. Applying the framework to other image translation and manipulation tasks is another potential direction.

Abstract

Content and style (C-S) disentanglement is a fundamental problem and critical challenge of style transfer. Existing approaches based on explicit definitions (e.g., Gram matrix) or implicit learning (e.g., GANs) are neither interpretable nor easy to control, resulting in entangled representations and less satisfying results. In this paper, we propose a new C-S disentangled framework for style transfer without using previous assumptions. The key insight is to explicitly extract the content information and implicitly learn the complementary style information, yielding interpretable and controllable C-S disentanglement and style transfer. A simple yet effective CLIP-based style disentanglement loss coordinated with a style reconstruction prior is introduced to disentangle C-S in the CLIP image space. By further leveraging the powerful style removal and generative ability of diffusion models, our framework achieves superior results to the state of the art, along with flexible C-S disentanglement and trade-off control. Our work provides new insights into the C-S disentanglement in style transfer and demonstrates the potential of diffusion models for learning well-disentangled C-S characteristics.