DiffHarmony: Latent Diffusion Model Meets Image Harmonization

Authors: Pengfei Zhou, Fangxiang Feng, Xiaojie Wang

What

This paper introduces DiffHarmony, a novel image harmonization method that leverages a pre-trained latent diffusion model (Stable Diffusion) to generate harmonious images. To mitigate the image distortion caused by the compression inherent in latent diffusion models, the method is enhanced with higher-resolution inference and an additional refinement stage.

Why

This paper is significant because it addresses the limitations of applying pre-trained latent diffusion models to image harmonization, particularly the reconstruction errors introduced by image compression. By adapting and enhancing a pre-trained latent diffusion model, it achieves state-of-the-art results on image harmonization benchmarks.

How

The authors adapt a pre-trained Stable Diffusion model for image harmonization by incorporating composite images and foreground masks as input conditions. To mitigate image distortion, they employ two strategies: using higher-resolution images during inference and adding a refinement stage using a UNet model. The method is evaluated on the iHarmony4 dataset using PSNR, MSE, and fMSE metrics and compared with other state-of-the-art methods.
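The evaluation metrics named above can be made concrete with a short sketch. The following is a minimal, illustrative implementation of MSE, foreground MSE (fMSE, the squared error averaged only over foreground pixels), and PSNR, assuming 8-bit RGB images and a binary foreground mask; the function names and array conventions are our own, not the authors':

```python
import numpy as np

def mse(pred, target):
    """Mean squared error averaged over all pixels and channels."""
    return float(np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2))

def fmse(pred, target, mask):
    """Foreground MSE: squared error averaged over foreground pixels only.

    pred, target: (H, W, 3) arrays; mask: (H, W) binary array where 1 marks
    the composite foreground region.
    """
    diff2 = (pred.astype(np.float64) - target.astype(np.float64)) ** 2
    fg = mask.astype(bool)
    # Boolean indexing with an (H, W) mask selects foreground pixels
    # across all channels, yielding an (N, 3) array.
    return float(diff2[fg].mean())

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in decibels."""
    return float(10.0 * np.log10(max_val ** 2 / mse(pred, target)))
```

Because fMSE normalizes by the foreground area rather than the whole image, it penalizes errors on small foregrounds more heavily than plain MSE does, which is why it is commonly reported alongside MSE in harmonization work.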

Result

DiffHarmony achieves state-of-the-art results on the iHarmony4 dataset, demonstrating the effectiveness of the proposed approach. Notably, the method excels at harmonizing images with larger foreground regions. Higher-resolution inference significantly improves performance, and the refinement stage further enhances the quality of the generated images. The authors also conduct an ablation study to analyze the contribution of each component and a further analysis comparing their method with a state-of-the-art model trained on higher-resolution images.

Limitations and Future Work

The authors acknowledge that their method’s performance on images with small foreground regions requires further investigation. Future work could explore using even higher image resolutions or employing better pre-trained diffusion models to address the limitations of information compression. Additionally, exploring alternative refinement techniques or more advanced network architectures for the refinement stage could lead to further improvements.

Abstract

Image harmonization, which involves adjusting the foreground of a composite image to attain a unified visual consistency with the background, can be conceptualized as an image-to-image translation task. Diffusion models have recently promoted the rapid development of image-to-image translation tasks. However, training diffusion models from scratch is computationally intensive. Fine-tuning pre-trained latent diffusion models entails dealing with the reconstruction error induced by the image compression autoencoder, making them unsuitable for image generation tasks that involve pixel-level evaluation metrics. To deal with these issues, in this paper, we first adapt a pre-trained latent diffusion model to the image harmonization task to generate harmonious but potentially blurry initial images. Then we implement two strategies, utilizing higher-resolution images during inference and incorporating an additional refinement stage, to further enhance the clarity of the initially harmonized images. Extensive experiments on the iHarmony4 dataset demonstrate the superiority of our proposed method. The code and model will be made publicly available at https://github.com/nicecv/DiffHarmony.