Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt
Authors: Zhiqi Huang, Huixin Xiong, Haoyu Wang, Longguang Wang, Zhiheng Li
What
This paper introduces Mask-ControlNet, a novel framework for enhancing text-to-image generation quality using an additional mask prompt, aiming to improve object fidelity and foreground-background harmony.
Why
This research matters because existing text-to-image generation models struggle to accurately replicate objects from reference images, especially when the composition is complex; the paper addresses that limitation with a mask-based conditioning scheme that improves both image quality and controllability.
How
The authors propose a two-stage framework. 1) Training: a diffusion model is trained on text prompts, reference images, and object masks extracted with the Segment Anything Model (SAM), learning to generate images conditioned on all three inputs. 2) Inference: given a reference image and a text prompt, SAM segments the object of interest, and the model generates an image that follows the text prompt while maintaining fidelity to the segmented object.
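The paper's trained weights are not referenced here, so the following is only a minimal sketch of the inference flow described above: it assumes SAM via the segment_anything package and substitutes a public segmentation-conditioned ControlNet checkpoint for the paper's Mask-ControlNet. The file paths, bounding box, and text prompt are hypothetical placeholders, and the training stage is not shown.

```python
# Illustrative sketch only: a public ControlNet checkpoint stands in for the
# paper's trained Mask-ControlNet, which is not referenced in this summary.
import numpy as np
import torch
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# 1) Segment the object of interest in the reference image with SAM.
reference = Image.open("reference.jpg").convert("RGB")        # hypothetical input path
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(reference))
box = np.array([80, 60, 420, 380])                            # hypothetical box around the object
masks, _, _ = predictor.predict(box=box, multimask_output=False)
mask = masks[0]                                               # (H, W) boolean mask

# 2) Build the mask prompt: keep only the masked object as the extra condition.
object_only = np.array(reference) * mask[..., None]
control_image = Image.fromarray(object_only.astype(np.uint8))

# 3) Condition a diffusion model on the mask prompt plus the text prompt.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16  # stand-in checkpoint, not the paper's
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

result = pipe(
    "a cup on a wooden table in a sunlit kitchen",            # hypothetical text prompt
    image=control_image,
    num_inference_steps=30,
).images[0]
result.save("generated.png")
```

Per the summary above, the paper conditions on the segmented object image itself, which is why the sketch passes the masked object as the control image rather than a generic segmentation map.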
Result
The paper shows that using mask prompts leads to:
- Improved object fidelity, preserving details and reducing distortions.
- Better handling of complex foreground-background relationships, resulting in more harmonious compositions.
- Quantitatively, Mask-ControlNet outperforms existing methods on FID, PSNR, SSIM, LPIPS, CLIP, and DINO scores.
- Qualitatively, generated images exhibit higher visual quality and realism, as confirmed by user studies.
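For context, below is a hedged sketch of how such fidelity and perceptual metrics can be computed with torchmetrics. This is not the paper's evaluation code: the image tensors are random placeholders, the caption is hypothetical, and the DINO similarity is omitted because it requires a custom feature extractor.

```python
# Sketch of the reported metrics using torchmetrics; placeholder data only.
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.multimodal.clip_score import CLIPScore

generated = torch.rand(4, 3, 256, 256)   # placeholder generated images in [0, 1]
reference = torch.rand(4, 3, 256, 256)   # placeholder reference images in [0, 1]

psnr = PeakSignalNoiseRatio(data_range=1.0)(generated, reference)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)(generated, reference)
lpips = LearnedPerceptualImagePatchSimilarity(normalize=True)(generated, reference)

# FID compares feature distributions; real evaluation needs many more images.
fid = FrechetInceptionDistance(feature=2048)
fid.update((reference * 255).to(torch.uint8), real=True)
fid.update((generated * 255).to(torch.uint8), real=False)

# CLIP score measures image-text alignment against the (hypothetical) caption.
clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
clip_score = clip((generated * 255).to(torch.uint8), ["a cup on a wooden table"] * 4)

print(psnr.item(), ssim.item(), lpips.item(), fid.compute().item(), clip_score.item())
```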
Limitations and Future Work
The paper does not explicitly mention limitations or future work. However, potential areas for improvement include:
- Exploring mask generation techniques beyond SAM to handle more complex scenes and object boundaries.
- Investigating the model's generalization to unseen object categories and diverse datasets.
- Extending the framework to allow finer-grained control over object placement and relationships within the generated image.
Abstract
Text-to-image generation has witnessed great progress, especially with the recent advancements in diffusion models. Since text prompts cannot provide detailed conditions such as object appearance, reference images are usually leveraged to control the objects in the generated images. However, existing methods still suffer from limited accuracy when the relationship between the foreground and background is complicated. To address this issue, we develop a framework termed Mask-ControlNet by introducing an additional mask prompt. Specifically, we first employ large vision models to obtain masks that segment the objects of interest in the reference image. Then, the segmented object images are employed as additional prompts to help the diffusion model better understand the relationship between foreground and background regions during image generation. Experiments show that the mask prompts enhance the controllability of the diffusion model, maintaining higher fidelity to the reference image while achieving better image quality. Comparisons with previous text-to-image generation methods demonstrate our method's superior quantitative and qualitative performance on the benchmark datasets.