Transparent Image Layer Diffusion using Latent Transparency
Authors: Lvmin Zhang, Maneesh Agrawala
What
This paper introduces LayerDiffuse, an approach that enables large-scale pretrained latent diffusion models to generate transparent images, either as a single transparent image or as multiple coherent transparent layers, by encoding transparency as an offset in the model’s latent space.
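To make the “latent offset” idea concrete, here is a minimal sketch, assuming a small hypothetical encoder (`LatentTransparencyEncoder` and `encode_with_transparency` are illustrative names and shapes, not the authors’ released code): the RGBA image is mapped to a small offset in the pretrained VAE’s latent space, and the diffusion model sees the frozen VAE latent plus that offset.

```python
import torch
import torch.nn as nn

class LatentTransparencyEncoder(nn.Module):
    """Hypothetical encoder: RGBA image -> offset in the pretrained VAE's latent space."""
    def __init__(self, latent_channels: int = 4):
        super().__init__()
        # Three stride-2 convolutions match the 8x spatial downsampling of the SD VAE latent grid.
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_channels, 3, stride=2, padding=1),
        )

    def forward(self, rgb: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # Concatenate RGB (3 channels) and alpha (1 channel) into a 4-channel input.
        return self.net(torch.cat([rgb, alpha], dim=1))

def encode_with_transparency(vae_encode, offset_encoder, rgb, alpha):
    """Return the transparency-adjusted latent: frozen VAE latent plus a small offset."""
    base_latent = vae_encode(rgb)        # pretrained, frozen VAE encoder
    offset = offset_encoder(rgb, alpha)  # "latent transparency" encoded as an offset
    return base_latent + offset          # adjusted latent seen by the diffusion model
```

A paired decoder (not shown) would recover color and alpha from the adjusted latent; keeping the offset small is what preserves the pretrained model’s latent distribution.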
Why
This paper is significant because transparent images and layered content are in high demand in visual content editing, yet generating them natively has received little research attention. The approach tackles two obstacles: the scarcity of transparent training data and the sensitivity of pretrained diffusion models to alterations of their latent distribution.
How
The authors develop ‘latent transparency,’ a method that encodes the alpha channel into the latent space of a pretrained latent diffusion model (Stable Diffusion) as a regulated offset, leaving the original latent distribution largely intact. The model is then finetuned on a dataset of 1 million transparent image layer pairs collected with a human-in-the-loop scheme, in which GPT models generate diverse, semantically related prompts for the foreground and background layers.
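The human-in-the-loop collection can be pictured as a simple loop like the sketch below; the three callables stand in for the GPT prompt generator, the current layer generator, and the human annotator, and are assumptions rather than the paper’s actual tooling.

```python
from typing import Callable, List, Tuple

def collect_layer_pairs(
    propose_prompts: Callable[[], Tuple[str, str]],                # e.g. a GPT call returning (fg, bg) prompts
    generate_layers: Callable[[str, str], Tuple[object, object]],  # current model producing a layer pair
    human_accepts: Callable[[object, object], bool],               # annotator keeps or rejects the sample
    target_size: int = 1_000_000,
) -> List[Tuple[object, object, str, str]]:
    """Human-in-the-loop collection sketch: accepted (foreground, background) layer
    pairs accumulate until the target dataset size is reached."""
    dataset: List[Tuple[object, object, str, str]] = []
    while len(dataset) < target_size:
        fg_prompt, bg_prompt = propose_prompts()
        fg_layer, bg_layer = generate_layers(fg_prompt, bg_prompt)
        if human_accepts(fg_layer, bg_layer):
            dataset.append((fg_layer, bg_layer, fg_prompt, bg_prompt))
    return dataset
```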
Result
LayerDiffuse generates high-quality transparent images and layers, as demonstrated through qualitative results and a user study. In 97% of cases users preferred its natively generated transparency over conventional generate-then-matte pipelines, and they rated the quality as comparable to commercial transparent image assets such as Adobe Stock.
Limitations & Future Work
The authors acknowledge a trade-off between generating ‘clean transparent elements’ (reusable assets free of scene-specific illumination effects) and achieving ‘harmonious blending’ with other layers. They suggest exploring improved methods for harmonious blending as future work.
Abstract
We present LayerDiffuse, an approach enabling large-scale pretrained latent diffusion models to generate transparent images. The method allows generation of single transparent images or of multiple transparent layers. The method learns a “latent transparency” that encodes alpha channel transparency into the latent manifold of a pretrained latent diffusion model. It preserves the production-ready quality of the large diffusion model by regulating the added transparency as a latent offset with minimal changes to the original latent distribution of the pretrained model. In this way, any latent diffusion model can be converted into a transparent image generator by finetuning it with the adjusted latent space. We train the model with 1M transparent image layer pairs collected using a human-in-the-loop collection scheme. We show that latent transparency can be applied to different open source image generators, or be adapted to various conditional control systems to achieve applications like foreground/background-conditioned layer generation, joint layer generation, structural control of layer contents, etc. A user study finds that in most cases (97%) users prefer our natively generated transparent content over previous ad-hoc solutions such as generating and then matting. Users also report the quality of our generated transparent images is comparable to real commercial transparent assets like Adobe Stock.
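Since the abstract notes that any latent diffusion model can be converted by finetuning it with the adjusted latent space, the finetuning itself can follow the standard noise-prediction objective. A minimal sketch, assuming Hugging Face diffusers-style `unet` and `scheduler` objects (a generic training step, not the authors’ training code):

```python
import torch
import torch.nn.functional as F

def transparent_finetune_step(unet, scheduler, adjusted_latents, text_embeddings, optimizer):
    """One standard epsilon-prediction training step on transparency-adjusted latents."""
    noise = torch.randn_like(adjusted_latents)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (adjusted_latents.shape[0],), device=adjusted_latents.device,
    )
    noisy_latents = scheduler.add_noise(adjusted_latents, noise, timesteps)
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
    loss = F.mse_loss(noise_pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The only difference from ordinary latent diffusion finetuning is that the latents carry the transparency offset, so the frozen backbone needs only minimal adaptation.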