2310.03739 |
Aligning Text-to-Image Diffusion Models with Reward Backpropagation |
Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, Katerina Fragkiadaki |
Text-to-image diffusion models have recently emerged at the forefront of
image generation, powered by very large-scale unsupervised or weakly supervised
text-to-image training datasets. Due to their unsupervised training,
controlling their behavior in downstream tasks, such as maximizing
human-perceived image quality, image-text alignment, or ethical image
generation, is difficult. Recent works finetune diffusion models to downstream
reward functions using vanilla reinforcement learning, notorious for the high
variance of the gradient estimators. In this paper, we propose AlignProp, a
method that aligns diffusion models to downstream reward functions using
end-to-end backpropagation of the reward gradient through the denoising
process. While naive implementation of such backpropagation would require
prohibitive memory resources for storing the partial derivatives of modern
text-to-image models, AlignProp finetunes low-rank adapter weight modules and
uses gradient checkpointing to render its memory usage viable. We test
AlignProp in finetuning diffusion models to various objectives, such as
image-text semantic alignment, aesthetics, compressibility and controllability
of the number of objects present, as well as their combinations. We show
AlignProp achieves higher rewards in fewer training steps than alternatives,
while being conceptually simpler, making it a straightforward choice for
optimizing diffusion models for differentiable reward functions of interest.
Code and visualization results are available at https://align-prop.github.io/. |
This paper introduces AlignProp, a novel method for aligning text-to-image diffusion models with specific reward functions using end-to-end backpropagation through the denoising process, overcoming memory constraints with techniques like LoRA and gradient checkpointing. |
This work is important because it provides a more efficient and effective way to adapt pre-trained diffusion models for specific downstream tasks that require optimizing for objectives like aesthetics, semantic alignment, or ethical image generation, which are difficult to achieve with standard training methods. |
The authors frame denoising inference as a differentiable recurrent policy and train it using end-to-end backpropagation of gradients from a reward function. To handle memory issues, they fine-tune low-rank adapter (LoRA) weights and employ gradient checkpointing. To prevent overfitting to the reward function, they introduce randomized truncated backpropagation through time. |
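A minimal sketch of this reward-backpropagation loop with randomized truncated backpropagation is given below; `unet`, `scheduler_step`, and `reward_fn` are illustrative placeholders rather than the authors' released code, and only the last K denoising steps (K drawn at random) track gradients, which is what keeps memory bounded alongside LoRA and gradient checkpointing.

```python
import torch

def alignprop_step(unet, scheduler_step, reward_fn, prompt_emb, T=50, device="cpu"):
    """One illustrative finetuning step: sample an image, score it with a
    differentiable reward, and backpropagate through part of the sampling chain.
    Randomized truncation: only the last K denoising steps track gradients."""
    z = torch.randn(1, 4, 64, 64, device=device)
    K = int(torch.randint(1, T + 1, (1,)))           # truncation length, resampled each step
    for t in reversed(range(T)):
        if t >= K:                                    # early steps: no gradient tracking
            with torch.no_grad():
                eps = unet(z, t, prompt_emb)
                z = scheduler_step(z, eps, t)
        else:                                         # last K steps: gradients flow to LoRA weights
            eps = unet(z, t, prompt_emb)
            z = scheduler_step(z, eps, t)
    image = z                                         # stand-in for the VAE decode of the final latent
    loss = -reward_fn(image, prompt_emb)              # maximize the reward
    loss.backward()
    return float(loss)
```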
AlignProp achieves higher reward scores and converges faster than reinforcement learning baselines like DDPO. It also demonstrates better generalization to new prompts and is preferred by human evaluators for fidelity and image-text alignment. The paper shows that mixing weights of models finetuned on different reward functions allows for interpolation between these objectives. |
The authors acknowledge the limitation of potential over-optimization when the reward function is imperfect and suggest that mitigating this risk is an area for future work. Additionally, extending AlignProp to diffusion-based language models for improved alignment with human feedback is another promising direction. |
diffusion_model, alignment, image_generation, reward_function, backpropagation, lora, gradient_checkpointing, text-to-image, human_evaluation, generalization |
2404.18928 |
Stylus: Automatic Adapter Selection for Diffusion Models |
Michael Luo, Justin Wong, Brandon Trabucco, Yanping Huang, Joseph E. Gonzalez, Zhifeng Chen, Ruslan Salakhutdinov, Ion Stoica |
Beyond scaling base models with more data or parameters, fine-tuned adapters
provide an alternative way to generate high fidelity, custom images at reduced
costs. As such, adapters have been widely adopted by open-source communities,
accumulating a database of over 100K adapters, most of which are highly
customized and come with insufficient descriptions. This paper explores the problem of
matching the prompt to a set of relevant adapters, building on recent work that
highlights the performance gains of composing adapters. We introduce Stylus,
which efficiently selects and automatically composes task-specific adapters
based on a prompt's keywords. Stylus outlines a three-stage approach that first
summarizes adapters with improved descriptions and embeddings, retrieves
relevant adapters, and then further assembles adapters based on prompts'
keywords by checking how well they fit the prompt. To evaluate Stylus, we
developed StylusDocs, a curated dataset featuring 75K adapters with
pre-computed adapter embeddings. In our evaluation on popular Stable Diffusion
checkpoints, Stylus achieves greater CLIP-FID Pareto efficiency and is preferred
twice as often as the base model by both human and multimodal-model evaluators.
See stylus-diffusion.github.io for more. |
This paper introduces Stylus, a system designed to automatically select and compose fine-tuned adapters for Stable Diffusion models based on user prompts to enhance image quality and diversity. |
This paper addresses the challenge of leveraging the vast and growing number of publicly available adapters for Stable Diffusion, which are often poorly documented and require manual selection. Stylus automates this process, making it easier for users to generate high-quality images by automatically identifying and combining relevant adapters based on the prompt, leading to improvements in visual fidelity, textual alignment, and image diversity. |
Stylus utilizes a three-stage framework: 1) **Refiner**: Employs a VLM to process adapter model cards and generate improved textual descriptions and embeddings for each adapter. 2) **Retriever**: Retrieves candidate adapters relevant to the user prompt by calculating cosine similarity scores between the prompt embedding and adapter embeddings. 3) **Composer**: Segments the prompt into keywords representing distinct tasks and assigns relevant adapters to each task using a long-context LLM, effectively filtering irrelevant adapters. Additionally, a masking strategy ensures diversity by applying different adapter combinations for a single prompt. |
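A hedged sketch of the retrieval stage described above, using cosine similarity between a prompt embedding and precomputed adapter-description embeddings (names and shapes are illustrative; this is not the Stylus API):

```python
import numpy as np

def retrieve_adapters(prompt_emb: np.ndarray, adapter_embs: np.ndarray, k: int = 10):
    """Return indices of the k adapters most similar to the prompt.

    prompt_emb: (d,) embedding of the user prompt.
    adapter_embs: (N, d) precomputed embeddings of adapter descriptions.
    """
    p = prompt_emb / np.linalg.norm(prompt_emb)
    A = adapter_embs / np.linalg.norm(adapter_embs, axis=1, keepdims=True)
    scores = A @ p                        # cosine similarity per adapter
    return np.argsort(-scores)[:k], scores
```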
Stylus demonstrates significant improvements over baseline Stable Diffusion models and alternative retrieval methods. Key results include: - Achieves a higher preference score (2:1) compared to baseline models in human evaluations. - Demonstrates better CLIP/FID Pareto efficiency, indicating superior visual fidelity and textual alignment. - Generates more diverse images per prompt, as evidenced by quantitative metrics (dFId) and VLM-based assessments. - Proves effective for various image-to-image tasks, including image translation and inpainting. |
The paper acknowledges limitations and suggests areas for future work: - **Task Blocking**: Composer may not fully prevent adapters from overriding existing concepts within the prompt. - **Task Diversity**: Merging adapters may reduce diversity in generating instances of a single task. - **Low-quality Adapters**: Blacklisting low-quality adapters is challenging, and some might still be selected. - **Retrieval Errors**: Refiner and Composer may introduce errors, leading to suboptimal adapter choices. Future work could explore: - Developing more robust solutions to address task blocking and diversity. - Improving the accuracy and efficiency of the Refiner and Composer components. - Investigating alternative masking schemes for enhanced diversity. |
diffusion_model, adapter, llm, analysis, image_generation, retrieval, lora |
2308.14328 |
Reinforcement Learning for Generative AI: A Survey |
Yuanjiang Cao, Quan Z. Sheng, Julian McAuley, Lina Yao |
Deep generative AI has long been an essential topic in the machine learning
community, with impact on application areas such as text generation and computer
vision. The major paradigm for training a generative model is maximum likelihood
estimation, which pushes the learner to capture and approximate the target data
distribution by decreasing the divergence between the model distribution and the
target distribution. This formulation successfully establishes the objective of
generative tasks, but it cannot satisfy all the requirements that a user might
expect from a generative model. Reinforcement learning, which offers a competitive
way to inject new training signals through new objectives, has demonstrated its
power and flexibility to incorporate human inductive bias from multiple angles,
such as adversarial learning, hand-designed rules, and learned reward models.
Reinforcement learning has thereby become a trending research field and has
stretched the limits of generative AI in both model design and application,
making a comprehensive review of recent advances timely. Although surveys exist
for individual application areas, this survey provides a high-level review that
spans a range of application areas. We provide a rigorous taxonomy of the field
and broad coverage of models and applications, notably including the
fast-developing large language model area. We conclude by outlining potential
directions that might address the limits of current models and expand the
frontiers of generative AI. |
This paper presents a comprehensive survey of how reinforcement learning (RL) is used in generative AI, analyzing its benefits, challenges, and applications across various domains. |
This survey is important because it provides a structured overview of a rapidly developing field that bridges reinforcement learning and generative AI, offering insights for both newcomers and experienced researchers to understand current progress and future directions. |
The authors reviewed a wide range of papers published in top conferences and journals, categorizing them based on how RL is used in generative tasks. They focused on applications involving sequential data generation, such as text, code, and molecules. |
The survey highlights that RL is beneficial for handling non-differentiable objectives, introducing new training signals, improving sampling in energy-based models, and automating neural architecture search. The authors also identify challenges like peaked distributions, exploration-exploitation trade-offs, sparse rewards, long-term credit assignment, and generalization. |
The paper points out several future research avenues, including reward function design for multi-objective optimization, model enhancement and control with RL, more sophisticated human preference modeling, addressing sample efficiency and generalization issues, incorporating novel RL algorithms, and understanding the implications of LLMs and foundation models. |
reinforcement_learning, generative_ai, survey, text_generation, code_generation, molecule_design, natural_language_processing, computer_vision, neural_architecture_search, diffusion_model |
2312.09168 |
DiffusionLight: Light Probes for Free by Painting a Chrome Ball |
Pakkapon Phongthawee, Worameth Chinchuthakun, Nontaphat Sinsunthithet, Amit Raj, Varun Jampani, Pramook Khungurn, Supasorn Suwajanakorn |
We present a simple yet effective technique to estimate lighting in a single
input image. Current techniques rely heavily on HDR panorama datasets to train
neural networks to regress an input with limited field-of-view to a full
environment map. However, these approaches often struggle with real-world,
uncontrolled settings due to the limited diversity and size of their datasets.
To address this problem, we leverage diffusion models trained on billions of
standard images to render a chrome ball into the input image. Despite its
simplicity, this task remains challenging: the diffusion models often insert
incorrect or inconsistent objects and cannot readily generate images in HDR
format. Our research uncovers a surprising relationship between the appearance
of chrome balls and the initial diffusion noise map, which we utilize to
consistently generate high-quality chrome balls. We further fine-tune an LDR
diffusion model (Stable Diffusion XL) with LoRA, enabling it to perform
exposure bracketing for HDR light estimation. Our method produces convincing
light estimates across diverse settings and demonstrates superior
generalization to in-the-wild scenarios. |
This paper introduces a novel technique, DiffusionLight, for estimating high dynamic range (HDR) lighting from a single image by leveraging pre-trained text-to-image diffusion models to inpaint a chrome ball into the scene and subsequently unwrapping its reflection to obtain an environment map. |
The paper addresses the limitations of current lighting estimation methods that rely on limited HDR panorama datasets, resulting in poor generalization to real-world, uncontrolled settings. By harnessing the vast image prior of diffusion models trained on billions of standard images, DiffusionLight demonstrates superior generalization and handles diverse in-the-wild scenarios effectively. |
The authors utilize a depth-conditioned Stable Diffusion XL model to inpaint chrome balls, addressing the challenge of generating high-quality reflections. They introduce an iterative inpainting algorithm to locate suitable initial noise maps for consistent ball generation. For HDR prediction, they fine-tune the model with LoRA to perform exposure bracketing, generating multiple LDR chrome balls at varying exposures which are then merged to produce a linearized HDR output. |
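The exposure-bracketing idea can be illustrated with a standard LDR-to-HDR merge; the snippet below is a generic sketch (gamma value, EV choices, and weighting are assumptions, not the paper's exact procedure):

```python
import numpy as np

def merge_exposures(ldr_images, exposure_values, gamma=2.4):
    """Merge LDR chrome-ball renders at different EVs into a linear HDR image.

    ldr_images: list of (H, W, 3) arrays in [0, 1].
    exposure_values: list of EV offsets (e.g. [0, -2, -4]); scale = 2**EV.
    A simple weighted average in linear space; the paper's merge may differ."""
    acc, weight = 0.0, 0.0
    for img, ev in zip(ldr_images, exposure_values):
        linear = np.clip(img, 0, 1) ** gamma          # undo display gamma
        radiance = linear / (2.0 ** ev)               # normalize by exposure scale
        w = 1.0 - np.abs(img - 0.5) * 2.0             # trust mid-tones, not clipped pixels
        acc += w * radiance
        weight += w
    return acc / np.maximum(weight, 1e-6)
```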
DiffusionLight achieves competitive results on standard benchmarks (Laval Indoor and Poly Haven), outperforming StyleLight in terms of Angular Error and Normalized RMSE. Notably, it exhibits strong generalization to in-the-wild images where existing methods struggle. The ablation study confirms the contribution of both the iterative inpainting algorithm and LoRA fine-tuning for improved performance. |
The paper acknowledges limitations such as the assumption of orthographic projection due to unknown camera parameters, occasional failure to reflect environments in overhead images, and the current slow processing time due to diffusion sampling. Future work includes addressing perspective projection, handling overhead views, and exploring faster sampling-efficient diffusion models. |
diffusion_model, light_estimation, hdr, inpainting, lora, environment_map, generalization, in-the-wild |
2312.09187 |
Vision-Language Models as a Source of Rewards |
Kate Baumli, Satinder Baveja, Feryal Behbahani, Harris Chan, Gheorghe Comanici, Sebastian Flennerhag, Maxime Gazeau, Kristian Holsheimer, Dan Horgan, Michael Laskin, Clare Lyle, Hussain Masoom, Kay McKinney, Volodymyr Mnih, Alexander Neitz, Fabio Pardo, Jack Parker-Holder, John Quan, Tim Rocktäschel, Himanshu Sahni, Tom Schaul, Yannick Schroecker, Stephen Spencer, Richie Steigerwald, Luyu Wang, Lei Zhang |
Building generalist agents that can accomplish many goals in rich open-ended
environments is one of the research frontiers for reinforcement learning. A key
limiting factor for building generalist agents with RL has been the need for a
large number of reward functions for achieving different goals. We investigate
the feasibility of using off-the-shelf vision-language models, or VLMs, as
sources of rewards for reinforcement learning agents. We show how rewards for
visual achievement of a variety of language goals can be derived from the CLIP
family of models, and used to train RL agents that can achieve a variety of
language goals. We showcase this approach in two distinct visual domains and
present a scaling trend showing how larger VLMs lead to more accurate rewards
for visual goal achievement, which in turn produces more capable RL agents. |
This paper investigates the use of off-the-shelf vision-language models (VLMs), specifically CLIP, as reward functions for reinforcement learning agents in visual environments, enabling them to achieve language-specified goals. |
This paper is important because it addresses a key challenge in building generalist RL agents: the need for numerous, manually-designed reward functions. Using VLMs as reward generators has the potential to significantly improve the scalability and efficiency of training agents that can perform diverse tasks in complex environments. |
The authors propose a method to derive a binary reward signal from CLIP by: (1) computing the probability of goal achievement based on cosine similarity between image and text embeddings, and (2) thresholding this probability. They then use this reward to train RL agents in two visual domains: Playhouse and AndroidEnv, evaluating the agent's performance on achieving various language-specified goals. |
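A simplified sketch of this reward derivation, operating on precomputed CLIP embeddings (the softmax-over-distractors formulation, temperature, and threshold are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def vlm_reward(image_emb: torch.Tensor, goal_emb: torch.Tensor,
               negative_embs: torch.Tensor, threshold: float = 0.5,
               temperature: float = 0.01) -> float:
    """Binary reward for 'did the agent achieve the language goal?'.

    image_emb: (d,) CLIP image embedding of the current observation.
    goal_emb: (d,) CLIP text embedding of the goal sentence.
    negative_embs: (N, d) embeddings of distractor goals.
    Cosine similarities are softmaxed into a probability of goal achievement,
    then thresholded into a binary reward."""
    img = F.normalize(image_emb, dim=-1)
    texts = F.normalize(torch.cat([goal_emb[None], negative_embs], dim=0), dim=-1)
    sims = texts @ img                      # (1 + N,) cosine similarities
    prob_goal = torch.softmax(sims / temperature, dim=0)[0]
    return float(prob_goal > threshold)
```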
The key findings suggest that maximizing the VLM-derived reward leads to an improvement in ground truth reward, indicating the effectiveness of VLMs as reward functions. The authors also show that larger VLMs lead to more accurate rewards and subsequently better agent performance. Furthermore, they demonstrate the importance of prompt engineering in improving the performance of the VLM reward model. |
The paper acknowledges limitations regarding the potential for reward hacking, which was not observed within the scope of their experiments. Future work could explore generalizing negative sampling from generative distributions, such as LLMs. Additionally, exploring the impact of VLM advancements on training generalist agents without domain-specific fine-tuning is suggested. |
diffusion_model, llm, rl, vision-language model, reward function, clip, playhouse, androidenv, prompt engineering |
2308.10916 |
Diffusion Model as Representation Learner |
Xingyi Yang, Xinchao Wang |
Diffusion Probabilistic Models (DPMs) have recently demonstrated impressive
results on various generative tasks. Despite their promise, however, the learned
representations of pre-trained DPMs have not been fully understood.
In this paper, we conduct an in-depth investigation of the representation power
of DPMs, and propose a novel knowledge transfer method that leverages the
knowledge acquired by generative DPMs for recognition tasks. Our study begins
by examining the feature space of DPMs, revealing that DPMs are inherently
denoising autoencoders that balance the representation learning with
regularizing model capacity. To this end, we introduce a novel knowledge
transfer paradigm named RepFusion. Our paradigm extracts representations at
different time steps from off-the-shelf DPMs and dynamically employs them as
supervision for student networks, in which the optimal time is determined
through reinforcement learning. We evaluate our approach on several image
classification, semantic segmentation, and landmark detection benchmarks, and
demonstrate that it outperforms state-of-the-art methods. Our results uncover
the potential of DPMs as a powerful tool for representation learning and
provide insights into the usefulness of generative models beyond sample
generation. The code is available at
\url{https://github.com/Adamdad/Repfusion}. |
This paper investigates the potential of Diffusion Probabilistic Models (DPMs) for representation learning and proposes RepFusion, a novel knowledge transfer method that leverages pre-trained DPMs to enhance performance in recognition tasks like image classification and semantic segmentation. |
This paper is important because it explores the under-utilized representation learning capability of DPMs, going beyond their traditional generative applications. It offers a new perspective on leveraging pre-trained generative models for improved performance in discriminative tasks. |
The authors first establish a theoretical connection between DPMs and denoising autoencoders, demonstrating the time-dependent nature of DPM latent space. They then introduce RepFusion, which uses reinforcement learning to dynamically select optimal time steps for distilling knowledge from a pre-trained DPM into a student network. This student network is then fine-tuned for specific recognition tasks. |
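The following sketch illustrates the general shape of such RL-driven time-step selection for feature distillation; it is not the released RepFusion code, and the exact reward and update scheme in the paper may differ:

```python
import torch
import torch.nn.functional as F

def repfusion_step(student, teacher_feat_at, policy_logits, x, y, task_loss_fn):
    """Illustrative distillation step with an RL-selected diffusion timestep.

    teacher_feat_at(x, t): features of a frozen pre-trained DPM at timestep t.
    policy_logits: (T,) learnable logits over candidate timesteps.
    student(x): returns (features, task_prediction).
    The timestep policy is reinforced by the (negative) downstream task loss."""
    dist = torch.distributions.Categorical(logits=policy_logits)
    t = dist.sample()                                  # sample a timestep from the policy
    with torch.no_grad():
        target = teacher_feat_at(x, int(t))            # teacher representation at that timestep
    feat, pred = student(x)
    distill_loss = F.mse_loss(feat, target)            # match the teacher's representation
    task_loss = task_loss_fn(pred, y)
    reward = -task_loss.detach()                       # better task performance = higher reward
    policy_loss = -dist.log_prob(t) * reward           # REINFORCE on the timestep choice
    return distill_loss + task_loss + policy_loss
```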
RepFusion consistently outperforms baseline models and other self-supervised learning methods on various benchmarks, including CIFAR-10, Tiny-ImageNet, CelebAMask-HQ, and WFLW. Notably, it shows significant improvements in semantic segmentation, particularly in challenging scenarios with large pose variations and occlusions. |
The paper acknowledges the limitations of existing work on utilizing DPMs for representation learning, such as complex model modifications. As future work, the authors suggest exploring the time-step selection strategy further. Additionally, they highlight the need for a deeper understanding of the relationship between the chosen time step and the specific downstream task. |
diffusion_model, representation_learning, knowledge_distillation, semantic_segmentation, image_classification, landmark_detection, reinforcement_learning, analysis |
2310.03502 |
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion |
Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov |
Text-to-image generation is a significant domain in modern computer vision
and has achieved substantial improvements through the evolution of generative
architectures. Among these, there are diffusion-based models that have
demonstrated essential quality enhancements. These models are generally split
into two categories: pixel-level and latent-level approaches. We present
Kandinsky, a novel exploration of latent diffusion architecture, combining the
principles of the image prior models with latent diffusion techniques. The
image prior model is trained separately to map text embeddings to image
embeddings of CLIP. Another distinct feature of the proposed model is the
modified MoVQ implementation, which serves as the image autoencoder component.
Overall, the designed model contains 3.3B parameters. We also deployed a
user-friendly demo system that supports diverse generative modes such as
text-to-image generation, image fusion, text and image fusion, image variations
generation, and text-guided inpainting/outpainting. Additionally, we released
the source code and checkpoints for the Kandinsky models. Experimental
evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking
our model as the top open-source performer in terms of measurable image
generation quality. |
This paper introduces Kandinsky, a novel text-to-image generation model based on latent diffusion architecture, combining image prior models with latent diffusion techniques, and demonstrates its capabilities through various generation modes and state-of-the-art performance on image generation quality. |
This paper is important because it presents a novel approach to text-to-image generation using a combination of image prior and latent diffusion, achieves state-of-the-art performance on image generation quality, and provides a fully open-source implementation of the model and user-friendly tools like a web application and Telegram bot, making it accessible for various applications. |
The authors developed Kandinsky by training an image prior model to map text embeddings to image embeddings of CLIP and utilizing a modified MoVQ implementation as the image autoencoder component. They conducted experiments on the COCO-30K dataset using FID-CLIP curves and human evaluation to assess the performance of different configurations, including various image prior setups and the effect of latent quantization. |
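Since the best-performing prior configuration was reportedly a simple linear mapping, the idea can be sketched as a single linear layer regressing CLIP image embeddings from CLIP text embeddings (the dimensions and the MSE objective below are placeholders, not the paper's exact training setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearImagePrior(nn.Module):
    """Illustrative linear image prior: map a CLIP text embedding to a CLIP
    image embedding with a single learned linear projection."""
    def __init__(self, text_dim: int = 768, image_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(text_dim, image_dim)

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(text_emb)

def prior_loss(prior: LinearImagePrior, text_emb, image_emb):
    # Regression loss between predicted and real CLIP image embeddings.
    return F.mse_loss(prior(text_emb), image_emb)
```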
Kandinsky achieved a FID score of 8.03 on the COCO-30K dataset, making it the top open-source performer in terms of measurable image generation quality. The study found that a simple linear mapping for image prior yielded the best FID score, suggesting a potential linear relationship between visual and textual embedding spaces. Additionally, quantization of latent codes in MoVQ slightly improved image quality. |
Limitations mentioned include the need for further research to enhance the semantic coherence between text and generated images and improve FID scores and image quality based on human evaluation. Future work will focus on exploring newer image encoders, developing more efficient UNet architectures, improving text prompt understanding, generating higher-resolution images, and investigating new features like local image editing and addressing the potential for generating harmful content. |
diffusion_model, text-to-image, image_generation, image_prior, latent_diffusion, movq, clip, fid, open-source, web_application, telegram_bot |
2401.01335 |
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models |
Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu |
Harnessing the power of human-annotated data through Supervised Fine-Tuning
(SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we
delve into the prospect of growing a strong LLM out of a weak one without the
need for acquiring additional human-annotated data. We propose a new
fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a
supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism,
where the LLM refines its capability by playing against instances of itself.
More specifically, the LLM generates its own training data from its previous
iterations, refining its policy by discerning these self-generated responses
from those obtained from human-annotated data. Our method progressively
elevates the LLM from a nascent model to a formidable one, unlocking the full
potential of human-annotated demonstration data for SFT. Theoretically, we
prove that the global optimum to the training objective function of our method
is achieved only when the LLM policy aligns with the target data distribution.
Empirically, we evaluate our method on several benchmark datasets including the
HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our
results show that SPIN can significantly improve the LLM's performance across a
variety of benchmarks and even outperform models trained through direct
preference optimization (DPO) supplemented with extra GPT-4 preference data.
This sheds light on the promise of self-play, enabling the achievement of
human-level performance in LLMs without the need for expert opponents. Codes
are available at https://github.com/uclaml/SPIN. |
This paper proposes a new fine-tuning method called Self-Play fine-tuning (SPIN) for Large Language Models (LLMs) that leverages a self-play mechanism to improve a model's performance without requiring additional human-annotated data. |
This paper is important because it offers a way to enhance LLM performance without the need for expensive and time-consuming data annotation beyond the initial fine-tuning dataset. It provides a theoretical analysis of the method's convergence and demonstrates its empirical effectiveness on various benchmark datasets. |
The authors propose a self-play mechanism where an LLM acts as both the main player and the opponent. The main player is trained to distinguish between responses generated by the opponent (an older version of the LLM) and human-annotated data. This iterative process refines the LLM's ability to generate responses aligned with the target data distribution. |
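The self-play objective has a DPO-like form; a hedged sketch of such a loss, operating on sequence log-probabilities under the current model and the previous iterate, is given below (variable names and the regularization weight are illustrative):

```python
import torch.nn.functional as F

def spin_loss(logp_real_cur, logp_real_old, logp_gen_cur, logp_gen_old, lam=0.1):
    """Illustrative SPIN-style objective.

    logp_*_cur: sequence log-probs under the current model (main player).
    logp_*_old: sequence log-probs under the previous iterate (opponent).
    The main player is trained to rank human-annotated responses above the
    opponent's self-generated responses via a logistic loss on the margin."""
    margin = lam * ((logp_real_cur - logp_real_old) - (logp_gen_cur - logp_gen_old))
    return F.softplus(-margin).mean()      # logistic loss: log(1 + exp(-margin))
```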
The paper shows SPIN significantly improves LLM performance on benchmarks like HuggingFace Open LLM Leaderboard and MT-Bench. Notably, SPIN outperforms methods like Direct Preference Optimization (DPO), which requires additional preference data, and achieves comparable results even at iteration 0. The paper also demonstrates the importance of iterative training and analyzes the impact of training data size. |
The paper acknowledges a limitation in that the fixed target data distribution, derived from humans, limits the potential performance. Future work could explore dynamically changing target distributions to push LLM capabilities beyond human-level. Additionally, the authors suggest exploring methods to reduce the volume of synthetic data needed for training. |
llm, fine-tuning, self-play, sft, dpo, analysis, benchmark |
2312.06663 |
CAD: Photorealistic 3D Generation via Adversarial Distillation |
Ziyu Wan, Despoina Paschalidou, Ian Huang, Hongyu Liu, Bokui Shen, Xiaoyu Xiang, Jing Liao, Leonidas Guibas |
The increased demand for 3D data in AR/VR, robotics and gaming applications
gave rise to powerful generative pipelines capable of synthesizing high-quality
3D objects. Most of these models rely on the Score Distillation Sampling (SDS)
algorithm to optimize a 3D representation such that the rendered image
maintains a high likelihood as evaluated by a pre-trained diffusion model.
However, finding a correct mode in the high-dimensional distribution produced
by the diffusion model is challenging and often leads to issues such as
over-saturation, over-smoothing, and Janus-like artifacts. In this paper, we
propose a novel learning paradigm for 3D synthesis that utilizes pre-trained
diffusion models. Instead of focusing on mode-seeking, our method directly
models the distribution discrepancy between multi-view renderings and diffusion
priors in an adversarial manner, which unlocks the generation of high-fidelity
and photorealistic 3D content, conditioned on a single image and prompt.
Moreover, by harnessing the latent space of GANs and expressive diffusion model
priors, our method facilitates a wide variety of 3D applications including
single-view reconstruction, high diversity generation and continuous 3D
interpolation in the open domain. The experiments demonstrate the superiority
of our pipeline compared to previous works in terms of generation quality and
diversity. |
This paper introduces Consistent Adversarial Distillation (CAD), a novel method for synthesizing high-quality, photorealistic 3D objects from a single image and text prompt by leveraging pre-trained 2D diffusion models and addressing limitations of existing score distillation methods. |
This work is important because it overcomes limitations of previous 3D generation techniques, such as over-saturation, over-smoothing, and limited diversity, by directly modeling the distribution of a pre-trained diffusion model through adversarial learning, leading to higher quality and more diverse 3D object synthesis. |
The authors propose a framework that uses a StyleGAN2-based generator to model the 3D distribution of objects, trained adversarially against a discriminator to match the distribution of a pre-trained 2D diffusion model. To ensure multi-view consistency and high-fidelity generation, they employ a two-stage training process with 2D and 3D upsampling branches, a camera pose pruning strategy for filtering inconsistent samples, and a distribution refinement step using additional diffusion models. |
CAD generates high-fidelity 3D objects with photorealistic textures and fewer artifacts compared to existing methods like DreamFusion, ProlificDreamer, Magic123, and Zero-1-to-3. It also demonstrates superior performance in quantitative metrics like CLIP similarity score and qualitative evaluations including a user study, highlighting its ability to produce diverse and realistic 3D objects. |
The authors acknowledge limitations in optimization speed due to volumetric rendering and suggest exploring efficient rendering techniques like Gaussian Splatting. They also propose future work on enabling multi-conditional generation and extending CAD to handle scene-level synthesis. |
diffusion_model, gan, 3d, single-view reconstruction, photorealistic, adversarial_distillation |
2405.00760 |
Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models |
Xiaoshi Wu, Yiming Hao, Manyuan Zhang, Keqiang Sun, Zhaoyang Huang, Guanglu Song, Yu Liu, Hongsheng Li |
Optimizing a text-to-image diffusion model with a given reward function is an
important but underexplored research area. In this study, we propose Deep
Reward Tuning (DRTune), an algorithm that directly supervises the final output
image of a text-to-image diffusion model and back-propagates through the
iterative sampling process to the input noise. We find that training earlier
steps in the sampling process is crucial for low-level rewards, and deep
supervision can be achieved efficiently and effectively by stopping the
gradient of the denoising network input. DRTune is extensively evaluated on
various reward models. It consistently outperforms other algorithms,
particularly for low-level control signals, where all shallow supervision
methods fail. Additionally, we fine-tune Stable Diffusion XL 1.0 (SDXL 1.0)
model via DRTune to optimize Human Preference Score v2.1, resulting in the
Favorable Diffusion XL 1.0 (FDXL 1.0) model. FDXL 1.0 significantly enhances
image quality compared to SDXL 1.0 and reaches comparable quality compared with
Midjourney v5.2. |
This paper presents DRTune, a novel algorithm for efficiently fine-tuning text-to-image diffusion models using deep reward supervision, enabling optimization based on various reward functions like image aesthetics and symmetry. |
This research is important because it addresses the challenge of optimizing diffusion models with complex reward functions, particularly those requiring deep supervision, which is crucial for controlling global image properties and improving generated image quality. |
The authors propose DRTune, which employs two key techniques: 1) stopping gradients at denoising network inputs to prevent gradient explosion during back-propagation, and 2) training a strategically sampled subset of denoising steps to improve training efficiency. They compare DRTune with existing reward training methods on a variety of reward functions, including aesthetic scores, CLIPScore, PickScore, symmetry, compressibility, and objectness. |
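The two techniques can be sketched in a sampling loop as follows; `unet`, `scheduler_step`, and `reward_fn` are placeholders, the evenly spaced choice of supervised steps is an assumption, and this is not the official implementation:

```python
import torch

def drtune_sample(unet, scheduler_step, reward_fn, prompt_emb, T=50, k=5, device="cpu"):
    """Illustrative DRTune-style reward tuning.

    Two sketched ideas: (1) the latent fed into the denoiser is detached, so
    gradients do not chain recursively through the whole trajectory;
    (2) only k denoising steps, spread over the schedule, contribute gradients."""
    trained_steps = set(range(0, T, max(T // k, 1)))     # evenly spaced supervised steps
    z = torch.randn(1, 4, 64, 64, device=device)
    for t in reversed(range(T)):
        z_in = z.detach()                                 # stop-gradient on the network input
        if t in trained_steps:
            eps = unet(z_in, t, prompt_emb)               # gradient flows through this step's weights
        else:
            with torch.no_grad():
                eps = unet(z_in, t, prompt_emb)
        z = scheduler_step(z, eps, t)
    loss = -reward_fn(z, prompt_emb)                      # supervise the final output
    loss.backward()
    return loss
```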
DRTune consistently outperforms baseline methods in optimizing various reward functions, particularly those demanding deep supervision for global image properties like symmetry. Additionally, the authors demonstrate the practical application of DRTune by fine-tuning Stable Diffusion XL 1.0 (SDXL 1.0) with the Human Preference Score v2.1 reward, creating Favorable Diffusion XL 1.0 (FDXL 1.0), which exhibits significantly improved image quality compared to SDXL 1.0 and even achieves comparable quality with Midjourney v5.2. |
The authors acknowledge the limitations of reward-based training, specifically the risk of reward hacking, where models might prioritize optimizing the reward function at the expense of overall image quality. They suggest exploring regularization techniques to mitigate this issue. Additionally, they recognize the potential negative social impact of advanced generative models, such as the creation of highly plausible misinformation and the amplification of biases present in the training data. Future work could focus on developing more robust reward functions and exploring methods to mitigate potential biases in training data. |
diffusion_model, reward, drtune, stable diffusion, image_generation, optimization, deep_learning, text-to-image |
2404.05674 |
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation |
Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, Xiao Yang |
In this paper, we present MoMA: an open-vocabulary, training-free
personalized image model that boasts flexible zero-shot capabilities. As
foundational text-to-image models rapidly evolve, the demand for robust
image-to-image translation grows. Addressing this need, MoMA specializes in
subject-driven personalized image generation. Utilizing an open-source,
Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as
both a feature extractor and a generator. This approach effectively synergizes
reference image and text prompt information to produce valuable image features,
facilitating an image diffusion model. To better leverage the generated
features, we further introduce a novel self-attention shortcut method that
efficiently transfers image features to an image diffusion model, improving the
resemblance of the target object in generated images. Remarkably, as a
tuning-free plug-and-play module, our model requires only a single reference
image and outperforms existing methods in generating images with high detail
fidelity, enhanced identity-preservation and prompt faithfulness. Our work is
open-source, thereby providing universal access to these advancements. |
This paper introduces MoMA, an open-vocabulary and training-free personalized image generation model that excels in producing high-fidelity images with preserved object identity while adhering to text prompts. |
This paper addresses the limitations of existing personalized image generation methods that require extensive tuning, are confined to specific domains, or lack detail fidelity. MoMA offers a more efficient and versatile approach to personalization by leveraging the power of MLLMs, making it more accessible and applicable to a wider range of image generation tasks. |
MoMA employs a multi-modal LLM adapter with fine-grained feature transfer. It utilizes a generative multi-modal decoder to extract and modify image features from a reference image based on the target text prompt. It also extracts object features from a white-background version of the reference image using the UNet's self-attention layers. These features are then injected into a pre-trained UNet during image generation. The model is pre-trained in two stages: first, the multi-modal decoder is trained to generate contextualized image embeddings, and then the decoupled attention modules in the UNet are optimized. |
MoMA demonstrates superior detail accuracy and faithfulness to the target object across varied backgrounds in recontextualization tasks. For texture modification, it effectively alters the texture while preserving other visual features. Notably, it achieves this without per-instance tuning, making it efficient and readily applicable. Experiments show MoMA outperforms existing tuning-free methods both qualitatively and quantitatively in terms of detail fidelity, identity preservation, and prompt adherence. It also demonstrates generalizability by successfully integrating with various community-trained diffusion models. |
The paper acknowledges limitations in generating images with rare subjects or those containing text, where details might be lost. Future work could explore techniques to improve the model's ability to handle such cases. Additionally, the paper highlights the potential misuse of the model for creating deceptive content and suggests careful consideration and implementation of safeguards before widespread deployment. |
diffusion_model, mllm, personalization, image_generation, open-vocabulary, tuning-free, image-to-image, self-attention |
2309.07254 |
Mitigate Replication and Copying in Diffusion Models with Generalized Caption and Dual Fusion Enhancement |
Chenghao Li, Dake Chen, Yuke Zhang, Peter A. Beerel |
While diffusion models demonstrate a remarkable capability for generating
high-quality images, their tendency to `replicate' training data raises privacy
concerns. Although recent research suggests that this replication may stem from
the insufficient generalization of training data captions and duplication of
training images, effective mitigation strategies remain elusive. To address
this gap, our paper first introduces a generality score that measures caption
generality and employs a large language model (LLM) to generalize training
captions. Subsequently, we leverage generalized captions and propose a novel
dual fusion enhancement approach to mitigate the replication of diffusion
models. Our empirical results demonstrate that our proposed methods can
significantly reduce replication by 43.5% compared to the original diffusion
model while maintaining the diversity and quality of generations. Code is
available at https://github.com/HowardLi0816/dual-fusion-diffusion. |
This paper tackles the privacy issue of data replication in diffusion models by proposing a method to quantify caption generality and a novel dual fusion enhancement training approach. |
This paper is significant as it addresses the growing privacy concerns regarding diffusion models replicating training data, which is crucial for the responsible development and deployment of such models. |
The authors introduce a "generality score" to measure caption generality and utilize LLMs to generate more general captions. They then propose a dual fusion enhancement approach that fuses specific object features with the original image in latent space and combines corresponding label embeddings with the caption. They evaluate their methods by fine-tuning Stable Diffusion v2.1 on a subset of LAION-2B and measuring replication score and FID. |
The proposed method significantly reduces replication by 43.5% compared to the baseline and outperforms other mitigation strategies while maintaining comparable generation quality and diversity. The paper also shows that using generalized captions generated by LLMs effectively reduces replication. |
The paper acknowledges a trade-off between reducing replication and maintaining image generation quality. Future work includes exploring the use of the generality score to guide caption generalization and iteratively enhance caption generality. |
diffusion_model, privacy, data_replication, llm, caption_generation, generality, fusion, stable diffusion |
2309.05793 |
PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models |
Li Chen, Mengyi Zhao, Yiheng Liu, Mingxu Ding, Yangyang Song, Shizun Wang, Xu Wang, Hao Yang, Jing Liu, Kang Du, Min Zheng |
Personalized text-to-image generation has emerged as a powerful and
sought-after tool, empowering users to create customized images based on their
specific concepts and prompts. However, existing approaches to personalization
encounter multiple challenges, including long tuning times, large storage
requirements, the necessity for multiple input images per identity, and
limitations in preserving identity and editability. To address these obstacles,
we present PhotoVerse, an innovative methodology that incorporates a
dual-branch conditioning mechanism in both text and image domains, providing
effective control over the image generation process. Furthermore, we introduce
facial identity loss as a novel component to enhance the preservation of
identity during training. Remarkably, our proposed PhotoVerse eliminates the
need for test time tuning and relies solely on a single facial photo of the
target identity, significantly reducing the resource cost associated with image
generation. After a single training phase, our approach enables generating
high-quality images within only a few seconds. Moreover, our method can produce
diverse images that encompass various scenes and styles. The extensive
evaluation demonstrates the superior performance of our approach, which
achieves the dual objectives of preserving identity and facilitating
editability. Project page: https://photoverse2d.github.io/ |
This paper introduces PhotoVerse, a novel method for personalized text-to-image generation that uses a dual-branch conditioning mechanism to enable fast generation and high-quality images using only a single reference image of the target identity. |
This paper addresses the limitations of existing personalized text-to-image generation methods, such as long tuning times, large storage requirements, and the need for multiple input images. It offers a faster and more user-friendly approach for incorporating specific individuals into diverse scenes with high fidelity. |
The paper proposes a dual-branch conditioning mechanism that combines improved identity textual embeddings and spatial concept cues through dual-modality adapters in both text and image domains. The method utilizes a pre-trained Stable Diffusion model and incorporates a novel facial identity loss component during training to enhance identity preservation. The approach employs lightweight adapters and fine-tunes only the cross-attention module of the UNet, resulting in fast and efficient personalization without the need for test-time tuning. |
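The facial identity loss can be sketched as one minus the cosine similarity between face embeddings of the generated and reference images; the face encoder below is a stand-in for a pretrained recognition network (e.g. an ArcFace-style model), and the paper's exact formulation may differ:

```python
import torch.nn.functional as F

def facial_identity_loss(face_encoder, generated, reference):
    """Illustrative identity-preservation loss: 1 - cosine similarity between
    face embeddings of the generated image and the reference photo."""
    emb_gen = F.normalize(face_encoder(generated), dim=-1)
    emb_ref = F.normalize(face_encoder(reference), dim=-1)
    return 1.0 - (emb_gen * emb_ref).sum(dim=-1).mean()
```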
PhotoVerse demonstrates superior performance in preserving identity attributes while enabling image editing, stylization, and new scene generation. It achieves high identity similarity across diverse ethnicities and produces high-quality images with sharp details and natural aesthetics. The method eliminates the need for test-time tuning and generates images in just a few seconds using a single reference image, significantly improving efficiency compared to existing methods. |
The authors acknowledge potential bias in pre-trained large models as a limitation. Future work could involve exploring methods to mitigate this bias and further enhance the generalization capabilities of the model. Additionally, incorporating control mechanisms for pose and composition could provide users with more fine-grained control over image generation. |
diffusion_model, text-to-image, personalization, identity_preservation, fast_generation, single_image, dual-branch_conditioning, adapter, facial_identity_loss, image_editing, stylization |
2312.08578 |
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions |
Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, Adriana Romero-Soriano |
Curation methods for massive vision-language datasets trade off between
dataset size and quality. However, even the highest quality of available
curated captions are far too short to capture the rich visual detail in an
image. To show the value of dense and highly-aligned image-text pairs, we
collect the Densely Captioned Images (DCI) dataset, containing 8012 natural
images human-annotated with mask-aligned descriptions averaging above 1000
words each. With precise and reliable captions associated with specific parts
of an image, we can evaluate vision-language models' (VLMs) understanding of
image content with a novel task that matches each caption with its
corresponding subcrop. As current models are often limited to 77 text tokens,
we also introduce a summarized version (sDCI) in which each caption length is
limited. We show that modern techniques that make progress on standard
benchmarks do not correspond with significant improvement on our sDCI based
benchmark. Lastly, we finetune CLIP using sDCI and show significant
improvements over the baseline despite a small training set. By releasing the
first human annotated dense image captioning dataset, we hope to enable the
development of new benchmarks or fine-tuning recipes for the next generation of
VLMs to come. |
The paper introduces the Densely Captioned Images (DCI) dataset, a collection of 8012 natural images with human-annotated, mask-aligned descriptions averaging over 1000 words each, enabling the evaluation of vision-language models' understanding of fine-grained image details. |
This paper is important because it addresses the limitations of existing vision-language datasets that rely on short, loosely-aligned captions, hindering the development and evaluation of models capable of deep visual-linguistic understanding. The introduction of DCI with its dense and aligned captions provides a valuable resource for benchmarking and advancing vision-language models. |
The authors first preprocessed images from the SA-1B dataset using the Segment Anything Model (SAM) to extract hierarchical submasks. Then, they employed a multi-stage crowdsourcing approach with qualification tasks and iterative feedback to ensure high-quality annotations. To fit existing model limitations, they used LLaMA2 to generate summarized captions and negatives within CLIP's token limit, resulting in the summarized DCI (sDCI) dataset. Finally, they evaluated several state-of-the-art VLMs on sDCI using novel benchmark tasks like Subcrop-Caption Matching (SCM) and negatives-based tests. |
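A minimal sketch of scoring a Subcrop-Caption Matching style task on precomputed embeddings (illustrative only; the benchmark's actual protocol has more detail):

```python
import torch
import torch.nn.functional as F

def subcrop_caption_matching_accuracy(crop_embs: torch.Tensor, caption_embs: torch.Tensor):
    """Fraction of captions whose most similar subcrop is the correct one.

    crop_embs: (N, d) image embeddings of N subcrops of one image.
    caption_embs: (N, d) text embeddings, where caption i describes crop i."""
    crops = F.normalize(crop_embs, dim=-1)
    caps = F.normalize(caption_embs, dim=-1)
    sims = caps @ crops.T                        # (N, N) caption-to-crop similarities
    pred = sims.argmax(dim=1)
    return (pred == torch.arange(len(caps))).float().mean().item()
```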
The results show that existing VLMs, even those trained with negatives or dense captions, struggle to accurately match captions to corresponding subregions within an image, highlighting limitations in fine-grained understanding. Additionally, fine-tuning CLIP on sDCI significantly improved performance on benchmarks like ARO and VL-Checklist, outperforming models trained on significantly larger but loosely-aligned datasets like DAC. These findings underscore the importance of dense and aligned image-text pairs for effective VLM training. |
The authors acknowledge limitations in using LLM-generated summaries, which may not capture all the nuances of the full annotations, and the limited text context length of current VLMs. They suggest future work exploring models with larger context windows to leverage the full DCI dataset, and investigating techniques like bitext mining to expand the dataset further. |
vlm, clip, dense_captioning, dataset, benchmark, image-text_alignment, fine-tuning
2403.12143 |
Graph Neural Networks for Learning Equivariant Representations of Neural Networks |
Miltiadis Kofinas, Boris Knyazev, Yan Zhang, Yunlu Chen, Gertjan J. Burghouts, Efstratios Gavves, Cees G. M. Snoek, David W. Zhang |
Neural networks that process the parameters of other neural networks find
applications in domains as diverse as classifying implicit neural
representations, generating neural network weights, and predicting
generalization errors. However, existing approaches either overlook the
inherent permutation symmetry in the neural network or rely on intricate
weight-sharing patterns to achieve equivariance, while ignoring the impact of
the network architecture itself. In this work, we propose to represent neural
networks as computational graphs of parameters, which allows us to harness
powerful graph neural networks and transformers that preserve permutation
symmetry. Consequently, our approach enables a single model to encode neural
computational graphs with diverse architectures. We showcase the effectiveness
of our method on a wide range of tasks, including classification and editing of
implicit neural representations, predicting generalization performance, and
learning to optimize, while consistently outperforming state-of-the-art
methods. The source code is open-sourced at
https://github.com/mkofinas/neural-graphs. |
This paper introduces a novel approach to representing neural networks as computational graphs of parameters called 'neural graphs', which allows for leveraging powerful graph neural networks and transformers while preserving permutation symmetry. |
This research is important because it addresses limitations of existing methods that process neural network parameters, such as overlooking inherent permutation symmetry or relying on complex weight-sharing patterns. By representing neural networks as graphs, this approach allows a single model to learn from diverse architectures and opens up new possibilities for applications like neural network analysis, generation, and optimization. |
The authors represent neural networks as graphs by mapping neurons to nodes and connections to edges, with weights and biases as edge and node features respectively. This representation is then used as input to graph neural networks (GNNs) or transformers, adapted to incorporate inductive biases from the neural graph structure. They validate their approach on various tasks including implicit neural representation classification and editing, predicting generalization performance of CNNs, and learning to optimize. |
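The neuron-to-node, weight-to-edge construction can be sketched for a small MLP as below; the paper's featurization is richer, so treat this as an illustration of the graph layout only:

```python
import torch
import torch.nn as nn

def mlp_to_neural_graph(mlp: nn.Sequential):
    """Illustrative conversion of an MLP into a 'neural graph': one node per
    neuron (bias as node feature), one edge per connection (weight as edge
    feature). Permuting hidden neurons permutes nodes, so permutation symmetry
    is preserved by any GNN operating on this graph."""
    linears = [m for m in mlp if isinstance(m, nn.Linear)]
    sizes = [linears[0].in_features] + [l.out_features for l in linears]
    offsets = torch.cumsum(torch.tensor([0] + sizes[:-1]), dim=0)
    node_feat = torch.zeros(sum(sizes), 1)                 # biases (0 for input nodes)
    edge_index, edge_feat = [], []
    for layer, lin in enumerate(linears):
        src0, dst0 = offsets[layer].item(), offsets[layer + 1].item()
        node_feat[dst0:dst0 + lin.out_features, 0] = lin.bias.detach()
        for j in range(lin.out_features):                   # edges carry the weights
            for i in range(lin.in_features):
                edge_index.append((src0 + i, dst0 + j))
                edge_feat.append(float(lin.weight[j, i]))
    return node_feat, torch.tensor(edge_index).T, torch.tensor(edge_feat)

# Example: a 2-3-1 MLP becomes a graph with 6 nodes and 9 weighted edges.
graph = mlp_to_neural_graph(nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 1)))
```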
The proposed neural graph approach consistently outperforms state-of-the-art methods on tasks like INR classification and style editing, showing significant improvement over previous methods like DWSNet and NFN. It also demonstrates superior performance in predicting CNN generalization, especially when dealing with diverse architectures where accounting for both parameters and architecture is crucial. Furthermore, the method shows promise in the field of learning to optimize, achieving strong performance on both validation and test tasks. |
The authors acknowledge limitations in terms of architectural diversity explored, focusing mainly on MLPs and CNNs. Future work could investigate the representation of other architectures like transformers. Additionally, the strong performance on INRs is currently limited to 2D images, and extending it to handle 3D representations like neural radiance fields is an area for further exploration. |
diffusion_model, gan, analysis, 3d, interpretability, neural_network, graph_neural_network, transformer, representation_learning, permutation_symmetry, implicit_neural_representation, generalization, learning_to_optimize |
2405.02730 |
U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers |
Yuchuan Tian, Zhijun Tu, Hanting Chen, Jie Hu, Chao Xu, Yunhe Wang |
Diffusion Transformers (DiTs) introduce the transformer architecture to
diffusion tasks for latent-space image generation. With an isotropic
architecture that chains a series of transformer blocks, DiTs demonstrate
competitive performance and good scalability; meanwhile, the abandonment of the
U-Net by DiTs and their follow-up improvements is worth rethinking. To this
end, we conduct a simple toy experiment comparing a U-Net-architectured DiT
with an isotropic one. It turns out that the U-Net architecture gains only a
slight advantage despite the U-Net inductive bias, indicating potential
redundancies within the U-Net-style DiT. Inspired by the discovery that U-Net
backbone features are low-frequency-dominated, we perform token downsampling on
the query-key-value tuple for self-attention and bring further improvements
despite a considerable amount of reduction in computation. Based on
self-attention with downsampled tokens, we propose a series of U-shaped DiTs
(U-DiTs) in the paper and conduct extensive experiments to demonstrate the
extraordinary performance of U-DiT models. The proposed U-DiT could outperform
DiT-XL/2 with only 1/6 of its computation cost. Codes are available at
https://github.com/YuchuanTian/U-DiT. |
This paper introduces U-DiT, a U-Net architecture diffusion transformer model for latent-space image generation that leverages token downsampling in self-attention to improve performance and reduce computational cost compared to isotropic DiT models. |
The paper is important as it challenges the prevailing use of isotropic architectures in diffusion transformers by demonstrating the potential of U-Net architecture combined with a novel downsampled self-attention mechanism, leading to state-of-the-art performance with reduced computational costs. |
The authors conducted a toy experiment comparing a simple U-Net DiT with an isotropic DiT and found that while U-Net offered some benefits, it was underutilized. They then introduced downsampled self-attention, reducing redundancy by focusing on low-frequency components in the U-Net backbone. They scaled this model up, creating U-DiT and evaluating it against existing DiT models on ImageNet 256x256, measuring FID, sFID, IS, precision, and recall. |
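A simplified sketch of attending over downsampled tokens is shown below; it pools the token map 2x before self-attention and upsamples afterwards, which conveys the cost-reduction idea but is not the exact U-DiT block (the paper forms and merges downsampled token groups differently):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampledSelfAttention(nn.Module):
    """Self-attention over a 2x-downsampled token map: 4x fewer tokens means
    roughly 16x cheaper attention, exploiting the low-frequency dominance of
    U-Net backbone features (a simplified sketch of the idea)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) latent feature map
        B, C, H, W = x.shape
        x_ds = F.avg_pool2d(x, 2)                               # (B, C, H/2, W/2)
        tokens = x_ds.flatten(2).transpose(1, 2)                # (B, HW/4, C)
        out, _ = self.attn(tokens, tokens, tokens)              # attention on downsampled tokens
        out = out.transpose(1, 2).reshape(B, C, H // 2, W // 2)
        return F.interpolate(out, size=(H, W), mode="nearest")  # restore spatial resolution

# Example usage on a 32x32 latent with 256 channels.
y = DownsampledSelfAttention(256)(torch.randn(2, 256, 32, 32))
```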
U-DiT significantly outperforms isotropic DiTs, achieving better FID scores with fewer FLOPs. For example, U-DiT-B surpasses DiT-XL/2 in performance with only 1/6th of the computational cost. This highlights the efficacy of the U-Net architecture and downsampled self-attention for efficient and high-quality image generation. |
The authors acknowledge limitations in exploring the full potential of U-DiTs due to computational resource constraints and a tight schedule, suggesting further scaling of model size and extending training iterations as future work. |
diffusion_model, transformer, u-net, image_generation, latent_space, self-attention, downsampling, computational_efficiency |
2403.18978 |
TextCraftor: Your Text Encoder Can be Image Quality Controller |
Yanyu Li, Xian Liu, Anil Kag, Ju Hu, Yerlan Idelbayev, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov, Jian Ren |
Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have
revolutionized the field of content generation, enabling significant
advancements in areas like image editing and video synthesis. Despite their
formidable capabilities, these models are not without their limitations. It is
still challenging to synthesize an image that aligns well with the input text,
and multiple runs with carefully crafted prompts are required to achieve
satisfactory results. To mitigate these limitations, numerous studies have
endeavored to fine-tune the pre-trained diffusion models, i.e., UNet, utilizing
various technologies. Yet, amidst these efforts, a pivotal question of
text-to-image diffusion model training has remained largely unexplored: Is it
possible and feasible to fine-tune the text encoder to improve the performance
of text-to-image diffusion models? Our findings reveal that, instead of
replacing the CLIP text encoder used in Stable Diffusion with other large
language models, we can enhance it through our proposed fine-tuning approach,
TextCraftor, leading to substantial improvements in quantitative benchmarks and
human assessments. Interestingly, our technique also empowers controllable
image generation through the interpolation of different text encoders
fine-tuned with various rewards. We also demonstrate that TextCraftor is
orthogonal to UNet finetuning, and can be combined to further improve
generative quality. |
This paper introduces TextCraftor, a novel method for enhancing text-to-image diffusion models by fine-tuning the text encoder using reward functions, leading to improved image quality and text-image alignment. |
This paper is important because it addresses the limitations of existing text-to-image diffusion models in generating images that accurately reflect input text prompts. It offers a more efficient alternative to replacing the entire text encoder or relying on manual prompt engineering, which are computationally expensive or require human effort. |
The authors propose two techniques: 1) Directly fine-tuning with reward: This involves using a reward model to directly assess the quality of images generated from noisy latents. 2) Prompt-based fine-tuning: This addresses limitations of the first technique by using the denoising process to obtain a more accurate final image for reward prediction. They utilize various reward functions like aesthetic scores, text-image alignment scores, and CLIP similarity to guide the fine-tuning process. |
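To illustrate the prompt-based variant, here is a hedged sketch of a single fine-tuning step under our own simplifying assumptions: the UNet, scheduler, and VAE decoder are frozen callables, only the text encoder is trainable, and a differentiable reward model scores the decoded image. All helper names are hypothetical, not the paper's API.
```python
import torch

def textcraftor_step(text_encoder, unet, vae_decode, reward_model,
                     scheduler_step, prompt_ids, timesteps, latents, optimizer):
    """One prompt-based fine-tuning step: denoise with the current text
    embeddings, decode, and backpropagate the reward into the text encoder.
    All callables other than `text_encoder` are assumed frozen."""
    text_emb = text_encoder(prompt_ids)              # the only trainable path
    for t in timesteps:                              # short differentiable denoising chain
        noise_pred = unet(latents, t, text_emb)
        latents = scheduler_step(noise_pred, t, latents)
    image = vae_decode(latents)
    loss = -reward_model(image, prompt_ids).mean()   # maximize the differentiable reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```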
TextCraftor significantly improves image quality and text-image alignment compared to baseline models like SDv1.5 and SDv2.0, even outperforming larger models like SDXL Base 0.9 and DeepFloyd-XL in some aspects. It achieves better quantitative scores on Parti-Prompts and HPSv2 benchmarks, and human evaluations confirm the superiority of generated images. TextCraftor also enables controllable image generation through interpolation of different fine-tuned text encoders. |
The authors acknowledge limitations in reward models and the potential for mode collapse. They suggest exploring encoding reward function styles into text encoder tokens as future work. |
diffusion_model, text-to-image, image_generation, text_encoder, fine-tuning, reward_function, controllable_generation |
2312.05239 |
SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation |
Thuan Hoang Nguyen, Anh Tran |
Despite their ability to generate high-resolution and diverse images from
text prompts, text-to-image diffusion models often suffer from slow iterative
sampling processes. Model distillation is one of the most effective directions
to accelerate these models. However, previous distillation methods fail to
retain the generation quality while requiring a significant amount of images
for training, either from real data or synthetically generated by the teacher
model. In response to this limitation, we present a novel image-free
distillation scheme named $\textbf{SwiftBrush}$. Drawing inspiration from
text-to-3D synthesis, in which a 3D neural radiance field that aligns with the
input prompt can be obtained from a 2D text-to-image diffusion prior via a
specialized loss without the use of any 3D data ground-truth, our approach
re-purposes that same loss for distilling a pretrained multi-step text-to-image
model to a student network that can generate high-fidelity images with just a
single inference step. In spite of its simplicity, our model stands as one of
the first one-step text-to-image generators that can produce images of
comparable quality to Stable Diffusion without reliance on any training image
data. Remarkably, SwiftBrush achieves an FID score of $\textbf{16.67}$ and a
CLIP score of $\textbf{0.29}$ on the COCO-30K benchmark, achieving competitive
results or even substantially surpassing existing state-of-the-art distillation
techniques. |
This paper presents SwiftBrush, a novel image-free distillation method for text-to-image diffusion models that enables single-step, high-fidelity image generation without relying on any training image data. |
This paper is important because it addresses the slow inference speed of traditional text-to-image diffusion models by enabling single-step generation while maintaining high fidelity, which is crucial for deployment on consumer devices and broader accessibility. |
The authors draw inspiration from text-to-3D synthesis techniques and adapt Variational Score Distillation (VSD) for text-to-image generation. They employ a pretrained text-to-image teacher model and an additional trainable LoRA teacher model to guide the learning of a student model that can generate images from text prompts in a single step. The student model is trained without using any image data, relying solely on text captions and a specialized loss function. |
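The following is a hedged sketch of the core VSD-style update as we read it, not the released code: the frozen teacher and an additional LoRA teacher score a noised version of the student's one-step output, and their difference drives the student's gradient. Callable names, the omitted timestep weighting, and the separate denoising update of the LoRA teacher (trained on student outputs) are our simplifications.
```python
import torch

def vsd_student_loss(student, teacher, lora_teacher, z, prompt_emb, alphas_cumprod):
    """Distillation loss for a one-step student; `teacher` is frozen, and the
    LoRA teacher is updated in alternation with a denoising loss on student outputs."""
    x0 = student(z, prompt_emb)                            # one-step generation
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise           # re-noise the sample
    with torch.no_grad():
        eps_real = teacher(x_t, t, prompt_emb)             # score of the data distribution
        eps_fake = lora_teacher(x_t, t, prompt_emb)        # score of the student's distribution
    grad = eps_real - eps_fake                             # VSD gradient direction (weighting omitted)
    # Surrogate whose autograd gradient w.r.t. the student equals `grad`.
    return (grad * x0).mean()
```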
SwiftBrush achieves promising zero-shot results on benchmarks like COCO 2014 and Human Preference Score v2, surpassing existing one-step image generation methods in quality while being more efficient and requiring significantly less training time. Notably, SwiftBrush achieves an FID score of 16.67 and a CLIP score of 0.29 on the COCO-30K benchmark without using any training image data. |
The authors acknowledge that SwiftBrush, while efficient, may produce slightly lower quality images compared to multi-step teacher models. Future work could focus on extending SwiftBrush to support few-step generation, exploring single-teacher distillation, and integrating techniques like DreamBooth, ControlNet, or InstructPix2Pix for enhanced control and application. |
diffusion_model, distillation, text-to-image, one-step generation, image-free, gan, nerf, sds, vsd, lora |
2404.18861 |
A Survey on Vision Mamba: Models, Applications and Challenges |
Rui Xu, Shu Yang, Yihui Wang, Bo Du, Hao Chen |
Mamba, a recent selective structured state space model, performs excellently
on long sequence modeling tasks. Mamba mitigates the modeling constraints of
convolutional neural networks and offers advanced modeling capabilities similar
to those of Transformers, through global receptive fields and dynamic
weighting. Crucially, it achieves this without incurring the quadratic
computational complexity typically associated with Transformers. Due to its
advantages over the former two mainstream foundation models, Mamba exhibits
great potential to be a visual foundation model. Researchers are actively
applying Mamba to various computer vision tasks, leading to numerous emerging
works. To help keep pace with the rapid advancements in computer vision, this
paper aims to provide a comprehensive review of visual Mamba approaches. This
paper begins by delineating the formulation of the original Mamba model.
Subsequently, our review of visual Mamba delves into several representative
backbone networks to elucidate the core insights of the visual Mamba. We then
categorize related works using different modalities, including image, video,
point cloud, multi-modal, and others. Specifically, for image applications, we
further organize them into distinct tasks to facilitate a more structured
discussion. Finally, we discuss the challenges and future research directions
for visual Mamba, providing insights for future research in this quickly
evolving area. A comprehensive list of visual Mamba models reviewed in this
work is available at https://github.com/Ruixxxx/Awesome-Vision-Mamba-Models. |
This paper presents a comprehensive survey of Vision Mamba, a novel and efficient neural network architecture for visual tasks, exploring its underlying principles, diverse applications across various visual domains, and outlining future research directions. |
This survey is important because it examines the rapid advancements and growing influence of Vision Mamba in the computer vision field, providing a timely and valuable resource for researchers to understand the core concepts, explore applications, and contribute to its ongoing development. |
The authors provide a structured analysis of Vision Mamba by first introducing its foundational principles, followed by in-depth examinations of representative backbone networks and categorizing its applications based on visual modalities such as image, video, multi-modal data, and point clouds. The paper concludes by critically analyzing the challenges and outlining future research directions. |
The paper highlights Vision Mamba's effectiveness in various computer vision tasks, including classification, segmentation, generation, and restoration, across diverse domains like medical imaging, remote sensing, and video understanding. It also provides insights into how different visual Mamba models address the unique characteristics of visual data and discusses their performance compared to traditional convolutional neural networks and Transformers. |
The paper identifies key limitations of Vision Mamba, including stability issues when scaling to large datasets, challenges in adapting causal scanning mechanisms to non-causal visual data, potential loss of spatial information during 1D scanning, information redundancy and increased computational demands due to multi-directional scanning, and the need for enhanced interpretability, generalization ability, and robustness. Future research directions include developing more efficient scanning techniques and fusion methods, optimizing computational efficiency, and exploring applications in data-efficient learning, high-resolution data analysis, multi-modal learning, and in-context learning. |
mamba, state_space_model, vision_transformer, literature_review, analysis, video, point_cloud, multi-modal, 3d
2310.12036 |
A General Theoretical Paradigm to Understand Learning from Human Preferences |
Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos |
The prevalent deployment of learning from human preferences through
reinforcement learning (RLHF) relies on two important approximations: the first
assumes that pairwise preferences can be substituted with pointwise rewards.
The second assumes that a reward model trained on these pointwise rewards can
generalize from collected data to out-of-distribution data sampled by the
policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an
approach that bypasses the second approximation and learns a policy directly
from collected data without the reward modelling stage. However, this method
still heavily relies on the first approximation.
In this paper we try to gain a deeper theoretical understanding of these
practical algorithms. In particular we derive a new general objective called
$\Psi$PO for learning from human preferences that is expressed in terms of
pairwise preferences and therefore bypasses both approximations. This new
general objective allows us to perform an in-depth analysis of the behavior of
RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential
pitfalls. We then consider another special case for $\Psi$PO by setting $\Psi$
simply to Identity, for which we can derive an efficient optimisation
procedure, prove performance guarantees and demonstrate its empirical
superiority to DPO on some illustrative examples. |
This paper presents a theoretical framework, called Ψ-preference optimization (ΨPO), for learning from human preferences, unifying existing methods like RLHF and DPO and highlighting potential pitfalls. |
The paper addresses the lack of theoretical understanding of current preference learning methods, despite their practical success, particularly in aligning large language models with human preferences. |
The authors introduce ΨPO as a general objective function, analyze specific cases like RLHF and DPO, identify potential overfitting issues, and propose a simplified variant, Identity-PO (IPO), with a computationally efficient algorithm. |
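For concreteness, a minimal sketch of the objective in the identity case, assuming per-completion log-probabilities have already been computed; tensor names and the regression form are our rendering of the paper's loss.
```python
import torch

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau: float) -> torch.Tensor:
    """logp_*: sequence log-probs of the preferred (w) / dispreferred (l)
    completions under the policy; ref_logp_*: the same under the frozen
    reference policy; tau: regularization strength toward the reference."""
    # Log-ratio margin between the preferred and dispreferred completions.
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # IPO regresses this margin to 1/(2*tau); unlike the Bradley-Terry-based
    # DPO objective, it does not reward pushing the margin to infinity.
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()
```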
The paper shows that ΨPO generalizes RLHF and DPO, both vulnerable to overfitting due to their reliance on the Bradley-Terry model. The proposed IPO method, using the identity mapping in ΨPO, avoids overfitting by directly optimizing regularized total preferences. Experiments on illustrative bandit examples demonstrate IPO's improved stability and adherence to the reference policy compared to DPO. |
While the paper provides a theoretical analysis and illustrative examples, future work should focus on scaling up IPO to more complex scenarios, such as training large language models on human preference data, to assess its real-world effectiveness. |
rlhf, dpo, llm, analysis, preference learning, overfitting, regularization, bandit, optimization |
2401.06805 |
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning |
Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang |
Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence
(AGI) with abstract reasoning ability is the goal of next-generation AI. Recent
advancements in Large Language Models (LLMs), along with the emerging field of
Multimodal Large Language Models (MLLMs), have demonstrated impressive
capabilities across a wide range of multimodal tasks and applications.
Particularly, various MLLMs, each with distinct model architectures, training
data, and training stages, have been evaluated across a broad range of MLLM
benchmarks. These studies have, to varying degrees, revealed different aspects
of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs
have not been systematically investigated. In this survey, we comprehensively
review the existing evaluation protocols of multimodal reasoning, categorize
and illustrate the frontiers of MLLMs, introduce recent trends in applications
of MLLMs on reasoning-intensive tasks, and finally discuss current practices
and future directions. We believe our survey establishes a solid base and sheds
light on this important topic, multimodal reasoning. |
This paper surveys the current state of multimodal reasoning in Multimodal Large Language Models (MLLMs), exploring their architectures, training methods, and performance on various reasoning tasks. |
This paper is important because it provides a comprehensive overview of the rapidly developing field of MLLMs, focusing specifically on their reasoning abilities which are crucial for achieving artificial general intelligence. |
The authors reviewed existing literature on MLLMs, analyzed their architectures, training datasets, and performance on various reasoning benchmarks, and categorized the applications of these models. |
The paper highlights that while MLLMs have shown impressive capabilities in multimodal tasks, their reasoning abilities still lag behind proprietary models like GPT-4V. The authors identified key factors contributing to the superior performance of some MLLMs, including unfreezing the language model during training, improving visual representations, and utilizing multi-task supervised learning. |
The paper points out limitations in current MLLM architectures, training efficiency, long-context support, instruction fine-tuning data, and evaluation benchmarks. It suggests future research directions, including developing more robust architectures, efficient training methods, long-context support mechanisms, improved instruction datasets, and more comprehensive evaluation benchmarks. |
mllm, llm, multimodal_reasoning, instruction_tuning, in-context_learning, analysis, literature_review, embodied_ai, tool_usage |
2404.19227 |
Espresso: Robust Concept Filtering in Text-to-Image Models |
Anudeep Das, Vasisht Duddu, Rui Zhang, N. Asokan |
Diffusion-based text-to-image (T2I) models generate high-fidelity images for
given textual prompts. They are trained on large datasets scraped from the
Internet, potentially containing unacceptable concepts (e.g., copyright
infringing or unsafe). Retraining T2I models after filtering out unacceptable
concepts in the training data is inefficient and degrades utility. Hence, there
is a need for concept removal techniques (CRTs) which are effective in removing
unacceptable concepts, utility-preserving on acceptable concepts, and robust
against evasion with adversarial prompts. None of the prior filtering and
fine-tuning CRTs satisfy all these requirements simultaneously.
We introduce Espresso, the first robust concept filter based on Contrastive
Language-Image Pre-Training (CLIP). It identifies unacceptable concepts by
projecting the generated image's embedding onto the vector connecting
unacceptable and acceptable concepts in the joint text-image embedding space.
This ensures robustness by restricting the adversary to adding noise only along
this vector, in the direction of the acceptable concept. Further fine-tuning
Espresso to separate embeddings of acceptable and unacceptable concepts, while
preserving their pairing with image embeddings, ensures both effectiveness and
utility. We evaluate Espresso on eleven concepts to show that it is effective
(~5% CLIP accuracy on unacceptable concepts), utility-preserving (~93%
normalized CLIP score on acceptable concepts), and robust (~4% CLIP accuracy on
adversarial prompts for unacceptable concepts). Finally, we present theoretical
bounds for the certified robustness of Espresso against adversarial prompts,
and an empirical analysis. |
This paper introduces Espresso, a robust concept filtering technique for text-to-image (T2I) models that uses Contrastive Language-Image Pre-Training (CLIP) to identify and suppress the generation of unacceptable concepts in images. |
This paper addresses the crucial need for robust and utility-preserving concept removal techniques in T2I models. The presence of unacceptable concepts (e.g., copyrighted material, inappropriate content) in T2I outputs poses significant ethical and legal challenges. Existing methods either compromise utility for effectiveness or lack robustness against adversarial prompts. Espresso offers a novel approach that balances all three requirements, making it a valuable contribution to the field of safe and responsible T2I generation. |
The authors developed Espresso by modifying CLIP's classification objective to consider the cosine similarity of a generated image's embedding to both acceptable and unacceptable concept embeddings. This projection onto a lower-dimensional vector connecting the concepts enhances robustness. Further, they fine-tune Espresso to separate embeddings of acceptable and unacceptable concepts while preserving their pairing with image embeddings, ensuring effectiveness and utility. They evaluate Espresso's performance on eleven concepts, comparing it to six state-of-the-art fine-tuning concept removal techniques and one filtering technique. They also present theoretical bounds for certified robustness and empirical analysis. |
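A hedged sketch of the filtering decision, under the assumption of unit-normalized CLIP embeddings and a single acceptable/unacceptable concept pair; the temperature and threshold values are illustrative, not the paper's settings.
```python
import torch
import torch.nn.functional as F

def is_unacceptable(img_emb: torch.Tensor,
                    acc_emb: torch.Tensor,
                    unacc_emb: torch.Tensor,
                    temperature: float = 0.01,
                    threshold: float = 0.5) -> torch.Tensor:
    # Cosine similarities of the generated image to each concept's text embedding.
    sims = torch.stack([
        F.cosine_similarity(img_emb, acc_emb, dim=-1),
        F.cosine_similarity(img_emb, unacc_emb, dim=-1),
    ], dim=-1)
    # Softmax over the two concepts: the decision effectively depends only on the
    # image embedding's projection onto the vector joining the two concept
    # embeddings, which is what restricts the adversary's freedom.
    probs = torch.softmax(sims / temperature, dim=-1)
    return probs[..., 1] > threshold  # True => filter the image
```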
Espresso demonstrates effectiveness in suppressing unacceptable concepts, achieving a low CLIP accuracy on unacceptable prompts. It maintains high utility on acceptable prompts, showing comparable normalized CLIP scores to other techniques. Importantly, Espresso exhibits strong robustness against various adversarial attacks, including Typo+, PEZ+, CCE/CCE+, and RingBell+, outperforming existing techniques. The empirical evaluation of certified robustness further supports Espresso's resilience to adversarial noise in image embeddings. |
The paper acknowledges the limitations of the current certified robustness bound, which is loose compared to the distance between acceptable and unacceptable images. Future work involves tightening this bound and exploring adversarial training to further enhance robustness. Additionally, the paper suggests extending Espresso to handle multiple concept filtering simultaneously and optimizing it for filtering artistic styles, which currently poses a challenge due to the similarity of concept embeddings. |
diffusion_model, tti, clip, analysis, adversarial_attack, interpretability, robustness, concept_filtering, safety |
2403.17377 |
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance |
Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, Seungryong Kim |
Recent studies have demonstrated that diffusion models are capable of
generating high-quality samples, but their quality heavily depends on sampling
guidance techniques, such as classifier guidance (CG) and classifier-free
guidance (CFG). These techniques are often not applicable in unconditional
generation or in various downstream tasks such as image restoration. In this
paper, we propose a novel sampling guidance, called Perturbed-Attention
Guidance (PAG), which improves diffusion sample quality across both
unconditional and conditional settings, achieving this without requiring
additional training or the integration of external modules. PAG is designed to
progressively enhance the structure of samples throughout the denoising
process. It involves generating intermediate samples with degraded structure by
substituting selected self-attention maps in diffusion U-Net with an identity
matrix, exploiting the self-attention mechanism's ability to capture
structural information, and guiding the denoising process away from these
degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves
sample quality in conditional and even unconditional scenarios. Moreover, PAG
significantly improves the baseline performance in various downstream tasks
where existing guidances such as CG or CFG cannot be fully utilized, including
ControlNet with empty prompts and image restoration such as inpainting and
deblurring. |
This paper introduces Perturbed-Attention Guidance (PAG), a novel sampling guidance method for diffusion models that enhances sample quality by perturbing self-attention maps during the denoising process, eliminating the need for additional training or external modules. |
The paper addresses the limitations of existing guidance techniques like Classifier Guidance (CG) and Classifier-Free Guidance (CFG), which often lack applicability in unconditional generation or specific downstream tasks. PAG offers a more versatile approach, enhancing sample quality in both conditional and unconditional scenarios without requiring extra training or external components. |
PAG leverages the observation that self-attention maps in diffusion U-Nets capture structural information. The method perturbs these maps by replacing them with identity matrices, creating intermediate samples with degraded structures. This 'undesirable' path guides the denoising process towards generating samples with superior structural coherence and realism. |
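A minimal sketch of the guidance update, assuming the denoiser exposes a hypothetical flag that swaps the selected self-attention maps for the identity (so each token attends only to itself); the flag name and the scale value are illustrative.
```python
import torch

@torch.no_grad()
def pag_epsilon(eps_model, x_t, t, cond, scale: float = 3.0):
    """`perturb_attention` is a hypothetical flag that replaces the selected
    self-attention maps with the identity matrix."""
    eps_normal = eps_model(x_t, t, cond)
    eps_perturbed = eps_model(x_t, t, cond, perturb_attention=True)  # structurally degraded path
    # Guide the denoising away from the structurally degraded prediction.
    return eps_normal + scale * (eps_normal - eps_perturbed)
```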
PAG significantly improves sample quality in both ADM and Stable Diffusion, evident in enhanced FID and IS scores, particularly in unconditional generation where CFG is inapplicable. PAG also complements CFG, leading to further quality improvements when used in conjunction. The method's efficacy extends to downstream tasks like image restoration (PSLD) and spatially conditioned generation (ControlNet), demonstrating its versatility. |
The authors acknowledge limitations such as potential over-saturation at high guidance scales and the computational overhead of two forward passes per generation step. Future work could focus on mitigating these limitations by exploring techniques for efficient guidance computation and hyperparameter optimization. |
diffusion_model, guidance, self-attention, unconditional_generation, image_restoration, controlnet, sample_quality, pag |
2311.16973 |
DemoFusion: Democratising High-Resolution Image Generation With No $$$ |
Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, Zhanyu Ma |
High-resolution image generation with Generative Artificial Intelligence
(GenAI) has immense potential but, due to the enormous capital investment
required for training, it is increasingly centralised to a few large
corporations, and hidden behind paywalls. This paper aims to democratise
high-resolution GenAI by advancing the frontier of high-resolution generation
while remaining accessible to a broad audience. We demonstrate that existing
Latent Diffusion Models (LDMs) possess untapped potential for higher-resolution
image generation. Our novel DemoFusion framework seamlessly extends open-source
GenAI models, employing Progressive Upscaling, Skip Residual, and Dilated
Sampling mechanisms to achieve higher-resolution image generation. The
progressive nature of DemoFusion requires more passes, but the intermediate
results can serve as "previews", facilitating rapid prompt iteration. |
This paper introduces DemoFusion, a method for generating high-resolution images from pre-trained Latent Diffusion Models (LDMs) like SDXL without requiring additional training. |
This paper is important because it addresses the increasing centralization and paywalling of high-resolution image generation by enabling access to this technology using consumer-grade hardware and open-source models. |
The authors propose DemoFusion, which extends MultiDiffusion with three key mechanisms: Progressive Upscaling for iteratively enhancing image resolution, Skip Residual for maintaining global consistency, and Dilated Sampling for increasing global semantic coherence during image generation. |
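As an illustration of the skip-residual mechanism, here is a hedged sketch: at each denoising step of a higher-resolution stage, the running latent is blended with a re-noised copy of the upsampled previous-stage result, with a weight that decays as denoising progresses. The schedule and exponent are our assumptions, not the paper's exact coefficients.
```python
import math
import torch

def skip_residual(z_t, z_prev_upsampled, t, T, alphas_cumprod, alpha: float = 3.0):
    a = alphas_cumprod[t]
    # Re-noise the previous stage's (upsampled) clean latent to timestep t.
    noised_prev = a.sqrt() * z_prev_upsampled + (1 - a).sqrt() * torch.randn_like(z_t)
    # Cosine-decayed blending weight: strong global guidance early, none late.
    c = ((1 + math.cos(math.pi * (T - t) / T)) / 2) ** alpha
    return c * noised_prev + (1 - c) * z_t
```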
DemoFusion generates high-resolution images with better quality and coherence compared to baselines like MultiDiffusion and SDXL+BSRGAN, as evidenced by qualitative and quantitative comparisons using metrics such as FID, IS, and CLIP score. |
Limitations include longer inference time due to progressive upscaling and dependence on the underlying LDM's performance. Future work could involve training LDMs specifically for DemoFusion or exploring more efficient inference strategies. |
diffusion_model, image_generation, high_resolution, sdxl, progressive_upscaling, skip_residual, dilated_sampling |
2311.18828 |
One-step Diffusion with Distribution Matching Distillation |
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, Taesung Park |
Diffusion models generate high-quality images but require dozens of forward
passes. We introduce Distribution Matching Distillation (DMD), a procedure to
transform a diffusion model into a one-step image generator with minimal impact
on image quality. We enforce the one-step image generator match the diffusion
model at distribution level, by minimizing an approximate KL divergence whose
gradient can be expressed as the difference between 2 score functions, one of
the target distribution and the other of the synthetic distribution being
produced by our one-step generator. The score functions are parameterized as
two diffusion models trained separately on each distribution. Combined with a
simple regression loss matching the large-scale structure of the multi-step
diffusion outputs, our method outperforms all published few-step diffusion
approaches, reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot
COCO-30k, comparable to Stable Diffusion but orders of magnitude faster.
Utilizing FP16 inference, our model generates images at 20 FPS on modern
hardware. |
This paper introduces Distribution Matching Distillation (DMD), a method for converting a diffusion model into a one-step image generator with minimal quality loss by minimizing the KL divergence between real and generated image distributions using a pair of diffusion models. |
This paper is important because it addresses the slow sampling speed of diffusion models, enabling near-real-time image generation with quality comparable to traditional multi-step methods. |
The authors train a one-step generator with a distribution matching loss, estimated from scores derived from two diffusion models, and a regression loss based on a pre-computed dataset of noise-image pairs from the original diffusion model. |
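A schematic sketch of one DMD training iteration under our simplifications: a distribution-matching loss on the one-step generator (driven by the difference between the real and fake score models), a regression loss against precomputed noise-image pairs from the teacher, and a denoising update of the fake score model on generator samples. Loss weights and the use of MSE in place of a perceptual distance are illustrative.
```python
import torch
import torch.nn.functional as F

def dmd_iteration(generator, real_score, fake_score, gen_opt, fake_opt,
                  z, cond, z_paired, x_paired, cond_paired, alphas_cumprod,
                  reg_weight: float = 0.25):
    """One alternating update. `real_score` is the frozen teacher; `fake_score`
    is trained online on generator samples. Weights here are illustrative."""
    # --- generator update: distribution matching + paired regression -------
    x = generator(z, cond)
    t = torch.randint(0, len(alphas_cumprod), (x.shape[0],), device=x.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x + (1 - a).sqrt() * torch.randn_like(x)
    with torch.no_grad():
        # KL gradient direction: toward the real score, away from the fake one.
        grad = real_score(x_t, t, cond) - fake_score(x_t, t, cond)
    loss_dm = (grad * x).mean()                       # surrogate for the KL gradient
    # Regression against precomputed (noise, teacher-output) pairs; the paper
    # uses a perceptual distance, MSE is used here only for brevity.
    loss_reg = F.mse_loss(generator(z_paired, cond_paired), x_paired)
    gen_opt.zero_grad()
    (loss_dm + reg_weight * loss_reg).backward()
    gen_opt.step()

    # --- fake score model update: denoising loss on generator samples ------
    with torch.no_grad():
        x_fake = generator(z, cond)
    noise = torch.randn_like(x_fake)
    x_fake_t = a.sqrt() * x_fake + (1 - a).sqrt() * noise
    loss_fake = F.mse_loss(fake_score(x_fake_t, t, cond), noise)
    fake_opt.zero_grad()
    loss_fake.backward()
    fake_opt.step()
```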
DMD outperforms existing diffusion distillation techniques, achieving FIDs of 2.62 on ImageNet 64x64 and 11.49 on zero-shot COCO-30k, comparable to Stable Diffusion but significantly faster (20 FPS). |
Limitations include a minor quality gap compared to multi-step diffusion and challenges in generating text and fine details. Future work involves distilling more advanced models and exploring variable guidance scales. |
diffusion_model, distillation, image_generation, text-to-image, one-step, kl_divergence, score_matching |
2405.05846 |
Could It Be Generated? Towards Practical Analysis of Memorization in Text-To-Image Diffusion Models |
Zhe Ma, Xuhong Zhang, Qingming Li, Tianyu Du, Wenzhi Chen, Zonghui Wang, Shouling Ji |
The past few years have witnessed substantial advancement in text-guided
image generation powered by diffusion models. However, it was shown that
text-to-image diffusion models are vulnerable to training image memorization,
raising concerns on copyright infringement and privacy invasion. In this work,
we perform practical analysis of memorization in text-to-image diffusion
models. Targeting a set of images to protect, we conduct quantitive analysis on
them without need to collect any prompts. Specifically, we first formally
define the memorization of image and identify three necessary conditions of
memorization, respectively similarity, existence and probability. We then
reveal the correlation between the model's prediction error and image
replication. Based on the correlation, we propose to utilize inversion
techniques to verify the safety of target images against memorization and
measure the extent to which they are memorized. Model developers can utilize
our analysis method to discover memorized images or reliably claim safety
against memorization. Extensive experiments on the Stable Diffusion, a popular
open-source text-to-image diffusion model, demonstrate the effectiveness of our
analysis method. |
This paper presents a practical method for analyzing memorization in text-to-image diffusion models, focusing on identifying and quantifying the extent to which specific images are memorized. |
This paper addresses the risk of copyright infringement and privacy violation posed by memorization in text-to-image diffusion models trained on massive datasets. It offers a practical tool for model developers to assess and mitigate these risks, contributing to responsible AI development. |
The authors define three conditions for memorization: similarity, existence, and probability. They propose using the model's prediction error as a measure of image replication (similarity). To find prompts that trigger memorization (existence), they develop a prompt inversion algorithm with regularization to ensure realistic token embeddings. Lastly, they measure the extent of memorization (probability) by comparing the prediction error distribution of the target image under the inverted prompt with that of a safe, unconditional diffusion model. |
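A small sketch of the prediction-error signal, under our simplifications: the conditional denoiser's noise-prediction error on a target image, averaged over random timesteps, is compared against the same statistic from a safe unconditional reference model; callable names are ours.
```python
import torch

@torch.no_grad()
def prediction_error(eps_model, x0, cond, alphas_cumprod, n_samples: int = 64):
    """Average noise-prediction error on x0 under `cond` (e.g. an inverted prompt).
    An abnormally low value relative to an unconditional reference model
    indicates likely replication/memorization."""
    errs = []
    for _ in range(n_samples):
        t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
        a = alphas_cumprod[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
        errs.append(((eps_model(x_t, t, cond) - noise) ** 2).mean(dim=(1, 2, 3)))
    return torch.stack(errs).mean(dim=0)
```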
The paper demonstrates that the model's prediction error effectively identifies image replication. The proposed prompt inversion method can successfully trigger memorization for a significant portion of known memorized images. Moreover, the analysis reveals that unconditional diffusion models are generally safe from memorization, validating their use as a baseline for measuring memorization in conditional models. |
The authors acknowledge two limitations. First, their hard prompt inversion algorithm, although outperforming existing methods, is not entirely foolproof, especially for images requiring multiple key tokens. Second, the analysis focuses on text-to-image models, with further research needed for other conditional diffusion models. Future work could focus on improving hard prompt inversion and expanding the analysis to different types of conditional diffusion models. |
diffusion_model, memorization, analysis, text-to-image, security, privacy, copyright, inversion |
2403.06634 |
Stealing Part of a Production Language Model |
Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Eric Wallace, David Rolnick, Florian Tramèr |
We introduce the first model-stealing attack that extracts precise,
nontrivial information from black-box production language models like OpenAI's
ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding
projection layer (up to symmetries) of a transformer model, given typical API
access. For under \$20 USD, our attack extracts the entire projection matrix of
OpenAI's Ada and Babbage language models. We thereby confirm, for the first
time, that these black-box models have a hidden dimension of 1024 and 2048,
respectively. We also recover the exact hidden dimension size of the
gpt-3.5-turbo model, and estimate it would cost under \$2,000 in queries to
recover the entire projection matrix. We conclude with potential defenses and
mitigations, and discuss the implications of possible future work that could
extend our attack. |
This paper introduces the first model-stealing attack that extracts precise, nontrivial information from black-box production language models, recovering the embedding projection layer (up to symmetries) of transformer models such as OpenAI's Ada, Babbage, and gpt-3.5-turbo through typical API access. |
This paper is important because it demonstrates that deployed, black-box LLM APIs leak concrete architectural information: for under \$20, the attack extracts the full embedding projection matrices of OpenAI's Ada and Babbage models and confirms their hidden dimensions of 1024 and 2048, respectively. Exposing this class of vulnerability is a prerequisite for designing stronger defenses and understanding the broader security landscape of production LLMs. |
The authors formalize a threat model for extraction attacks against production APIs, show mathematically why the final embedding projection layer can be recovered from returned logits (the logit vectors lie in a subspace whose dimension equals the model's hidden size), describe attack variants for several API settings, validate them with white-box experiments and noise-robustness tests, and analyze the effect of components such as layer normalization. A toy sketch of the core rank-based observation follows. |
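A toy numpy sketch of that rank-based observation: full logit vectors lie in an h-dimensional subspace (h being the hidden size), so the singular values of a matrix of logits from many queries reveal h. The API is simulated locally here; real attacks must additionally work around restricted outputs such as top-k logprobs and logit bias.
```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, n_queries = 5000, 256, 1024
W = rng.normal(size=(vocab, hidden))       # the unknown embedding projection matrix

def api_logits() -> np.ndarray:
    """Stand-in for one API call that returns a full logit vector for some prompt."""
    h = rng.normal(size=hidden)             # final hidden state for the prompt
    return W @ h

Q = np.stack([api_logits() for _ in range(n_queries)])   # (n_queries, vocab)
s = np.linalg.svd(Q, compute_uv=False)
# Singular values collapse after index `hidden`: their count recovers the hidden
# size, and the corresponding singular vectors span W's column space up to a linear map.
est_hidden = int((s > s[0] * 1e-8).sum())
print(est_hidden)   # 256
```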
The attack succeeds against production systems: the entire projection matrix of OpenAI's Ada and Babbage models is extracted for under \$20, confirming hidden dimensions of 1024 and 2048; the exact hidden dimension of gpt-3.5-turbo is also recovered, and extracting its full projection matrix is estimated to cost under \$2,000 in queries. |
The attack recovers only the embedding projection layer rather than the full model, and the effectiveness of possible defenses remains to be evaluated. The authors discuss potential mitigations, such as limiting or perturbing the information returned by the API, and leave a systematic study of these defenses, as well as extensions of the attack to deeper layers, to future work. |
llm, security, adversarial_attack, model_extraction, vulnerability, layer_normalization, defense, robustness |
2312.06655 |
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior |
Fangfu Liu, Diankun Wu, Yi Wei, Yongming Rao, Yueqi Duan |
Recently, 3D content creation from text prompts has demonstrated remarkable
progress by utilizing 2D and 3D diffusion models. While 3D diffusion models
ensure great multi-view consistency, their ability to generate high-quality and
diverse 3D assets is hindered by the limited 3D data. In contrast, 2D diffusion
models find a distillation approach that achieves excellent generalization and
rich details without any 3D data. However, 2D lifting methods suffer from
inherent view-agnostic ambiguity thereby leading to serious multi-face Janus
issues, where text prompts fail to provide sufficient guidance to learn
coherent 3D results. Instead of retraining a costly viewpoint-aware model, we
study how to fully exploit easily accessible coarse 3D knowledge to enhance the
prompts and guide 2D lifting optimization for refinement. In this paper, we
propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity,
generalizability, and geometric consistency simultaneously. Specifically, we
design a pair of guiding strategies derived from the coarse 3D prior generated
by the 3D diffusion model: a structural guidance for geometric fidelity and a
semantic guidance for 3D coherence. Employing the two types of guidance, the 2D
diffusion model enriches the 3D content with diversified and high-quality
results. Extensive experiments show the superiority of our Sherpa3D over the
state-of-the-art text-to-3D methods in terms of quality and 3D consistency. |
This paper introduces Sherpa3D, a novel text-to-3D generation framework that leverages coarse 3D priors from 3D diffusion models to guide 2D diffusion models, achieving high-fidelity, diversified, and multi-view consistent 3D content. |
The paper addresses limitations in existing text-to-3D methods, which often struggle with either limited generalizability and quality (3D diffusion models) or multi-view inconsistency (2D lifting methods). Sherpa3D bridges this gap by combining the strengths of both approaches, offering a promising solution for efficient and high-quality 3D content creation. |
Sherpa3D employs a three-stage process: 1) It generates a coarse 3D prior using a 3D diffusion model. 2) It introduces structural and semantic guidance mechanisms derived from the 3D prior to guide the 2D lifting optimization. 3) It integrates the 3D guidance with a score distillation sampling (SDS) loss, using an annealing technique to balance the influence of 3D guidance and 2D refinement. This process enables Sherpa3D to produce detailed and consistent 3D objects from text prompts. |
Sherpa3D demonstrates superior performance over existing text-to-3D methods in both qualitative and quantitative evaluations. It generates high-fidelity 3D assets with compelling texture quality and multi-view consistency, outperforming baselines in terms of CLIP R-Precision and user-rated quality and consistency. The authors show that Sherpa3D is efficient, taking only 25 minutes to generate a 3D model from a text prompt. |
The authors acknowledge that the quality of Sherpa3D's output is inherently limited by the underlying 2D and 3D diffusion models used. Future work could explore leveraging larger, more advanced diffusion models (e.g., SDXL, DeepFloyd) to further enhance the generation quality. Additionally, the authors are interested in extending Sherpa3D's capabilities to more complex and creative tasks, such as text-to-4D generation. |
diffusion_model, 3d, text-to-3d, multi-view consistency, generative_model, score_distillation_sampling |
2308.09351 |
RLIPv2: Fast Scaling of Relational Language-Image Pre-training |
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao |
Relational Language-Image Pre-training (RLIP) aims to align vision
representations with relational texts, thereby advancing the capability of
relational reasoning in computer vision tasks. However, hindered by the slow
convergence of RLIPv1 architecture and the limited availability of existing
scene graph data, scaling RLIPv1 is challenging. In this paper, we propose
RLIPv2, a fast converging model that enables the scaling of relational
pre-training to large-scale pseudo-labelled scene graph data. To enable fast
scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism
that facilitates earlier and deeper gated cross-modal fusion with sparsified
language encoding layers. ALIF leads to comparable or better performance than
RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain
scene graph data at scale, we extend object detection datasets with free-form
relation labels by introducing a captioner (e.g., BLIP) and a designed Relation
Tagger. The Relation Tagger assigns BLIP-generated relation texts to region
pairs, thus enabling larger-scale relational pre-training. Through extensive
experiments conducted on Human-Object Interaction Detection and Scene Graph
Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under
fully-finetuning, few-shot and zero-shot settings. Notably, the largest RLIPv2
achieves 23.29mAP on HICO-DET without any fine-tuning, yields 32.22mAP with
just 1% data and yields 45.09mAP with 100% data. Code and models are publicly
available at https://github.com/JacobYuan7/RLIPv2. |
This paper introduces RLIPv2, an improved model for Relational Language-Image Pre-training (RLIP) that focuses on fast convergence and scalability by using large-scale pseudo-labelled scene graph data. |
This research is important because it addresses the limitations of RLIPv1, the previous iteration, which struggled with slow convergence and limited scene graph data. By enabling efficient scaling, RLIPv2 pushes the boundaries of relational reasoning in computer vision, achieving state-of-the-art results in tasks like HOI detection and scene graph generation. |
The authors introduce Asymmetric Language-Image Fusion (ALIF) for faster convergence, employing sparse language encoding and early fusion. To generate large-scale pseudo-labelled scene graph data, they combine object detection datasets with BLIP-generated captions and a Relation Tagger built on RLIPv2 itself. They conduct extensive experiments on datasets like HICO-DET, V-COCO, and Open Images v6, comparing various settings like zero-shot, few-shot, and fully-finetuned learning. |
RLIPv2 demonstrates superior performance across HOI detection and Scene Graph Generation benchmarks. Notably, it achieves state-of-the-art results on Open Images v6 for SGG and impressive zero-shot, few-shot, and fully-finetuned results on HICO-DET, demonstrating significant data efficiency and exceeding previous methods. For example, the largest RLIPv2 achieves 23.29mAP on HICO-DET without fine-tuning, 32.22mAP with 1% data, and 45.09mAP with full data. |
The authors acknowledge the reliance on external captioner quality as a limitation, where noisy captions can impact performance. Future work includes exploring advanced captioning techniques for higher-quality pseudo-labels and investigating methods to overcome challenges posed by complex scenes with multiple similar objects. |
vision-language, relational_pre-training, hoi_detection, scene_graph_generation, pseudo-labeling, cross-modal_fusion, zero-shot, few-shot, analysis
2405.04312 |
Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer |
Zhuoyi Yang, Heyang Jiang, Wenyi Hong, Jiayan Teng, Wendi Zheng, Yuxiao Dong, Ming Ding, Jie Tang |
Diffusion models have shown remarkable performance in image generation in
recent years. However, due to a quadratic increase in memory during generating
ultra-high-resolution images (e.g. 4096*4096), the resolution of generated
images is often limited to 1024*1024. In this work, we propose a unidirectional
block attention mechanism that can adaptively adjust the memory overhead during
the inference process and handle global dependencies. Building on this module,
we adopt the DiT structure for upsampling and develop an infinite
super-resolution model capable of upsampling images of various shapes and
resolutions. Comprehensive experiments show that our model achieves SOTA
performance in generating ultra-high-resolution images in both machine and
human evaluation. Compared to commonly used UNet structures, our model can save
more than 5x memory when generating 4096*4096 images. The project URL is
https://github.com/THUDM/Inf-DiT. |
This paper introduces Inf-DiT, a memory-efficient diffusion transformer model for upsampling images to ultra-high resolutions by leveraging a novel Unidirectional Block Attention (UniBA) mechanism to process images in smaller blocks, thereby significantly reducing memory requirements. |
This work addresses the critical limitation of existing diffusion models in generating ultra-high-resolution images due to quadratic memory scaling. Inf-DiT offers a solution by enabling the generation of images at resolutions exceeding 4096x4096 pixels, which was previously infeasible due to memory constraints, opening possibilities for various applications requiring high-fidelity visuals. |
The authors propose UniBA, which divides images into blocks and processes them sequentially in batches, minimizing the number of hidden states in memory at any given time. Inf-DiT incorporates this mechanism into a diffusion transformer architecture, utilizing techniques like global CLIP image embedding for semantic consistency and nearby LR cross-attention for local detail preservation. Trained on a dataset of high-resolution images and evaluated on benchmarks like HPDv2 and DIV2K, Inf-DiT demonstrates superior performance in image upsampling and super-resolution tasks. |
Inf-DiT achieves state-of-the-art performance on ultra-high resolution image generation (up to 4096x4096) as measured by FID and FIDcrop metrics, outperforming baselines like SDXL, MultiDiffusion, and DemoFusion. It also excels in classic super-resolution benchmarks on the DIV2K dataset, surpassing models like BSRGAN and StableSR. Human evaluations confirm Inf-DiT's superiority in detail authenticity, global coherence, and consistency with low-resolution inputs. Notably, it maintains a low memory footprint, approximately 5 times lower than SDXL when generating 4096x4096 images. |
The authors acknowledge limitations in iterative upsampling, where errors from earlier stages can propagate and be difficult to correct in later stages. Future work could explore techniques for error correction and improved handling of long-range dependencies during iterative upsampling. Additionally, investigating the application of UniBA to other diffusion-based tasks beyond image generation could be a promising direction. |
diffusion_model, transformer, super-resolution, image_generation, ultra-high-resolution, memory_efficient, uniba, inf-dit, clip |
2312.05491 |
Using Captum to Explain Generative Language Models |
Vivek Miglani, Aobo Yang, Aram H. Markosyan, Diego Garcia-Olano, Narine Kokhlikyan |
Captum is a comprehensive library for model explainability in PyTorch,
offering a range of methods from the interpretability literature to enhance
users' understanding of PyTorch models. In this paper, we introduce new
features in Captum that are specifically designed to analyze the behavior of
generative language models. We provide an overview of the available
functionalities and example applications of their potential for understanding
learned associations within generative language models. |
This paper introduces new features in Captum v0.7, a model explainability library for PyTorch, specifically designed to analyze the behavior of generative language models like GPT-3. |
This paper is important as it addresses the growing need for explainability in large language models (LLMs) by introducing new tools in Captum that enhance the understanding of these models, especially for critical applications. |
The authors introduce new functionalities in Captum, focusing on perturbation-based (Feature Ablation, LIME, Kernel SHAP, Shapley Value Sampling) and gradient-based (Saliency, Integrated Gradients) attribution methods. They provide APIs to define custom features, baselines, masking, and target selection for analyzing LLM behavior. |
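A hedged example of the perturbation-based workflow using Captum's LLM attribution utilities; class names follow the Captum v0.7 API as we understand it and should be checked against the library documentation, and the model choice and template text are arbitrary.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from captum.attr import FeatureAblation, LLMAttribution, TextTemplateInput

model = AutoModelForCausalLM.from_pretrained("gpt2")   # small example model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

# Wrap a perturbation-based attribution method so it operates on LLM text features.
llm_attr = LLMAttribution(FeatureAblation(model), tokenizer)

# Template features let us ablate semantically meaningful parts of the prompt.
inp = TextTemplateInput(
    "{} lives in {} and works as a {}. {} personal interests include",
    values=["Dave", "Palm Coast", "lawyer", "His"],
)
result = llm_attr.attribute(inp, target="playing golf and fishing.")
print(result.seq_attr)  # attribution of each template feature to the generated target
```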
The paper showcases the application of new Captum functionalities in understanding model associations, revealing potential biases by analyzing attribution scores for input features. Additionally, they demonstrate the evaluation of few-shot prompt effectiveness, highlighting an unexpected reduction in confidence for a sentiment prediction task. |
The authors acknowledge limitations in current attribution methods and highlight the need for automated feature and baseline selection. Future work involves incorporating other interpretability techniques, improving automation, and optimizing runtime performance for the open-source community. |
llm, analysis, interpretability, explainability, attribution, perturbation-based methods, gradient-based methods, open-source, captum |
2401.08573 |
Benchmarking the Robustness of Image Watermarks |
Bang An, Mucong Ding, Tahseen Rabbani, Aakriti Agrawal, Yuancheng Xu, Chenghao Deng, Sicheng Zhu, Abdirisak Mohamed, Yuxin Wen, Tom Goldstein, Furong Huang |
This paper investigates the weaknesses of image watermarking techniques. We
present WAVES (Watermark Analysis Via Enhanced Stress-testing), a novel
benchmark for assessing watermark robustness, overcoming the limitations of
current evaluation methods. WAVES integrates detection and identification tasks,
and establishes a standardized evaluation protocol comprised of a diverse range
of stress tests. The attacks in WAVES range from traditional image distortions
to advanced and novel variations of diffusive, and adversarial attacks. Our
evaluation examines two pivotal dimensions: the degree of image quality
degradation and the efficacy of watermark detection after attacks. We develop a
series of Performance vs. Quality 2D plots, varying over several prominent
image similarity metrics, which are then aggregated in a heuristically novel
manner to paint an overall picture of watermark robustness and attack potency.
Our comprehensive evaluation reveals previously undetected vulnerabilities of
several modern watermarking algorithms. We envision WAVES as a toolkit for the
future development of robust watermarking systems. The project is available at
https://wavesbench.github.io/ |
The paper introduces WAVES, a novel benchmark for evaluating the robustness of image watermarking techniques, specifically focusing on their resistance to various attacks that aim to remove or obscure watermarks. |
This paper is important because it addresses the lack of standardized evaluation methods for image watermarking techniques, especially in the context of emerging threats like diffusion purification and adversarial attacks. It proposes a comprehensive benchmark with diverse attacks, standardized metrics, and a focus on real-world scenarios, contributing to the development of more robust watermarking systems. |
The authors conduct their research by developing a standardized evaluation protocol called WAVES. WAVES evaluates watermarking algorithms on three datasets (DiffusionDB, MS-COCO, and DALL·E3) using a wide range of 26 attacks categorized into distortions, regenerations, and adversarial attacks. It measures watermark detection performance using TPR@0.1%FPR and assesses image quality degradation using a normalized and aggregated metric combining 8 individual image quality metrics. |
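For clarity, a small numpy sketch of the headline detection metric, TPR@0.1%FPR: the true-positive rate at the score threshold that keeps the false-positive rate on unwatermarked images at 0.1%. The synthetic scores are only for demonstration.
```python
import numpy as np

def tpr_at_fpr(scores_watermarked: np.ndarray,
               scores_clean: np.ndarray,
               fpr: float = 0.001) -> float:
    # Threshold = the (1 - fpr) quantile of detector scores on clean images.
    threshold = np.quantile(scores_clean, 1.0 - fpr)
    return float((scores_watermarked > threshold).mean())

# Usage with synthetic detector scores:
rng = np.random.default_rng(0)
print(tpr_at_fpr(rng.normal(3.0, 1.0, 10_000), rng.normal(0.0, 1.0, 100_000)))
```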
The evaluation reveals varying vulnerabilities among watermarking methods. Tree-Ring is particularly vulnerable to adversarial attacks, especially grey-box embedding attacks and surrogate detector attacks, which can significantly reduce detection performance while preserving image quality. Stable Signature is susceptible to various regeneration attacks, while StegaStamp demonstrates greater robustness overall. The paper also highlights the risk of using publicly available VAEs in watermarking systems, making them susceptible to attacks. |
The authors acknowledge limitations in testing only three watermarking algorithms, albeit carefully chosen representatives. They also point out that the attack ranking methodology depends on selected performance thresholds and image quality metrics, suggesting further exploration with alternative metrics and thresholds as future work. Additionally, the paper encourages the development of watermark-specific defensive strategies and highlights the need for in-processing watermarks to adopt augmentation or adversarial training for enhanced robustness. |
diffusion_model, watermark, analysis, adversarial_attack, benchmark, image_quality, robustness |
2404.02145 |
Iterated Learning Improves Compositionality in Large Vision-Language Models |
Chenhao Zheng, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna |
A fundamental characteristic common to both human vision and natural language
is their compositional nature. Yet, despite the performance gains contributed
by large vision and language pretraining, recent investigations find that
most-if not all-our state-of-the-art vision-language models struggle at
compositionality. They are unable to distinguish between images of " a girl in
white facing a man in black" and "a girl in black facing a man in white".
Moreover, prior work suggests that compositionality doesn't arise with scale:
larger model sizes or training data don't help. This paper develops a new
iterated training algorithm that incentivizes compositionality. We draw on
decades of cognitive science research that identifies cultural transmission-the
need to teach a new generation-as a necessary inductive prior that incentivizes
humans to develop compositional languages. Specifically, we reframe
vision-language contrastive learning as the Lewis Signaling Game between a
vision agent and a language agent, and operationalize cultural transmission by
iteratively resetting one of the agent's weights during training. After every
iteration, this training paradigm induces representations that become "easier
to learn", a property of compositional languages: e.g. our model trained on
CC3M and CC12M improves standard CLIP by 4.7% and 4.0%, respectively, on the
SugarCrepe benchmark. |
This paper proposes a novel iterated learning algorithm for vision-language models to improve their compositionality by drawing inspiration from the cultural transmission theory in cognitive science, where languages evolve to be more compositional over generations. |
Despite the advancement in large vision-language models, existing models struggle with compositional understanding, limiting their ability to generalize and reason about novel situations. This paper addresses this issue with a novel training paradigm inspired by human language development, potentially paving the way for more robust and interpretable vision-language models. |
The authors reframe vision-language contrastive learning as a Lewis Signaling Game between a vision agent and a language agent. They introduce a shared codebook as the basis for the representation of both agents, and periodically reset the language agent's weights, mimicking cultural transmission across generations. This forces the vision agent to learn representations that are easier to learn by new language agents, thus improving compositionality. |
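A schematic sketch of the iterated-learning loop under our simplifications: CLIP-style contrastive training in which the language agent is periodically re-initialized to mimic cultural transmission while the vision agent persists; the paper additionally shares a codebook between the agents, which is omitted here.
```python
import itertools
import torch

def iterated_learning(vision_agent, make_language_agent, dataloader,
                      contrastive_loss, generations: int, steps_per_gen: int,
                      lr: float = 1e-4):
    language_agent = make_language_agent()                 # first "learner"
    for _ in range(generations):
        opt = torch.optim.AdamW(
            list(vision_agent.parameters()) + list(language_agent.parameters()), lr=lr)
        # Interacting phase: ordinary CLIP-style contrastive training.
        for images, texts in itertools.islice(iter(dataloader), steps_per_gen):
            loss = contrastive_loss(vision_agent(images), language_agent(texts))
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Transmission phase: a fresh language agent must re-learn the vision
        # agent's representations, which pressures them to stay easy to learn,
        # a hallmark of compositional codes.
        language_agent = make_language_agent()
    return vision_agent, language_agent
```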
The proposed iterated learning algorithm demonstrably improves compositionality on several benchmarks, including SugarCrepe and CREPE, outperforming baseline models like standard CLIP and NegCLIP. Importantly, this improvement doesn't come at the cost of recognition capability, as shown by comparable performance on zero-shot image classification tasks. Further analysis suggests that iterated learning leads to smoother, easier-to-learn visual representations and a more interpretable codebook. |
The paper acknowledges the potential instability during training due to the randomness introduced by resetting agent weights. Future work could focus on stabilizing the learning process and exploring extensions to other domains beyond vision and language. |
clip, vision-language, compositionality, iterated_learning, contrastive_learning, lewis_signaling_game, analysis, interpretability
2312.02139 |
DiffiT: Diffusion Vision Transformers for Image Generation |
Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat |
Diffusion models with their powerful expressivity and high sample quality
have achieved State-Of-The-Art (SOTA) performance in the generative domain. The
pioneering Vision Transformer (ViT) has also demonstrated strong modeling
capabilities and scalability, especially for recognition tasks. In this paper,
we study the effectiveness of ViTs in diffusion-based generative learning and
propose a new model denoted as Diffusion Vision Transformers (DiffiT).
Specifically, we propose a methodology for finegrained control of the denoising
process and introduce the Time-dependant Multihead Self Attention (TMSA)
mechanism. DiffiT is surprisingly effective in generating high-fidelity images
with significantly better parameter efficiency. We also propose latent and
image space DiffiT models and show SOTA performance on a variety of
class-conditional and unconditional synthesis tasks at different resolutions.
The Latent DiffiT model achieves a new SOTA FID score of 1.73 on the ImageNet-256
dataset while having 19.85% and 16.88% fewer parameters than other
Transformer-based diffusion models such as MDT and DiT, respectively. Code:
https://github.com/NVlabs/DiffiT |
This paper introduces DiffiT, a novel Vision Transformer (ViT)-based diffusion model designed for efficient and high-quality image generation in both latent and image spaces. |
The paper addresses limitations in existing CNN-based and ViT-based diffusion models by introducing Time-dependent Multihead Self-Attention (TMSA), which significantly enhances parameter efficiency and enables fine-grained control over the denoising process for improved image fidelity and diversity. |
The authors propose a novel TMSA mechanism integrated into a U-shaped encoder-decoder architecture for image space generation and a purely ViT-based architecture for latent space generation. They train and evaluate DiffiT on diverse datasets, including ImageNet, FFHQ, and CIFAR10, and conduct thorough ablation studies to validate the effectiveness of TMSA and other architectural choices. |
DiffiT achieves state-of-the-art FID scores on ImageNet-256 with significantly fewer parameters compared to previous SOTA models like MDT and DiT. It also achieves competitive results on FFHQ-64 and CIFAR10, showcasing its ability to generate high-fidelity, diverse images across different datasets and resolutions. |
The paper acknowledges potential limitations in extending DiffiT to higher resolutions and exploring more complex image generation tasks. Future work could focus on optimizing the model for memory efficiency, leveraging larger datasets for training, and exploring applications in image editing, restoration, and text-to-image generation. |
diffusion_model, vit, image_generation, tmsa, self-attention, latent_space, image_space, fid, parameter_efficiency |
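As a rough illustration of the time-dependent self-attention idea behind TMSA, the sketch below adds a time-embedding-dependent term to the query/key/value projections so that attention behavior changes along the denoising trajectory. The tensor shapes, head count, and exact fusion rule are assumptions for illustration, not the released DiffiT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeDependentSelfAttention(nn.Module):
    """Self-attention whose queries/keys/values depend on both the spatial tokens
    and the diffusion time-step embedding (illustrative sketch)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.to_qkv_x = nn.Linear(dim, 3 * dim)   # token-dependent projection
        self.to_qkv_t = nn.Linear(dim, 3 * dim)   # time-dependent projection
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, t_emb):
        # x: (B, N, dim) spatial tokens; t_emb: (B, dim) time-step embedding
        B, N, D = x.shape
        qkv = self.to_qkv_x(x) + self.to_qkv_t(t_emb).unsqueeze(1)   # broadcast over tokens
        q, k, v = (t.view(B, N, self.num_heads, D // self.num_heads).transpose(1, 2)
                   for t in qkv.chunk(3, dim=-1))
        out = F.scaled_dot_product_attention(q, k, v)                # (B, heads, N, head_dim)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))

layer = TimeDependentSelfAttention(dim=256)
print(layer(torch.randn(2, 64, 256), torch.randn(2, 256)).shape)    # torch.Size([2, 64, 256])
```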
2404.08636 |
Probing the 3D Awareness of Visual Foundation Models |
Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, Varun Jampani |
Recent advances in large-scale pretraining have yielded visual foundation
models with strong capabilities. Not only can recent models generalize to
arbitrary images for their training task, but their intermediate representations
are also useful for other visual tasks such as detection and segmentation. Given
that such models can classify, delineate, and localize objects in 2D, we ask
whether they also represent their 3D structure. In this work, we analyze the 3D
awareness of visual foundation models. We posit that 3D awareness implies that
representations (1) encode the 3D structure of the scene and (2) consistently
represent the surface across views. We conduct a series of experiments using
task-specific probes and zero-shot inference procedures on frozen features. Our
experiments reveal several limitations of the current models. Our code and
analysis can be found at https://github.com/mbanani/probe3d. |
This paper investigates the 3D awareness of visual foundation models, examining how well these models represent the 3D structure of scenes and objects from single and multiple views. |
This paper is important because it addresses the lack of understanding regarding how well visual foundation models, despite being trained on 2D data, represent the 3D world. This understanding is crucial as these models are increasingly used as backbones for 3D vision tasks. |
The authors evaluate a range of visual foundation models, including those trained with classification, language supervision, self-supervision, and dense supervision, on their ability to estimate depth, surface normals, and 3D correspondence. They probe the frozen representations of these models using task-specific probes and zero-shot inference methods to assess the inherent 3D awareness of the learned features. |
The analysis revealed that self-supervised models perform best in capturing surface properties like depth and normals, followed by text-conditioned generative models. However, all models struggled with multiview consistency, particularly at large viewpoint changes, indicating they might be learning view-dependent rather than truly 3D-consistent representations. Semantic correspondence performance was found to be more correlated with single-view tasks than multiview tasks, suggesting it might not be a reliable measure of 3D consistency. |
The paper acknowledges limitations including the use of publicly available checkpoints trained on different datasets and with varying compute resources, potentially confounding the results. They suggest future work should focus on more controlled experiments to isolate the impact of training signals and explore a broader range of 3D understanding aspects beyond surface reconstruction and multiview consistency. |
3d, analysis, depth_estimation, surface_normal, correspondence, vision_transformer, diffusion_model, self_supervised_learning, vision_language_model |
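The probing setup described above can be pictured with the following simplified sketch: features from a frozen backbone feed a small trainable head that regresses per-patch depth, so any 3D information recovered must already be present in the frozen representation. The backbone stand-in, head architecture, and L1 loss are generic placeholders rather than the paper's exact probes.

```python
import torch
import torch.nn as nn

class DepthProbe(nn.Module):
    """Small trainable head on top of frozen patch features (sketch)."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, patch_feats):                   # (B, N, feat_dim)
        return self.head(patch_feats).squeeze(-1)     # (B, N) predicted depth per patch

backbone = nn.Linear(768, 768)                        # stand-in for a frozen foundation model
for p in backbone.parameters():
    p.requires_grad_(False)                           # probing: only the head is trained

probe = DepthProbe(768)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

feats = backbone(torch.randn(4, 196, 768))            # frozen features for 14x14 patches
target_depth = torch.rand(4, 196)
loss = nn.functional.l1_loss(probe(feats), target_depth)
loss.backward(); opt.step()
```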
2405.05806 |
MasterWeaver: Taming Editability and Identity for Personalized Text-to-Image Generation |
Yuxiang Wei, Zhilong Ji, Jinfeng Bai, Hongzhi Zhang, Lei Zhang, Wangmeng Zuo |
Text-to-image (T2I) diffusion models have shown significant success in
personalized text-to-image generation, which aims to generate novel images with
human identities indicated by the reference images. Although promising identity
fidelity has been achieved by several tuning-free methods, they usually suffer
from overfitting issues. The learned identity tends to entangle with irrelevant
information, resulting in unsatisfactory text controllability, especially on
faces. In this work, we present MasterWeaver, a test-time tuning-free method
designed to generate personalized images with both faithful identity fidelity
and flexible editability. Specifically, MasterWeaver adopts an encoder to
extract identity features and steers the image generation through additional
introduced cross attention. To improve editability while maintaining identity
fidelity, we propose an editing direction loss for training, which aligns the
editing directions of our MasterWeaver with those of the original T2I model.
Additionally, a face-augmented dataset is constructed to facilitate
disentangled identity learning, and further improve the editability. Extensive
experiments demonstrate that our MasterWeaver can not only generate
personalized images with faithful identity, but also exhibit superiority in
text controllability. Our code will be publicly available at
https://github.com/csyxwei/MasterWeaver. |
This paper introduces MasterWeaver, a novel method for personalized text-to-image generation that prioritizes both accurate identity representation and flexible image editing capabilities from a single reference image. |
This paper addresses the limitations of existing personalized text-to-image generation models, which often struggle to balance accurate identity preservation with flexible editing. MasterWeaver's ability to achieve both makes it a valuable tool for various applications, including personalized content creation. |
MasterWeaver leverages a pre-trained Stable Diffusion model and incorporates an identity mapping network to inject identity features into the image generation process. It introduces an editing direction loss to improve text controllability and utilizes a face-augmented dataset to disentangle identity features from attributes, enhancing editability. |
Experimental results demonstrate that MasterWeaver outperforms state-of-the-art methods in terms of identity fidelity, text alignment, and image quality. It produces high-quality personalized images with diverse attributes, clothing, backgrounds, and styles, even from a single reference image. |
The authors acknowledge limitations in generating images with multiple personalized identities and achieving precise control over fine-grained attributes. Future work will address these limitations and explore ethical considerations related to potential deepfake generation. |
diffusion_model, personalized_text-to_image_generation, identity_preservation, editability, face_editing, cross_attention |
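One way to picture the editing direction loss described above: for a source/target prompt pair, the change in the personalized model's noise prediction is encouraged to match the change in the frozen original model's prediction, so identity injection does not distort the base model's native editing behavior. The function signature and cosine form below are hedged assumptions; MasterWeaver's actual loss may differ in detail.

```python
import torch
import torch.nn.functional as F

def editing_direction_loss(eps_personalized_src, eps_personalized_tgt,
                           eps_original_src, eps_original_tgt):
    """Align the personalized model's editing direction with the original T2I model's (sketch).

    Each argument is a predicted-noise tensor for the same latent/timestep, differing only
    in whether the prompt is the source ("a photo of a person") or the edited target
    ("a photo of a smiling person"), and in which model produced it.
    """
    dir_personalized = (eps_personalized_tgt - eps_personalized_src).flatten(1)
    dir_original = (eps_original_tgt - eps_original_src).flatten(1)
    # 1 - cosine similarity: zero when the two editing directions are parallel.
    return (1 - F.cosine_similarity(dir_personalized, dir_original, dim=-1)).mean()

# Toy check with random tensors shaped like U-Net noise predictions.
e = [torch.randn(2, 4, 64, 64) for _ in range(4)]
print(editing_direction_loss(*e))
```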
2402.10208 |
Recovering the Pre-Fine-Tuning Weights of Generative Models |
Eliahu Horwitz, Jonathan Kahana, Yedid Hoshen |
The dominant paradigm in generative modeling consists of two steps: i)
pre-training on a large-scale but unsafe dataset, ii) aligning the pre-trained
model with human values via fine-tuning. This practice is considered safe, as
no current method can recover the unsafe, pre-fine-tuning model weights. In
this paper, we demonstrate that this assumption is often false. Concretely, we
present Spectral DeTuning, a method that can recover the weights of the
pre-fine-tuning model using a few low-rank (LoRA) fine-tuned models. In
contrast to previous attacks that attempt to recover pre-fine-tuning
capabilities, our method aims to recover the exact pre-fine-tuning weights. Our
approach exploits this new vulnerability against large-scale models such as a
personalized Stable Diffusion and an aligned Mistral. |
This paper introduces the task of "Pre-Fine-Tuning Weight Recovery", a novel attack vector targeting fine-tuned models. It presents "Spectral DeTuning", an effective method for recovering the original weights of a pre-trained model using multiple LoRA fine-tuned versions. |
This paper highlights a critical vulnerability in the current paradigm of model fine-tuning, particularly relevant due to the increasing popularity of LoRA and multi-flavored foundational models. It demonstrates that widely used models like Mistral and Stable Diffusion are susceptible to this attack, potentially compromising safety and alignment efforts. |
The authors propose "Spectral DeTuning", an iterative, gradient-free algorithm leveraging low-rank matrix factorization to recover the pre-fine-tuning weights. They introduce a rank scheduler for enhanced optimization stability and faster convergence. They evaluate their method on a newly introduced benchmark "LoWRA Bench", comprising diverse models like ViT, Stable Diffusion, and Mistral, fine-tuned for various tasks. |
Spectral DeTuning successfully recovers pre-fine-tuning weights with high precision across different models and tasks. It outperforms baseline methods, achieving near-perfect semantic convergence for ViT and effectively reversing personalization in Stable Diffusion and alignment in Mistral, as demonstrated by semantic evaluation metrics. The rank scheduler significantly improves convergence speed and accuracy. |
The authors acknowledge limitations like the requirement of multiple LoRA models with a known, constant rank and the assumption of their public availability. Future work includes exploring attacks on models with varying LoRA ranks, extending the attack to other fine-tuning methods, and, most importantly, developing defenses against pre-fine-tuning weight recovery attacks. |
diffusion_model, llm, analysis, adversarial_attack, interpretability, lora, fine-tuning, model_security, weight_recovery |
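The alternating scheme at the heart of Spectral DeTuning can be sketched as follows: given several fine-tuned matrices W_i = W* + B_i A_i that share an unknown base W*, alternate between taking a truncated SVD of each residual and averaging out the low-rank parts. This is a simplified reconstruction without the paper's rank scheduler; the synthetic sanity check is ours.

```python
import torch

def spectral_detuning(finetuned_weights, rank, iters=100):
    """Estimate the shared pre-fine-tuning matrix W* from several LoRA-fine-tuned
    copies W_i = W* + B_i A_i (sketch of the alternating minimization)."""
    W_star = torch.stack(finetuned_weights).mean(0)  # initial guess
    for _ in range(iters):
        low_rank_parts = []
        for W_i in finetuned_weights:
            # Best rank-`rank` approximation of the residual via truncated SVD.
            U, S, Vh = torch.linalg.svd(W_i - W_star, full_matrices=False)
            low_rank_parts.append(U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank])
        # With the low-rank parts fixed, the least-squares W* is the mean residual.
        W_star = torch.stack(
            [W_i - M_i for W_i, M_i in zip(finetuned_weights, low_rank_parts)]
        ).mean(0)
    return W_star

# Synthetic sanity check: 5 LoRA-style variants of a common base matrix.
base = torch.randn(64, 64)
variants = [base + torch.randn(64, 4) @ torch.randn(4, 64) * 0.1 for _ in range(5)]
estimate = spectral_detuning(variants, rank=4)
print((estimate - base).norm() / base.norm())  # relative error; typically well below naive averaging
```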
2403.14599 |
MyVLM: Personalizing VLMs for User-Specific Queries |
Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or |
Recent large-scale vision-language models (VLMs) have demonstrated remarkable
capabilities in understanding and generating textual descriptions for visual
content. However, these models lack an understanding of user-specific concepts.
In this work, we take a first step toward the personalization of VLMs, enabling
them to learn and reason over user-provided concepts. For example, we explore
whether these models can learn to recognize you in an image and communicate
what you are doing, tailoring the model to reflect your personal experiences
and relationships. To effectively recognize a variety of user-specific
concepts, we augment the VLM with external concept heads that function as
toggles for the model, enabling the VLM to identify the presence of specific
target concepts in a given image. Having recognized the concept, we learn a new
concept embedding in the intermediate feature space of the VLM. This embedding
is tasked with guiding the language model to naturally integrate the target
concept in its generated response. We apply our technique to BLIP-2 and LLaVA
for personalized image captioning and further show its applicability for
personalized visual question-answering. Our experiments demonstrate our ability
to generalize to unseen images of learned concepts while preserving the model
behavior on unrelated inputs. |
This paper introduces the concept of personalizing vision-language models (VLMs), enabling them to understand and reason about user-specific concepts, such as unique objects and individuals, with a focus on personalized image captioning and visual question answering. |
This paper is important because it addresses the limitation of current VLMs in understanding user-specific concepts and proposes a method for personalization, opening up new opportunities for more meaningful and personalized human-computer interaction. |
The authors propose MyVLM, a method that augments frozen VLMs (BLIP-2 and LLaVA) with concept heads to recognize user-specific concepts in images. It then learns a concept embedding in the VLM's feature space to guide the language model in incorporating the concept into generated responses, requiring only a few training images. |
MyVLM successfully generates personalized captions and answers questions about user-specific objects and individuals in new images, generalizing to unseen contexts. It outperforms several handcrafted baselines, showing improved recall and text similarity, even with few training samples. |
Limitations include biases inherent in VLMs, reliance on concept head quality, and potential context leakage during training. Future work includes mitigating these limitations, exploring additional regularization and augmentation techniques, and expanding to new personalized applications. |
diffusion_model, llm, analysis, personalization, image_captioning, visual_question_answering, referring_expression_comprehension |
2405.07987 |
The Platonic Representation Hypothesis |
Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola |
We argue that representations in AI models, particularly deep networks, are
converging. First, we survey many examples of convergence in the literature:
over time and across multiple domains, the ways by which different neural
networks represent data are becoming more aligned. Next, we demonstrate
convergence across data modalities: as vision models and language models get
larger, they measure distance between datapoints in a more and more alike way.
We hypothesize that this convergence is driving toward a shared statistical
model of reality, akin to Plato's concept of an ideal reality. We term such a
representation the platonic representation and discuss several possible
selective pressures toward it. Finally, we discuss the implications of these
trends, their limitations, and counterexamples to our analysis. |
This paper proposes the Platonic Representation Hypothesis, which posits that neural networks, trained across various architectures and modalities, are converging towards a shared statistical representation of reality. |
This hypothesis is significant because it suggests that scaling data and model size could be sufficient for achieving highly generalizable AI systems capable of performing well across a wide range of tasks. It also offers insights into the potential for cross-modal synergy and a deeper understanding of how AI models represent the world. |
The authors provide evidence for their hypothesis by analyzing existing literature on representational similarity and conducting experiments measuring the alignment of vision and language models. They use techniques like model stitching, nearest-neighbor analysis, and compare representations across models trained on different datasets and with different objectives. |
Key findings include: (1) Models with higher performance on a variety of tasks exhibit greater representational alignment, suggesting convergence towards a common solution as competence increases. (2) Alignment is observed even across modalities, with larger language models exhibiting greater alignment with vision models. (3) Alignment with vision representations is correlated with better performance on language-based reasoning tasks, indicating the practical benefits of such convergence. |
The authors acknowledge limitations such as difficulty in measuring alignment and the possibility of modality-specific information hindering complete convergence. They suggest further research is needed to understand the precise representation being converged to, the role of non-bijective modalities, and the implications for special-purpose AI. |
representation, convergence, multimodality, vision, language, scaling, analysis, platonic_representation |
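In the spirit of the nearest-neighbor analyses mentioned above, a simple way to quantify cross-model alignment is a mutual k-nearest-neighbor score: for each datapoint, compare the neighbor sets induced by two different models' embeddings. The sketch below is a generic version of such a metric, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Average overlap of k-NN sets computed in two different representation spaces (sketch)."""
    def knn_indices(feats):
        feats = F.normalize(feats, dim=-1)
        sims = feats @ feats.t()
        sims.fill_diagonal_(float("-inf"))         # exclude self-matches
        return sims.topk(k, dim=-1).indices        # (N, k)

    nn_a, nn_b = knn_indices(feats_a), knn_indices(feats_b)
    overlap = torch.tensor([
        len(set(a.tolist()) & set(b.tolist())) / k for a, b in zip(nn_a, nn_b)
    ])
    return overlap.mean().item()

# Example: embeddings of the same 500 items from a "vision" and a "language" model.
vision_feats = torch.randn(500, 768)
language_feats = vision_feats @ torch.randn(768, 512)  # correlated by construction
print(mutual_knn_alignment(vision_feats, language_feats))
```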
2404.13040 |
Analysis of Classifier-Free Guidance Weight Schedulers |
Xi Wang, Nicolas Dufour, Nefeli Andreou, Marie-Paule Cani, Victoria Fernandez Abrevaya, David Picard, Vicky Kalogeiton |
Classifier-Free Guidance (CFG) enhances the quality and condition adherence
of text-to-image diffusion models. It operates by combining the conditional and
unconditional predictions using a fixed weight. However, recent works vary the
weights throughout the diffusion process, reporting superior results but
without providing any rationale or analysis. By conducting comprehensive
experiments, this paper provides insights into CFG weight schedulers. Our
findings suggest that simple, monotonically increasing weight schedulers
consistently lead to improved performances, requiring merely a single line of
code. In addition, more complex parametrized schedulers can be optimized for
further improvement, but do not generalize across different models and tasks. |
This paper investigates the use of dynamic weight schedulers in Classifier-Free Guidance (CFG) for diffusion models, showing that these schedulers can improve image fidelity, diversity, and textual adherence compared to static CFG. |
This paper is important because it provides a comprehensive analysis of dynamic guidance weight schedulers in CFG, which is a widely used technique for conditional diffusion models. The findings provide practical guidance for practitioners to improve the performance of their diffusion models with simple modifications. |
The authors conducted experiments on various tasks, including class-conditioned image generation and text-to-image generation, using datasets like CIFAR-10, ImageNet, and LAION. They evaluated different heuristic and parameterized dynamic schedulers, comparing their performance against static CFG using metrics like FID, Inception Score, CLIP-Score, and diversity measures. They also performed a user study to assess the perceptual quality of generated images. |
Key findings include: (1) monotonically increasing weight schedulers (e.g., linear and cosine) consistently improve performance over static CFG; (2) a simple linear scheduler significantly enhances results without additional computational cost or parameter tuning; (3) parameterized schedulers can further improve performance but require tuning for each model and task. |
The authors acknowledge that the optimal parameters for parameterized schedulers do not generalize across different models and tasks. Future work could focus on developing more adaptable and robust parameterized schedulers. Another direction is to investigate the theoretical underpinnings of why dynamic schedulers work better than static CFG, leading to more principled design of these schedulers. |
diffusion_model, cfg, analysis, image_generation, text-to-image, fid, inception_score, clip-score, diversity |
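The "single line of code" takeaway amounts to making the guidance weight a function of the sampling step instead of a constant. A hedged sketch follows; the exact parameterization of the increasing schedule (linear, cosine, maximum weight) varies across the configurations studied.

```python
import torch

def cfg_step(eps_uncond, eps_cond, step, num_steps, w_max=7.5):
    """Classifier-free guidance with a linearly increasing weight over the sampling trajectory.

    `step` runs from 0 (pure noise) to num_steps - 1 (nearly clean); a static scheduler
    would simply use w = w_max at every step.
    """
    w = w_max * (step + 1) / num_steps          # the "one line": guidance grows linearly
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy usage inside a sampling loop (placeholders for the two U-Net passes).
num_steps = 50
for step in range(num_steps):
    eps_uncond = torch.randn(1, 4, 64, 64)
    eps_cond = torch.randn(1, 4, 64, 64)
    eps = cfg_step(eps_uncond, eps_cond, step, num_steps)
```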
2405.05538 |
A Survey on Personalized Content Synthesis with Diffusion Models |
Xulu Zhang, Xiao-Yong Wei, Wengyu Zhang, Jinlin Wu, Zhaoxiang Zhang, Zhen Lei, Qing Li |
Recent advancements in generative models have significantly impacted content
creation, leading to the emergence of Personalized Content Synthesis (PCS).
With a small set of user-provided examples, PCS aims to customize the subject
of interest to specific user-defined prompts. Over the past two years, more
than 150 methods have been proposed. However, existing surveys mainly focus on
text-to-image generation, with few providing up-to-date summaries on PCS. This
paper offers a comprehensive survey of PCS, with a particular focus on the
diffusion models. Specifically, we introduce the generic frameworks of PCS
research, which can be broadly classified into optimization-based and
learning-based approaches. We further categorize and analyze these
methodologies, discussing their strengths, limitations, and key techniques.
Additionally, we delve into specialized tasks within the field, such as
personalized object generation, face synthesis, and style personalization,
highlighting their unique challenges and innovations. Despite encouraging
progress, we also present an analysis of the challenges such as overfitting and
the trade-off between subject fidelity and text alignment. Through this
detailed overview and analysis, we propose future directions to advance the
development of PCS. |
This paper presents a comprehensive survey of Personalized Content Synthesis (PCS) with diffusion models, focusing on techniques that enable the generation of customized images based on user-provided references and text prompts. |
This survey is important due to the rapid growth and significance of PCS in various applications, including content creation, digital marketing, and virtual reality. It provides a timely and comprehensive overview of this evolving field, analyzing different frameworks, specialized tasks, and future challenges. |
The paper categorizes PCS approaches into optimization-based and learning-based methods, analyzing their strengths and limitations. It reviews specialized tasks like object, style, and face personalization, highlighting key techniques like attention manipulation and mask-guided generation. |
The survey reveals significant progress in PCS, with methods achieving impressive results in generating personalized content. It identifies key techniques like attention-based operations, mask-guided generation, data augmentation, and regularization as crucial for improving PCS performance. The paper also provides a comparative analysis of different PCS methods and their performance on benchmark datasets. |
The paper identifies key challenges in PCS, including overfitting to limited references, balancing subject fidelity with text alignment, and the lack of standardized evaluation metrics and datasets. It suggests future research directions, such as exploring new architectures, training methodologies, and robust evaluation techniques to address these limitations. |
diffusion_model, personalized_content_synthesis, image_generation, optimization, learning_based, attention_mechanism, mask-guided, data_augmentation, regularization, object_generation, face_synthesis, style_personalization, video, 3d |
2308.08428 |
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption |
Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu |
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the
performance of various vision-language tasks by scaling up the dataset with
image-text pairs collected from the web. However, the presence of intrinsic
noise and unmatched image-text pairs in web data can potentially affect the
performance of representation learning. To address this issue, we first utilize
the OFA model to generate synthetic captions that focus on the image content.
The generated captions contain complementary information that is beneficial for
pre-training. Then, we propose an Adaptive Language-Image Pre-training (ALIP),
a bi-path model that integrates supervision from both raw text and synthetic
caption. As the core components of ALIP, the Language Consistency Gate (LCG)
and Description Consistency Gate (DCG) dynamically adjust the weights of
samples and image-text/caption pairs during the training process. Meanwhile,
the adaptive contrastive loss can effectively reduce the impact of noisy data
and enhance the efficiency of pre-training data. We validate ALIP with
experiments on different scales of models and pre-training datasets.
Experimental results show that ALIP achieves state-of-the-art performance on
multiple downstream tasks including zero-shot image-text retrieval and linear
probe. To facilitate future research, the code and pre-trained models are
released at https://github.com/deepglint/ALIP. |
This paper presents ALIP (Adaptive Language-Image Pre-training), a novel method for improving vision-language pre-training by addressing the issue of noisy and mismatched image-text pairs in large web-crawled datasets. |
This paper is important because it tackles the critical challenge of data noise in large-scale vision-language pre-training, which can negatively impact model performance. ALIP offers a computationally efficient alternative to existing filtering or momentum-based methods by leveraging synthetic captions and a novel adaptive contrastive loss. |
The authors propose a bi-path model that leverages both raw text and synthetic captions generated by the OFA model. They introduce two key components: the Language Consistency Gate (LCG), which weighs samples based on the consistency between raw and synthetic captions, and the Description Consistency Gate (DCG), which weighs image-text pairs based on their alignment. These weights are then integrated into an adaptive contrastive loss function to guide training. |
ALIP achieves state-of-the-art performance on zero-shot image-text retrieval tasks, demonstrating significant improvements over previous methods. It also shows competitive results on linear probe evaluations, indicating its strong representation learning capabilities. However, it lags behind state-of-the-art in zero-shot classification tasks, suggesting that the coarse granularity of the synthetic captions might limit performance in fine-grained tasks. |
The authors acknowledge limitations in the granularity of synthetic captions, which might hinder performance on tasks requiring fine-grained understanding. Future work includes exploring higher-quality caption generation models and investigating techniques to incorporate hierarchical semantic information into ALIP. |
diffusion_model, llm, analysis, image-text retrieval, contrastive_learning, pre-training, noise_alleviation |
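A rough sketch of the gating idea in ALIP: per-sample weights derived from how consistent the raw web text is with the synthetic caption (and with the image) rescale the contributions to a contrastive loss, so noisy pairs are down-weighted. The particular similarity measures and normalization below are our assumptions for illustration, not ALIP's exact gate definitions.

```python
import torch
import torch.nn.functional as F

def weighted_clip_loss(img_emb, raw_txt_emb, syn_txt_emb, temperature=0.07):
    """Contrastive loss where each sample is re-weighted by caption consistency (sketch)."""
    img_emb, raw_txt_emb, syn_txt_emb = (
        F.normalize(e, dim=-1) for e in (img_emb, raw_txt_emb, syn_txt_emb)
    )
    # Language-consistency weight: agreement between raw text and synthetic caption.
    w_lang = (raw_txt_emb * syn_txt_emb).sum(-1).clamp(min=0)          # (B,)
    # Description-consistency weight: agreement between image and raw text.
    w_desc = (img_emb * raw_txt_emb).sum(-1).clamp(min=0)              # (B,)
    weights = w_lang * w_desc
    weights = weights / (weights.mean() + 1e-8)                        # keep the loss scale stable

    logits = img_emb @ raw_txt_emb.t() / temperature
    targets = torch.arange(len(logits))
    per_sample = (F.cross_entropy(logits, targets, reduction="none")
                  + F.cross_entropy(logits.t(), targets, reduction="none")) / 2
    return (weights * per_sample).mean()

loss = weighted_clip_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```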
2311.05020 |
First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models |
Naomi Saphra, Eve Fleisig, Kyunghyun Cho, Adam Lopez |
Many NLP researchers are experiencing an existential crisis triggered by the
astonishing success of ChatGPT and other systems based on large language models
(LLMs). After such a disruptive change to our understanding of the field, what
is left to do? Taking a historical lens, we look for guidance from the first
era of LLMs, which began in 2005 with large $n$-gram models for machine
translation (MT). We identify durable lessons from the first era, and more
importantly, we identify evergreen problems where NLP researchers can continue
to make meaningful contributions in areas where LLMs are ascendant. We argue
that disparities in scale are transient and researchers can work to reduce
them; that data, rather than hardware, is still a bottleneck for many
applications; that meaningful realistic evaluation is still an open problem;
and that there is still room for speculative approaches. |
This paper examines the "scale crisis" in NLP research, where the dominance of large language models (LLMs) trained on massive datasets challenges the relevance of research from smaller groups. By reflecting on the history of statistical machine translation (SMT) and its own era of LLMs, the authors argue that the current crisis is transient and propose research directions for meaningful contributions even in the age of massive models. |
The paper addresses the widespread anxiety among NLP researchers about the impact of LLMs on the field. It provides historical context and practical guidance for navigating the challenges and opportunities presented by the current research landscape. |
The authors analyze the trajectory of SMT, particularly the rise and fall of large n-gram models, drawing parallels to the current era of LLMs. They use this historical analysis to identify durable lessons and evergreen research problems relevant to the present situation. |
The paper highlights that scale disparities are often temporary, as demonstrated by the eventual accessibility of large-scale SMT systems in the past. It argues that data remains a significant bottleneck, especially for low-resource languages, and emphasizes the crucial need for improved evaluation metrics that accurately capture model performance beyond simple benchmarks. |
The paper acknowledges its limitations in predicting the future of NLP research. It suggests future work should focus on improving evaluation metrics, developing algorithms for future hardware, exploring new paradigms that might supersede current LLMs, and addressing ethical considerations related to data bias and human evaluation. |
llm, analysis, literature_review, machine_translation, evaluation, data_scarcity, hardware, future_work |
2312.02142 |
Object Recognition as Next Token Prediction |
Kaiyu Yue, Bor-Chun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, Ser-Nam Lim |
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the
text tokens from image embeddings to form labels. To ground this prediction
process in auto-regression, we customize a non-causal attention mask for the
decoder, incorporating two key features: modeling tokens from different labels
to be independent, and treating image tokens as a prefix. This masking
mechanism inspires an efficient method - one-shot sampling - to simultaneously
sample tokens of multiple labels in parallel and rank generated labels by their
probabilities during inference. To further enhance the efficiency, we propose a
simple strategy to construct a compact decoder by discarding the
intermediate blocks of a pretrained language model. This approach yields a
decoder that matches the full model's performance while being notably more
efficient. The code is available at https://github.com/kaiyuyue/nxtp |
This paper presents a novel approach to object recognition by framing it as a next token prediction problem, utilizing a language decoder to auto-regressively predict object labels from image embeddings. |
The paper is significant because it offers an open-vocabulary object recognition method that eliminates the need for predefined object labels or descriptions, unlike traditional linear classifiers and contrastive frameworks. It proposes an efficient and innovative one-shot sampling method for parallel label generation and introduces a compact decoder for enhanced efficiency. |
The authors employ a pretrained CLIP image encoder to generate image embeddings and a truncated language decoder (derived from LLaMA) to predict labels auto-regressively. They introduce a non-causal attention mask to decouple tokens from different labels and treat image tokens as a prefix. The one-shot sampling method enables parallel label token generation, while the compact decoder enhances efficiency. The method is trained on large-scale image-caption pairs and evaluated using a semantic similarity-based metric. |
Key findings include the effectiveness of one-shot sampling for generating diverse labels in parallel, outperforming traditional greedy and beam search methods. The truncated language decoder achieves comparable performance to the full model while being significantly faster. The method surpasses existing open-vocabulary recognition approaches in recall and achieves competitive performance in precision, demonstrating its ability to generate highly relevant labels. |
The authors acknowledge limitations in training data quality and evaluation metrics. They suggest future work exploring methods to train models with fewer labels, refining the label definition, developing better evaluation metrics, and adapting the approach for fine-grained recognition tasks. |
object_recognition, next_token_prediction, language_decoder, auto-regressive, open_vocabulary, one-shot_sampling, truncated_language_model, llama, clip, semantic_similarity, efficiency |
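The masking pattern described above can be sketched directly: image tokens form a fully visible prefix, tokens within one label attend causally to that label and to the prefix, and tokens from different labels never attend to one another. Token counts here are arbitrary; this illustrates only the pattern, not the released implementation.

```python
import torch

def build_mask(num_image_tokens, label_lengths):
    """Boolean attention mask (True = may attend) for prefix image tokens plus
    several mutually independent label segments (sketch)."""
    total = num_image_tokens + sum(label_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Image tokens attend to each other bidirectionally (prefix).
    mask[:num_image_tokens, :num_image_tokens] = True

    start = num_image_tokens
    for L in label_lengths:
        seg = slice(start, start + L)
        mask[seg, :num_image_tokens] = True                               # labels see the image prefix
        mask[seg, seg] = torch.tril(torch.ones(L, L, dtype=torch.bool))   # causal within one label
        start += L                                                        # no cross-label attention
    return mask

print(build_mask(num_image_tokens=2, label_lengths=[3, 2]).int())
```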
2404.19752 |
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation |
Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui |
Existing automatic captioning methods for visual content face challenges such
as lack of detail, content hallucination, and poor instruction following. In
this work, we propose VisualFactChecker (VFC), a flexible training-free
pipeline that generates high-fidelity and detailed captions for both 2D images
and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text
captioning models propose multiple initial captions; 2) verification, where a
large language model (LLM) utilizes tools such as object detection and VQA
models to fact-check proposed captions; 3) captioning, where an LLM generates
the final caption by summarizing caption proposals and the fact check
verification results. In this step, VFC can flexibly generate captions in
various styles following complex instructions. We conduct comprehensive
captioning evaluations using four metrics: 1) CLIP-Score for image-text
similarity; 2) CLIP-Image-Score for measuring the image-image similarity
between the original and the reconstructed image generated by a text-to-image
model using the caption; 3) a human study on Amazon Mechanical Turk; and 4) GPT-4V
for fine-grained evaluation. Evaluation results show that VFC outperforms
state-of-the-art open-sourced captioning methods for 2D images on the COCO
dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by
combining open-source models into a pipeline, we can attain captioning
capability comparable to proprietary models such as GPT-4V, despite being over
10x smaller in model size. |
This paper introduces VisualFactChecker (VFC), a training-free pipeline that leverages large language models (LLMs) and existing computer vision models to generate detailed and factually grounded captions for both 2D images and 3D objects. |
This work addresses limitations in current open-source image captioning models, which often produce captions that are too concise or contain hallucinations (i.e., descriptions of elements not present in the image). VFC utilizes a unique approach of fact-checking generated captions using object detection and visual question answering (VQA), resulting in higher fidelity and accuracy compared to existing open-sourced models. |
VFC operates through a three-step process: 1) **Proposal:** Multiple image captioning models generate initial captions. 2) **Verification:** An LLM uses object detection or VQA models to verify elements described in the captions. 3) **Captioning:** The LLM summarizes the initial captions and verification results to produce a final, factually grounded caption. The authors evaluate VFC on COCO (2D images) and Objaverse (3D objects) datasets using CLIP-Score, a novel CLIP-Image-Score, human evaluation via AMT, and GPT-4V for fine-grained analysis. |
VFC outperforms state-of-the-art open-source captioning methods in both 2D and 3D captioning tasks. Notably, it achieves performance comparable to proprietary models like GPT-4V despite being significantly smaller. The novel CLIP-Image-Score, introduced in this work, demonstrates effectiveness in detecting hallucinations by comparing original images with those reconstructed from generated captions. |
The authors acknowledge that the current implementation of VFC could be more automated in deciding which components to utilize for specific scenarios. Future work aims to address this limitation and explore the inclusion of additional components for fact-checking to further improve caption accuracy and detail. |
diffusion_model, llm, captioning, 2d, 3d, hallucination, vqa, object_detection, analysis |
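A hedged sketch of the CLIP-Image-Score idea described above: regenerate an image from the candidate caption with a text-to-image model, then compare CLIP image embeddings of the original and reconstructed images; hallucinated details tend to lower the similarity. The checkpoints named below are common open models chosen for illustration, not necessarily the ones used in the paper.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def clip_image_score(original: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP embeddings of the original image and an image
    reconstructed from the caption (sketch of the CLIP-Image-Score idea)."""
    reconstructed = t2i(caption).images[0]
    inputs = proc(images=[original, reconstructed], return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()
```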
2308.16512 |
MVDream: Multi-view Diffusion for 3D Generation |
Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, Xiao Yang |
We introduce MVDream, a diffusion model that is able to generate consistent
multi-view images from a given text prompt. Learning from both 2D and 3D data,
a multi-view diffusion model can achieve the generalizability of 2D diffusion
models and the consistency of 3D renderings. We demonstrate that such a
multi-view diffusion model is implicitly a generalizable 3D prior agnostic to
3D representations. It can be applied to 3D generation via Score Distillation
Sampling, significantly enhancing the consistency and stability of existing
2D-lifting methods. It can also learn new concepts from a few 2D examples, akin
to DreamBooth, but for 3D generation. |
This paper introduces MVDream, a multi-view diffusion model that addresses the multi-view consistency issues in text-to-3D generation by leveraging large-scale 2D image datasets and 3D data for improved generalizability and consistency in generated 3D models. |
This work is important because it presents a novel approach to address the long-standing challenge of multi-view consistency in text-to-3D generation, which is crucial for creating high-quality and realistic 3D content. MVDream's ability to leverage pre-trained 2D diffusion models and adapt them for multi-view consistency opens new avenues for efficient and robust 3D content creation. |
The authors propose a multi-view diffusion model that incorporates 3D self-attention and camera embeddings into a pre-trained 2D diffusion model. They train this model on a combination of 3D rendered data and a large-scale text-to-image dataset. For 3D generation, they employ score distillation sampling (SDS), utilizing their multi-view diffusion model as a prior. They further introduce a multi-view DreamBooth technique for personalized 3D generation. |
MVDream demonstrates superior multi-view consistency and overall quality in generated 3D models compared to existing state-of-the-art methods. Notably, it mitigates the Janus problem (multi-face issue) commonly observed in other approaches. User studies confirm the improved robustness and quality of MVDream's generated 3D assets. Furthermore, the model exhibits good generalization ability, effectively generating 3D content from unseen prompts and in diverse styles. |
The authors acknowledge limitations such as the current model's lower resolution compared to some existing models and the potential for bias inherited from the base Stable Diffusion model. They suggest addressing these limitations by increasing the dataset size, incorporating larger base diffusion models (e.g., SDXL), and utilizing more diverse and realistic 3D rendering datasets. Future work may explore extensions for handling a larger number of non-orthogonal camera views, improving the generalizability further. |
diffusion_model, 3d, text-to-3d, multi-view, consistency, dreambooth, score distillation sampling, nerf, generative_model |
2404.01231 |
Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models |
Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, Nicholas Carlini |
It is commonplace to produce application-specific models by fine-tuning large
pre-trained models using a small bespoke dataset. The widespread availability
of foundation model checkpoints on the web poses considerable risks, including
the vulnerability to backdoor attacks. In this paper, we unveil a new
vulnerability: the privacy backdoor attack. This black-box privacy attack aims
to amplify the privacy leakage that arises when fine-tuning a model: when a
victim fine-tunes a backdoored model, their training data will be leaked at a
significantly higher rate than if they had fine-tuned a typical model. We
conduct extensive experiments on various datasets and models, including both
vision-language models (CLIP) and large language models, demonstrating the
broad applicability and effectiveness of such an attack. Additionally, we carry
out multiple ablation studies with different fine-tuning methods and inference
strategies to thoroughly analyze this new threat. Our findings highlight a
critical privacy concern within the machine learning community and call for a
reevaluation of safety protocols in the use of open-source pre-trained models. |
This paper introduces a new privacy backdoor attack that amplifies membership inference attacks by poisoning pre-trained models, making it easier to extract information about the training data used for fine-tuning. |
This paper is important as it highlights a significant privacy vulnerability in the current machine learning paradigm of using open-source pre-trained models. It demonstrates that an adversary can poison these models to leak private information from the fine-tuning datasets, raising serious concerns about data security and the trustworthiness of pre-trained models. |
The authors poison the model weights to either maximize or minimize the loss on target data points during pre-training. This creates an anomaly in the loss, making it easier to distinguish data used in fine-tuning. They test their attack on various models, including CLIP for vision tasks and GPT-Neo and ClinicalBERT for language tasks, using different datasets and evaluating the effectiveness under different fine-tuning methods and inference strategies. |
The attack significantly improves the success rate of membership inference attacks, increasing the true positive rate while maintaining a low false positive rate. The attack is effective across different models, fine-tuning methods, and inference strategies, highlighting its robustness and broad applicability. Interestingly, the attack also amplifies privacy leakage for non-target data points from the same distribution. The paper also finds that larger models are more vulnerable to this attack. |
The paper acknowledges limitations regarding the attack's sensitivity to the number of fine-tuning steps and the trade-off between model stealthiness and attack performance. Future work includes exploring more advanced poisoning techniques and defenses against this attack, such as robust fine-tuning methods and more rigorous validation of pre-trained models. |
privacy, backdoor_attack, membership_inference, poisoning, pre-trained_model, fine-tuning, clip, llm, analysis |
2311.12229 |
NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation |
Shachar Rosenman, Vasudev Lal, Phillip Howard |
Despite impressive recent advances in text-to-image diffusion models,
obtaining high-quality images often requires prompt engineering by humans who
have developed expertise in using them. In this work, we present NeuroPrompts,
an adaptive framework that automatically enhances a user's prompt to improve
the quality of generations produced by text-to-image models. Our framework
utilizes constrained text decoding with a pre-trained language model that has
been adapted to generate prompts similar to those produced by human prompt
engineers. This approach enables higher-quality text-to-image generations and
provides user control over stylistic features via constraint set specification.
We demonstrate the utility of our framework by creating an interactive
application for prompt enhancement and image generation using Stable Diffusion.
Additionally, we conduct experiments utilizing a large dataset of
human-engineered prompts for text-to-image generation and show that our
approach automatically produces enhanced prompts that result in superior image
quality. We make our code and a screencast video demo of NeuroPrompts publicly
available. |
This paper introduces NeuroPrompts, a novel framework designed to automatically enhance user-provided prompts for text-to-image generation models, leading to higher-quality and more aesthetically pleasing image outputs. |
This paper is significant because it addresses the challenge of prompt engineering in text-to-image generation, making these powerful models more accessible to users without specialized expertise by automating the process of crafting effective prompts. |
The authors developed NeuroPrompts, which uses a two-stage approach: 1) Adapting a pre-trained language model (LM) to generate text similar to human prompt engineers through supervised fine-tuning and reinforcement learning with a reward model based on predicted human preferences (PickScore). 2) Employing NeuroLogic Decoding, a constrained text decoding algorithm, to generate enhanced prompts that satisfy user-specified constraints for style, artist, format, etc., while adhering to the learned prompting style. |
The authors demonstrated that NeuroPrompts consistently generates higher-quality images than un-optimized prompts and even surpasses human-authored prompts in terms of aesthetic scores. They also found that both PPO training and constrained decoding with NeuroLogic contribute to the improved performance of the framework. |
The authors acknowledge limitations in evaluating NeuroPrompts solely with Stable Diffusion and recognize the potential for societal biases inherited from the base model. Future work could focus on extending NeuroPrompts to video generation models and other domains requiring automated prompt engineering. |
diffusion_model, prompt_engineering, text-to-image, image_generation, aesthetic_quality, constrained_decoding, reinforcement_learning, ppo, neurologic, stable_diffusion, pickscore |
2311.18608 |
Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing |
Hyelin Nam, Gihyun Kwon, Geon Yeong Park, Jong Chul Ye |
With the remarkable advent of text-to-image diffusion models, image editing
methods have become more diverse and continue to evolve. A promising recent
approach in this realm is Delta Denoising Score (DDS), an image editing
technique based on the Score Distillation Sampling (SDS) framework that leverages
the rich generative prior of text-to-image diffusion models. However, relying
solely on the difference between scoring functions is insufficient for
preserving specific structural elements from the original image, a crucial
aspect of image editing. To address this, here we present an embarrassingly
simple yet very powerful modification of DDS, called Contrastive Denoising
Score (CDS), for latent diffusion models (LDM). Inspired by the similarities
and differences between DDS and the contrastive learning for unpaired
image-to-image translation (CUT), we introduce a straightforward approach using
CUT loss within the DDS framework. Rather than employing auxiliary networks as
in the original CUT approach, we leverage the intermediate features of LDM,
specifically those from the self-attention layers, which possess rich spatial
information. Our approach enables zero-shot image-to-image translation and
neural radiance field (NeRF) editing, achieving structural correspondence
between the input and output while maintaining content controllability.
Qualitative results and comparisons demonstrate the effectiveness of our
proposed method. Project page: https://hyelinnam.github.io/CDS/ |
This paper introduces Contrastive Denoising Score (CDS), a novel text-guided image editing technique for latent diffusion models that improves upon Delta Denoising Score (DDS) by incorporating a contrastive loss inspired by Contrastive Unpaired Translation (CUT). |
This paper addresses the limitation of DDS in preserving structural details during text-guided image editing. By integrating CUT loss into the DDS framework, CDS enables more effective preservation of source image structure while aligning with target text prompts, leading to improved image editing quality. |
The authors propose to extract intermediate features from the self-attention layers of the latent diffusion model and use them to calculate the CUT loss. This loss is then incorporated into the DDS framework to guide the image generation process towards better structural consistency. The authors demonstrate the effectiveness of their approach through qualitative and quantitative experiments on various text-driven image editing tasks, including comparisons with state-of-the-art methods. They also show the extensibility of CDS to other domains like Neural Radiance Fields (NeRF). |
CDS outperforms existing state-of-the-art methods in text-guided image editing by effectively regulating structural consistency while aligning with target text prompts. It achieves a better balance between preserving structural details and transforming content compared to DDS and other baselines. Furthermore, CDS demonstrates successful application in Neural Radiance Fields editing, highlighting its extensibility. |
The authors acknowledge limitations in cases of unfavorable random patch selections or unconventional object poses. Future work may explore strategies to address these limitations. Additionally, the ethical implications of image manipulation techniques like CDS are acknowledged, emphasizing the need for responsible use and regulation to prevent misuse. |
diffusion_model, image_editing, text-guided_synthesis, contrastive_learning, structure_preservation, latent_diffusion_model, nerf, zero-shot, unsupervised_learning |
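A rough sketch of the contrastive term used by CDS: features from the LDM's self-attention layers are sampled at matching spatial locations in the source and edited branches; corresponding locations act as positives and all other sampled locations as negatives, as in a PatchNCE-style loss. The feature dimensions and random patch sampling below are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def patch_contrastive_loss(feat_src, feat_edit, num_patches=256, temperature=0.07):
    """PatchNCE-style loss on self-attention features (sketch): a patch in the edited
    branch should match the patch at the same location in the source branch."""
    B, N, D = feat_src.shape
    idx = torch.randperm(N)[:num_patches]
    q = F.normalize(feat_edit[:, idx], dim=-1)          # queries from the edited image
    k = F.normalize(feat_src[:, idx], dim=-1)           # keys from the source image
    logits = torch.einsum("bid,bjd->bij", q, k) / temperature
    targets = torch.arange(num_patches).expand(B, -1)
    return F.cross_entropy(logits.reshape(-1, num_patches), targets.reshape(-1))

# Toy self-attention features: batch 1, 64x64 tokens flattened, 320 channels.
f_src = torch.randn(1, 4096, 320)
f_edit = torch.randn(1, 4096, 320)
print(patch_contrastive_loss(f_src, f_edit))
```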
2402.17113 |
Transparent Image Layer Diffusion using Latent Transparency |
Lvmin Zhang, Maneesh Agrawala |
We present LayerDiffuse, an approach enabling large-scale pretrained latent
diffusion models to generate transparent images. The method allows generation
of single transparent images or of multiple transparent layers. The method
learns a "latent transparency" that encodes alpha channel transparency into the
latent manifold of a pretrained latent diffusion model. It preserves the
production-ready quality of the large diffusion model by regulating the added
transparency as a latent offset with minimal changes to the original latent
distribution of the pretrained model. In this way, any latent diffusion model
can be converted into a transparent image generator by finetuning it with the
adjusted latent space. We train the model with 1M transparent image layer pairs
collected using a human-in-the-loop collection scheme. We show that latent
transparency can be applied to different open source image generators, or be
adapted to various conditional control systems to achieve applications like
foreground/background-conditioned layer generation, joint layer generation,
structural control of layer contents, etc. A user study finds that in most
cases (97%) users prefer our natively generated transparent content over
previous ad-hoc solutions such as generating and then matting. Users also
report the quality of our generated transparent images is comparable to real
commercial transparent assets like Adobe Stock. |
This paper introduces LayerDiffuse, a novel approach that enables large-scale pretrained latent diffusion models to generate transparent images, either as single entities or multiple transparent layers, by encoding transparency as a latent offset in the model's latent space. |
This paper is significant because it addresses the lack of research in generating transparent images and layered content despite its high demand in visual content editing. It achieves this by tackling the challenges of limited training data and the sensitivity of pretrained diffusion models to alterations in their latent space representation. |
The authors develop 'latent transparency,' a method that encodes alpha channel transparency into the latent space of a pretrained diffusion model (Stable Diffusion) without disrupting its latent distribution. They train their model using a human-in-the-loop scheme to collect a dataset of 1 million transparent image layer pairs, using GPT models to generate diverse and semantically related prompts for foreground and background layers. |
LayerDiffuse successfully generates high-quality transparent images and layers, as demonstrated through qualitative results and a user study. Users significantly preferred LayerDiffuse's native transparency over conventional generation-then-matting methods, with its quality being comparable to commercial transparent image assets. |
The authors acknowledge a limitation in balancing the generation of 'clean transparent elements' and their 'harmonious blending,' particularly when dealing with reusable elements devoid of specific illumination effects. They suggest exploring improved methods for harmonious blending as future work. |
diffusion_model, transparent_image_generation, layered_content_generation, latent_space, human-in-the-loop, image_synthesis |
2310.05654 |
No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling |
Xuwei Xu, Changlin Li, Yudong Chen, Xiaojun Chang, Jiajun Liu, Sen Wang |
Vision Transformers (ViTs) have demonstrated outstanding performance in
computer vision tasks, yet their high computational complexity prevents their
deployment in computing resource-constrained environments. Various token
pruning techniques have been introduced to alleviate the high computational
burden of ViTs by dynamically dropping image tokens. However, some undesirable
pruning at early stages may result in permanent loss of image information in
subsequent layers, consequently hindering model performance. To address this
problem, we propose IdleViT, a dynamic token-idle-based method that achieves an
excellent trade-off between performance and efficiency. Specifically, in each
layer, IdleViT selects a subset of the image tokens to participate in
computations while keeping the rest of the tokens idle and directly passing
them to this layer's output. By allowing the idle tokens to be re-selected in
the following layers, IdleViT mitigates the negative impact of improper pruning
in the early stages. Furthermore, inspired by the normalized graph cut, we
devise a token cut loss on the attention map as regularization to improve
IdleViT's token selection ability. Our method is simple yet effective and can
be extended to pyramid ViTs since no token is completely dropped. Extensive
experimental results on various ViT architectures have shown that IdleViT can
diminish the complexity of pretrained ViTs by up to 33% with no more than a
0.2% accuracy decrease on ImageNet, after finetuning for only 30 epochs.
Notably, when the keep ratio is 0.5, IdleViT outperforms the state-of-the-art
EViT on DeiT-S with 0.5% higher accuracy and even faster inference speed. The
source code is available in the supplementary material. |
This paper introduces IdleViT, a novel approach for enhancing the efficiency of Vision Transformers (ViTs) by dynamically idling tokens during inference. |
This paper is important because it addresses the computational cost of ViTs, especially for resource-constrained applications, by dynamically selecting informative tokens and idling others, leading to improved inference speed without significant accuracy degradation. |
The authors propose IdleViT, which leverages a lightweight prediction head to identify and idle less informative tokens at each layer. This is done by training the model with a keep ratio, controlling the number of active tokens. They evaluate IdleViT on ImageNet using DeiT and LV-ViT architectures and compare it to other efficient ViT models. |
IdleViT achieves significant speed improvements (up to 52%) compared to the full models with minimal accuracy loss (less than 0.3%). It outperforms other efficient ViT and convolutional models on the trade-off between accuracy and computational complexity. |
Limitations are not explicitly mentioned in the provided text. However, possible future work could involve exploring different prediction head architectures or investigating the generalization of IdleViT to other downstream tasks beyond image classification. |
vit, token_pruning, efficiency, image_classification, transformer, analysis |
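The token-idling idea above can be sketched as follows: each layer scores tokens, routes only the top-k (plus the [CLS] token) through the transformer block, and copies the remaining tokens to the output unchanged so later layers can still select them. The scoring head and block internals are simplified placeholders, not IdleViT's exact design.

```python
import torch
import torch.nn as nn

class IdleLayer(nn.Module):
    """One transformer layer with dynamic token idling (sketch)."""
    def __init__(self, dim, num_heads=6, keep_ratio=0.5):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, x):                      # x: (B, N, dim), [CLS] token at index 0
        B, N, D = x.shape
        k = max(1, int((N - 1) * self.keep_ratio))
        scores = self.score(x[:, 1:]).squeeze(-1)             # never idle the [CLS] token
        keep = scores.topk(k, dim=-1).indices + 1             # shift past [CLS]
        keep = torch.cat([torch.zeros(B, 1, dtype=torch.long), keep], dim=1)

        selected = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        processed = self.block(selected)

        out = x.clone()                                       # idle tokens pass through untouched
        out.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, D), processed)
        return out

layer = IdleLayer(dim=192)
print(layer(torch.randn(2, 197, 192)).shape)   # torch.Size([2, 197, 192])
```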
2404.03631 |
Robust Concept Erasure Using Task Vectors |
Minh Pham, Kelly O. Marshall, Chinmay Hegde, Niv Cohen |
With the rapid growth of text-to-image models, a variety of techniques have
been suggested to prevent undesirable image generations. Yet, these methods
often only protect against specific user prompts and have been shown to allow
unsafe generations with other inputs. Here we focus on unconditionally erasing
a concept from a text-to-image model rather than conditioning the erasure on
the user's prompt. We first show that compared to input-dependent erasure
methods, concept erasure that uses Task Vectors (TV) is more robust to
unexpected user inputs, not seen during training. However, TV-based erasure can
also affect the core performance of the edited model, particularly when the
required edit strength is unknown. To this end, we propose a method called
Diverse Inversion, which we use to estimate the required strength of the TV
edit. Diverse Inversion finds within the model input space a large set of word
embeddings, each of which induces the generation of the target concept. We find
that encouraging diversity in the set makes our estimation more robust to
unexpected prompts. Finally, we show that Diverse Inversion enables us to apply
a TV edit only to a subset of the model weights, enhancing the erasure
capabilities while better maintaining the core functionality of the model. |
This paper proposes a novel method for removing unsafe concepts from text-to-image models using Task Vectors (TV) in a way that is independent of specific user prompts, making it more robust than existing input-dependent concept erasure methods. |
The paper addresses the critical challenge of preventing the generation of undesirable content from text-to-image models, a growing concern as these models become increasingly powerful. It highlights the limitations of existing concept erasure techniques that primarily focus on specific user prompts and demonstrates the vulnerability of such approaches to adversarial attacks. The proposed method offers a more robust solution by aiming for unconditional concept erasure. |
The authors propose a three-part method: (1) Diverse Inversion: This technique finds a diverse set of token embeddings that can generate the unsafe concept, enabling a more comprehensive evaluation of the model's safety. (2) TV Edit Strength Tuning: Using the diverse set of adversarial prompts, the authors determine an optimal edit strength for the TV that effectively suppresses unsafe generation while preserving the model's utility on unrelated tasks. (3) TV Weight Sub-selection: The authors explore pruning specific layers of the TV weights to further enhance the trade-off between concept erasure and model performance. |
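A minimal sketch of a task-vector edit under common conventions; the edit strength `alpha` and the `layer_filter` sub-selection are the quantities the paper tunes, and the names are illustrative:

```python
from typing import Dict
import torch

def task_vector(base: Dict[str, torch.Tensor], finetuned: Dict[str, torch.Tensor]):
    """Task vector = parameter difference between a model finetuned on the concept and the base model."""
    return {k: finetuned[k] - base[k] for k in base}

def erase_concept(base, tv, alpha: float, layer_filter=lambda name: True):
    """Subtract the (optionally sub-selected) task vector to steer the model away from the concept.

    alpha is the edit strength estimated via Diverse Inversion; layer_filter implements
    the weight sub-selection idea (e.g. editing only a subset of layers).
    """
    return {k: base[k] - alpha * tv[k] if layer_filter(k) else base[k] for k in base}
```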
The paper demonstrates that TV-based concept erasure is more resistant to adversarial attacks compared to existing methods, showing robustness against techniques like Concept Inversion and Ring-A-Bell. The proposed Diverse Inversion method proves effective in finding a wide range of adversarial prompts, allowing for better estimation of the TV edit strength. Additionally, the authors show that sub-selecting TV weights can lead to a better balance between concept erasure and preserving the model's functionality on unrelated tasks. |
The paper acknowledges limitations such as the lack of provable guarantees for erasure against unknown future adversarial methods and the dependence on the Diverse Inversion set for hyperparameter tuning. Future work could focus on exploring the application of TV-based erasure for more fine-grained concept removal and extending the approach to other modalities like language models. |
diffusion_model, gan, adversarial_attack, interpretability, text-to-image, concept_erasure, safety |
2405.11473 |
FIFO-Diffusion: Generating Infinite Videos from Text without Training |
Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han |
We propose a novel inference technique based on a pretrained diffusion model
for text-conditional video generation. Our approach, called FIFO-Diffusion, is
conceptually capable of generating infinitely long videos without training.
This is achieved by iteratively performing diagonal denoising, which
concurrently processes a series of consecutive frames with increasing noise
levels in a queue; our method dequeues a fully denoised frame at the head while
enqueuing a new random noise frame at the tail. However, diagonal denoising is
a double-edged sword as the frames near the tail can take advantage of cleaner
ones by forward reference but such a strategy induces the discrepancy between
training and inference. Hence, we introduce latent partitioning to reduce the
training-inference gap and lookahead denoising to leverage the benefit of
forward referencing. We have demonstrated the promising results and
effectiveness of the proposed methods on existing text-to-video generation
baselines. |
This paper presents FIFO-Diffusion, a novel inference technique based on pretrained diffusion models for generating arbitrarily long text-conditional videos without additional training. |
This paper is significant because it addresses the limitations of existing long video generation methods that suffer from temporal inconsistency or high computational cost, enabling the generation of high-quality, coherent videos of any length using only a pretrained model. |
The authors introduce diagonal denoising, a process that concurrently handles multiple frames with increasing noise levels within a queue. To mitigate the training-inference discrepancy introduced by diagonal denoising, they further propose latent partitioning and lookahead denoising, which refine the noise level differences and improve denoising accuracy, respectively. |
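A rough sketch of the diagonal-denoising queue, assuming stand-in `denoise_step` and `sample_noise` callables; latent partitioning and lookahead denoising are omitted:

```python
import torch

def fifo_diffusion(denoise_step, sample_noise, num_frames_out, queue_len, num_steps):
    """Diagonal denoising sketch: a queue of `queue_len` latent frames at increasing noise levels.

    denoise_step(latents, timesteps) -> latents advanced by one denoising step per frame
    sample_noise() -> a fresh fully-noised latent frame, shape (C, H, W)
    Each iteration: every queued frame advances one step, the head (now clean) is dequeued,
    and a new pure-noise frame is enqueued at the tail.
    """
    # Timesteps are staggered so the head is nearly clean and the tail is pure noise.
    timesteps = torch.linspace(0, num_steps - 1, queue_len).long()
    queue = [sample_noise() for _ in range(queue_len)]
    video = []
    while len(video) < num_frames_out:
        latents = torch.stack(queue)                # (queue_len, C, H, W)
        latents = denoise_step(latents, timesteps)  # concurrent denoising at different noise levels
        queue = list(latents)
        video.append(queue.pop(0))                  # dequeue fully denoised frame at the head
        queue.append(sample_noise())                # enqueue fresh noise at the tail
    return torch.stack(video)
```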
FIFO-Diffusion demonstrates impressive results in generating extremely long videos (over 10,000 frames) with consistent quality and smooth motion, outperforming existing methods like FreeNoise and Gen-L-Video. It also showcases the ability to seamlessly transition between multiple prompts, enabling the creation of diverse and engaging video content. |
The authors acknowledge the remaining training-inference gap due to the alteration of input distribution caused by diagonal denoising. Future work includes integrating the diagonal denoising paradigm into the training process to further improve performance and reduce this gap. |
diffusion_model, video, text-to-video, long_video_generation, analysis |
2404.07449 |
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs |
Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin |
Integration of Large Language Models (LLMs) into visual domain tasks,
resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in
vision-language tasks, particularly for visual question answering (VQA).
However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial
reasoning and localization awareness. Despite generating highly descriptive and
elaborate textual answers, these models fail at simple tasks like
distinguishing a left vs right location. In this work, we explore how
image-space coordinate based instruction fine-tuning objectives could inject
spatial awareness into V-LLMs. We discover optimal coordinate representations,
data-efficient instruction fine-tuning objectives, and pseudo-data generation
strategies that lead to improved spatial awareness in V-LLMs. Additionally, our
resulting model improves VQA across image and video domains, reduces undesired
hallucination, and generates better contextual object descriptions. Experiments
across 5 vision-language tasks involving 14 different datasets establish the
clear performance improvements achieved by our proposed framework. |
This paper introduces a novel framework called Locate-Anything to enhance the spatial awareness of Visual-LLMs (V-LLMs) by incorporating textual image-space coordinates into both the input prompts and the LLM-generated outputs. |
This research is important because it addresses a critical limitation of current V-LLMs: their weak spatial reasoning and localization abilities. By improving the spatial awareness of V-LLMs, this work enables more comprehensive visual understanding and opens up new possibilities for vision-language tasks. |
The authors propose three novel instruction fine-tuning objectives that leverage textual coordinate representations: Location Prediction, Negative Prediction, and Reverse-Location Prediction. They explore different coordinate representation schemes and introduce pseudo-data generation strategies to enhance data efficiency and extend the framework to video domains. |
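A hypothetical sketch of how a bounding box could be turned into textual-coordinate instruction pairs for the Location and Reverse-Location objectives; the prompt templates, normalization, and binning are assumptions, not the paper's exact format:

```python
def location_prediction_sample(object_name, box, image_w, image_h, bins=100):
    """Build instruction-tuning pairs that express object location as plain text coordinates.

    box = (x1, y1, x2, y2) in pixels; coordinates are normalized and discretized into `bins`
    integer buckets so they can be emitted as ordinary text tokens.
    """
    x1, y1, x2, y2 = box
    norm = lambda v, s: int(round(v / s * (bins - 1)))
    coords = (norm(x1, image_w), norm(y1, image_h), norm(x2, image_w), norm(y2, image_h))
    answer = f"[{coords[0]}, {coords[1]}, {coords[2]}, {coords[3]}]"
    return {
        "prompt": f"Where is the {object_name} in the image? Answer with coordinates.",
        "answer": answer,
        # Reverse-Location Prediction swaps the roles: give coordinates, ask for the object.
        "reverse_prompt": f"What object is located at {answer}?",
        "reverse_answer": object_name,
    }
```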
The proposed Locate-Anything model demonstrates significant improvements in spatial reasoning, outperforming existing V-LLMs in tasks like distinguishing object positions. It achieves state-of-the-art results on Image VQA, Video VQA, and Region Description benchmarks while effectively reducing object hallucination. |
The paper identifies limitations in understanding temporal locations for video-based tasks, suggesting future work on incorporating time coordinates. Additionally, potential biases within training datasets are acknowledged, highlighting the need for careful consideration during model deployment. |
llm, vlm, vqa, video, spatial_reasoning, localization, instruction_tuning
2405.01536 |
Customizing Text-to-Image Models with a Single Image Pair |
Maxwell Jones, Sheng-Yu Wang, Nupur Kumari, David Bau, Jun-Yan Zhu |
Art reinterpretation is the practice of creating a variation of a reference
work, making a paired artwork that exhibits a distinct artistic style. We ask
if such an image pair can be used to customize a generative model to capture
the demonstrated stylistic difference. We propose Pair Customization, a new
customization method that learns stylistic difference from a single image pair
and then applies the acquired style to the generation process. Unlike existing
methods that learn to mimic a single concept from a collection of images, our
method captures the stylistic difference between paired images. This allows us
to apply a stylistic change without overfitting to the specific image content
in the examples. To address this new task, we employ a joint optimization
method that explicitly separates the style and content into distinct LoRA
weight spaces. We optimize these style and content weights to reproduce the
style and content images while encouraging their orthogonality. During
inference, we modify the diffusion process via a new style guidance based on
our learned weights. Both qualitative and quantitative experiments show that
our method can effectively learn style while avoiding overfitting to image
content, highlighting the potential of modeling such stylistic differences from
a single image pair. |
This paper introduces a novel method, Pair Customization, for customizing text-to-image models using a single image pair to learn stylistic differences. |
This paper is significant because it addresses the limitations of existing model customization techniques that often overfit to content when learning styles from single or few-shot image examples. By using an image pair, the method can better disentangle style from content, enabling more effective and generalizable style transfer. |
The authors propose a joint optimization method using separate LoRA weights for style and content. Content LoRA reconstructs the content image, while style LoRA learns the stylistic difference between the pair. They further enforce orthogonality between style and content LoRA parameters for better disentanglement. At inference, they introduce 'style guidance', integrating style LoRA predictions into the denoising process for improved style control and content preservation. |
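A minimal sketch of an orthogonality penalty between the style and content LoRA factors; which LoRA factor is constrained and the exact loss form are assumptions:

```python
import torch

def lora_orthogonality_loss(content_As, style_As):
    """Encourage the style and content LoRA subspaces to be orthogonal.

    content_As / style_As: lists of LoRA down-projection matrices, one per adapted layer,
    each of shape (rank, in_features). Penalizes the squared Frobenius norm of their
    cross-products so the two updates occupy (nearly) orthogonal subspaces.
    """
    loss = 0.0
    for A_c, A_s in zip(content_As, style_As):
        loss = loss + (A_c @ A_s.transpose(0, 1)).pow(2).sum()
    return loss
```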
The proposed method demonstrates superior performance in capturing and applying stylistic differences compared to existing baselines. It effectively preserves the structure of the input content while applying the learned style, as demonstrated through quantitative metrics like perceptual distance and a human preference study. |
The paper acknowledges limitations in handling significantly different categories from the training pair and computational demands of test-time optimization. Future work could explore encoder-based approaches for faster customization and improving style transfer across broader categories. |
diffusion_model, gan, customization, style_transfer, image_generation, lora, orthogonal, disentanglement, style_guidance |
2402.15179 |
Advancing Parameter Efficiency in Fine-tuning via Representation Editing |
Muling Wu, Wenhao Liu, Xiaohua Wang, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang |
Parameter Efficient Fine-Tuning (PEFT) has gained significant attention for
its ability to achieve competitive results while updating only a small subset
of trainable parameters. Despite the promising performance of current PEFT
methods, they present challenges in hyperparameter selection, such as
determining the rank of LoRA or Adapter, or specifying the length of soft
prompts. In addressing these challenges, we propose a novel approach to
fine-tuning neural models, termed Representation EDiting (RED), which scales
and biases the representation produced at each layer. RED substantially reduces
the number of trainable parameters by a factor of $25,700$ compared to full
parameter fine-tuning, and by a factor of $32$ compared to LoRA. Remarkably,
RED achieves comparable or superior results to full parameter fine-tuning and
other PEFT methods. Extensive experiments were conducted across models of
varying architectures and scales, including RoBERTa, GPT-2, T5, and Llama-2,
and the results demonstrate the efficiency and efficacy of RED, positioning it
as a promising PEFT approach for large neural models. |
This paper introduces Representation EDiting (RED), a novel parameter-efficient fine-tuning (PEFT) method that scales and biases representations at each layer of a pre-trained language model to adapt it to downstream tasks. |
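A minimal sketch of the per-layer edit RED applies, assuming a standard Transformer hidden size; only the scale and bias vectors are trainable while the backbone stays frozen:

```python
import torch
import torch.nn as nn

class RepresentationEdit(nn.Module):
    """Learnable edit vectors: element-wise scale and bias applied to a hidden representation."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(hidden_size))   # initialized to the identity edit
        self.bias = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size), e.g. the FFN sub-layer output
        return hidden_states * self.scale + self.bias
```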
The paper addresses the limitations of existing PEFT methods in terms of hyperparameter selection and parameter efficiency. It proposes RED as a more efficient and effective alternative to fine-tune large language models, reducing the number of trainable parameters significantly while achieving comparable or superior performance. |
The authors evaluate RED on a variety of language models (RoBERTa, GPT-2, T5, Llama-2) and NLP tasks (GLUE benchmark, E2E NLG Challenge, UltraFeedback, Open LLM Leaderboard, AlpacaEval, MT-Bench). They compare RED against several baselines, including full fine-tuning, Adapter, LoRA, BitFit, and Prompt Tuning. Ablation studies were conducted to analyze the impact of different components of RED, such as the type and position of 'edit vectors'. |
RED consistently achieves comparable or better performance than other PEFT methods while using significantly fewer trainable parameters. For instance, RED requires 25,700 times fewer parameters than full fine-tuning and 32 times fewer than LoRA on Llama-2 7B while achieving comparable or even better results across different benchmarks. Ablation studies show that both scaling and bias vectors contribute to RED's performance, and editing representations after the FFN sub-layer is the most effective strategy. |
The authors acknowledge that RED's application in other modalities like computer vision and speech recognition needs further investigation. They plan to explore RED in few-shot learning scenarios to enhance its data efficiency. |
peft, fine-tuning, representation_learning, language_model, efficiency, llama-2, gpt-2, roberta, t5, nlu, nlg |
2405.03150 |
Video Diffusion Models: A Survey |
Andrew Melnik, Michal Ljubljanac, Cong Lu, Qi Yan, Weiming Ren, Helge Ritter |
Diffusion generative models have recently become a robust technique for
producing and modifying coherent, high-quality video. This survey offers a
systematic overview of critical elements of diffusion models for video
generation, covering applications, architectural choices, and the modeling of
temporal dynamics. Recent advancements in the field are summarized and grouped
into development trends. The survey concludes with an overview of remaining
challenges and an outlook on the future of the field. Website:
https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models |
This paper presents a comprehensive survey of diffusion models for video generation, focusing on their applications, architectures, methods for modeling temporal dynamics, and training procedures. |
This survey is important due to the rapid progress and transformative potential of diffusion models in video generation. It provides a valuable resource for researchers and practitioners by summarizing key advancements, identifying trends, and highlighting remaining challenges in the field. |
The authors conduct a systematic literature review, analyzing and categorizing existing research on video diffusion models based on various criteria. They provide a taxonomy of applications, discuss architectural choices, and delve into methods for modeling temporal dynamics. The authors also review training strategies and evaluation metrics commonly employed in this domain. |
Key findings include the increasing utilization of latent diffusion models for efficient, high-resolution video generation, the dominance of UNet architectures with modifications for temporal consistency, and the prevalence of pre-trained text-to-image models as backbones for video generation and editing. The survey also highlights the challenges posed by limited labeled video data and the need for better representation of temporal dependencies in videos. |
The authors identify several limitations and avenues for future work, including the need for larger, accurately labeled video datasets, improved methods for representing complex temporal relationships in videos, and exploration of alternative architectures capable of handling long-term temporal dependencies more effectively. Furthermore, the authors suggest exploring real-time video-to-video translation and more sophisticated video description methods beyond simple text labels. |
diffusion_model, video, generation, editing, survey, temporal_dynamics, latent_diffusion_model, unet, attention_mechanism, transformer |
2402.11131 |
Speculative Streaming: Fast LLM Inference without Auxiliary Models |
Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi |
Speculative decoding is a prominent technique to speed up the inference of a
large target language model based on predictions of an auxiliary draft model.
While effective, in application-specific settings, it often involves
fine-tuning both draft and target models to achieve high acceptance rates. As
the number of downstream tasks grows, these draft models add significant
complexity to inference systems. We propose Speculative Streaming, a
single-model speculative decoding method that fuses drafting into the target
model by changing the fine-tuning objective from next token prediction to
future n-gram prediction. Speculative Streaming speeds up decoding by 1.8 -
3.1X in a diverse set of tasks, such as Summarization, Structured Queries, and
Meaning Representation, without sacrificing generation quality. Additionally,
Speculative Streaming is parameter-efficient. It achieves on-par/higher
speed-ups than Medusa-style architectures while using ~10000X fewer extra
parameters, making it well-suited for resource-constrained devices. |
This paper introduces Speculative Streaming, a single-model speculative decoding approach that accelerates large language model inference by fusing drafting into the target model, changing the objective from next-token to future n-gram prediction. |
This work is important because it addresses the limitations of traditional speculative decoding methods that rely on separate, resource-intensive draft models, thereby simplifying deployment and improving efficiency for large language model inference, especially on resource-constrained devices. |
The authors introduce multi-stream attention into the target model for n-gram prediction, enabling parallel speculation and verification of candidate tokens within a single forward pass. They utilize tree-structured drafting for efficient exploration of candidate sequences and employ a pruning strategy based on transition probabilities to manage computational cost. |
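A minimal sketch of the "hard" verification rule referenced in the limitations below: drafted tokens are accepted only while they match the target model's own greedy choices (the multi-stream drafting itself is omitted):

```python
import torch

def verify_draft(target_logits: torch.Tensor, draft_tokens: torch.Tensor) -> int:
    """Return how many speculated tokens to accept under hard matching.

    target_logits: (k, vocab) logits the target model produced at the k drafted positions
    draft_tokens:  (k,) tokens speculated by the extra streams
    """
    greedy = target_logits.argmax(dim=-1)
    accepted = 0
    for drafted, verified in zip(draft_tokens.tolist(), greedy.tolist()):
        if drafted != verified:
            break
        accepted += 1
    return accepted
```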
Speculative Streaming achieves 1.8-3.1X speedup across tasks like summarization, structured queries, and meaning representation without sacrificing generation quality. It also demonstrates comparable or superior performance to Medusa, a recent block-wise decoding model, while using significantly fewer parameters, making it ideal for resource-constrained devices. |
The authors acknowledge that the current implementation uses a "hard" matching criterion for draft verification and suggest exploring "soft" matching for potential speedup gains. Future work may involve investigating alternative stream initialization techniques beyond the explored value rotation and dedicated embeddings. |
llm, speculative_decoding, inference_acceleration, parameter_efficient, resource_constrained
2311.12832 |
Toward effective protection against diffusion based mimicry through score distillation |
Haotian Xue, Chumeng Liang, Xiaoyu Wu, Yongxin Chen |
While generative diffusion models excel in producing high-quality images,
they can also be misused to mimic authorized images, posing a significant
threat to AI systems. Efforts have been made to add calibrated perturbations to
protect images from diffusion-based mimicry pipelines. However, most of the
existing methods are too ineffective and even impractical to be used by
individual users due to their high computation and memory requirements. In this
work, we present novel findings on attacking latent diffusion models (LDM) and
propose new plug-and-play strategies for more effective protection. In
particular, we explore the bottleneck in attacking an LDM, discovering that the
encoder module rather than the denoiser module is the vulnerable point. Based
on this insight, we present our strategy using Score Distillation Sampling
(SDS) to double the speed of protection and reduce memory occupation by half
without compromising its strength. Additionally, we provide a robust protection
strategy by counterintuitively minimizing the semantic loss, which can assist
in generating more natural perturbations. Finally, we conduct extensive
experiments to substantiate our findings and comprehensively evaluate our newly
proposed strategies. We hope our insights and protective measures can
contribute to better defense against malicious diffusion-based mimicry,
advancing the development of secure AI systems. The code is available in
https://github.com/xavihart/Diff-Protect |
This paper investigates the vulnerability of Latent Diffusion Models (LDMs) to adversarial attacks, particularly in the context of protecting images from unauthorized mimicry. |
The paper is important because it addresses the growing concern of malicious use of LDMs for creating unauthorized digital replicas, and it proposes more efficient and effective methods for protecting images from such misuse. |
The authors analyze the bottleneck in attacking LDMs, revealing the encoder as the vulnerable component. They introduce Score Distillation Sampling (SDS) to accelerate protection, explore the effectiveness of minimizing semantic loss, and conduct extensive experiments on various mimicry scenarios (SDEdit, inpainting, textual inversion) to evaluate their proposed strategies. |
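A minimal PGD-style sketch of attacking only the LDM encoder, which the paper identifies as the vulnerable component; the budget, step size, and plain L2 objective are illustrative, and the SDS acceleration is not shown:

```python
import torch

def protect_image(image, encoder, steps=40, step_size=1 / 255, eps=8 / 255):
    """Craft a bounded perturbation that pushes the LDM encoder's latent away from the original.

    image: (1, 3, H, W) in [0, 1]; encoder: maps images to latents (e.g. the LDM's VAE encoder).
    """
    delta = torch.zeros_like(image, requires_grad=True)
    with torch.no_grad():
        target_latent = encoder(image)
    for _ in range(steps):
        # Negative distance: minimizing this loss maximizes the latent-space displacement.
        loss = -(encoder(image + delta) - target_latent).pow(2).mean()
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()
            delta.clamp_(-eps, eps)                       # stay within the perturbation budget
            delta.add_(image).clamp_(0, 1).sub_(image)    # keep the protected image in [0, 1]
        delta.grad = None
    return (image + delta).detach()
```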
Key findings include: (1) The encoder of an LDM is significantly more vulnerable to attacks than the denoiser module. (2) Minimizing semantic loss can be an effective protection strategy, producing more natural perturbations compared to maximizing it. (3) SDS accelerates protection by 50% without sacrificing effectiveness. (4) The proposed strategies outperform existing methods in terms of protection strength, perturbation naturalness, and computational efficiency. |
The paper mainly focuses on LDMs and future work could explore attacks on pixel-based diffusion models. Additionally, investigating the robustness of the proposed protections against various defense methods is crucial for real-world deployment. |
diffusion_model, ldm, adversarial_attack, image_protection, mimicry, sds, semantic_loss |
2405.01356 |
Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance |
Kelvin C. K. Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, Huisheng Wang |
In subject-driven text-to-image synthesis, the synthesis process tends to be
heavily influenced by the reference images provided by users, often overlooking
crucial attributes detailed in the text prompt. In this work, we propose
Subject-Agnostic Guidance (SAG), a simple yet effective solution to remedy the
problem. We show that through constructing a subject-agnostic condition and
applying our proposed dual classifier-free guidance, one could obtain outputs
consistent with both the given subject and input text prompts. We validate the
efficacy of our approach through both optimization-based and encoder-based
methods. Additionally, we demonstrate its applicability in second-order
customization methods, where an encoder-based model is fine-tuned with
DreamBooth. Our approach is conceptually simple and requires only minimal code
modifications, but leads to substantial quality improvements, as evidenced by
our evaluations and user studies. |
This paper presents Subject-Agnostic Guidance (SAG), a method for subject-driven text-to-image synthesis that addresses the issue of models overlooking text prompts in favor of matching subject images by balancing subject fidelity with adherence to text descriptions. |
This paper is important because it tackles the problem of "content ignorance" in subject-driven text-to-image synthesis, where models often prioritize mimicking the subject image over following the text prompt. The proposed SAG method offers a simple yet effective solution to improve text alignment without sacrificing subject fidelity, thereby enhancing the quality and diversity of generated images. |
The authors propose Subject-Agnostic Guidance (SAG) which constructs a subject-agnostic embedding from the user input and utilizes a dual classifier-free guidance (DCFG) technique. DCFG leverages both the subject-aware and subject-agnostic embeddings to guide the generation process towards a more balanced output. The method is validated by applying it to various existing synthesis approaches including optimization-based and encoder-based methods, as well as in second-order customization using DreamBooth. |
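One plausible form of the dual classifier-free guidance combination, written as a sketch; the weights and exact arrangement of terms are assumptions, not necessarily the paper's equation:

```python
def dual_cfg(eps_uncond, eps_agnostic, eps_aware, w_text=7.5, w_subject=7.5):
    """Combine noise predictions from three conditions into one guided prediction.

    eps_uncond:   prediction with the null condition
    eps_agnostic: prediction with the subject-agnostic text embedding
    eps_aware:    prediction with the subject-aware embedding (reference subject injected)
    The first guidance term enforces the text prompt; the second adds subject fidelity on top.
    """
    return (eps_uncond
            + w_text * (eps_agnostic - eps_uncond)
            + w_subject * (eps_aware - eps_agnostic))
```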
The paper demonstrates that SAG effectively improves text alignment in generated images while maintaining high subject fidelity. Evaluations using CLIP and DINO scores show improvements in both text and subject similarity. User studies also confirm the effectiveness of SAG, with a majority of users preferring the generated results over existing methods like DreamBooth, Textual Inversion, and ELITE. |
The authors acknowledge that the quality of outputs still relies on the underlying generative model and may be suboptimal for complex or uncommon content. Future work could explore incorporating more robust synthesis networks. Additionally, they emphasize the ethical implications of such technology, particularly its potential for misuse. Future research should address these concerns by developing detection mechanisms to prevent the spread of misinformation. |
diffusion_model, text-to-image, image_synthesis, subject-driven, classifier-free_guidance, dreambooth, textual_inversion |
2404.03913 |
Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models |
Gihyun Kwon, Simon Jenni, Dingzeyu Li, Joon-Young Lee, Jong Chul Ye, Fabian Caba Heilbron |
While there has been significant progress in customizing text-to-image
generation models, generating images that combine multiple personalized
concepts remains challenging. In this work, we introduce Concept Weaver, a
method for composing customized text-to-image diffusion models at inference
time. Specifically, the method breaks the process into two steps: creating a
template image aligned with the semantics of input prompts, and then
personalizing the template using a concept fusion strategy. The fusion strategy
incorporates the appearance of the target concepts into the template image
while retaining its structural details. The results indicate that our method
can generate multiple custom concepts with higher identity fidelity compared to
alternative approaches. Furthermore, the method is shown to seamlessly handle
more than two concepts and closely follow the semantic meaning of the input
prompt without blending appearances across different subjects. |
This paper introduces Concept Weaver, a novel method for generating images with multiple customized concepts by combining personalized text-to-image diffusion models at inference time using a template image and a concept fusion strategy. |
This paper addresses the challenge of generating images with multiple personalized concepts, which is important for enabling more creative and diverse content creation using text-to-image generation models. Concept Weaver offers advantages over previous approaches by improving concept fidelity, handling more concepts, and closely following the semantics of input prompts. |
Concept Weaver involves five steps: (1) fine-tuning a pre-trained text-to-image model for each target concept, (2) generating a non-personalized template image, (3) extracting latent representations from the template image, (4) identifying regions corresponding to target concepts in the template image, and (5) fusing the latent representations, targeted regions, and personalized models to reconstruct the template image with the desired concepts. |
Concept Weaver demonstrates superior performance in generating multiple custom concepts with higher fidelity than baseline methods. It effectively handles more than two concepts, preserves the appearance of semantically related concepts without blending, and achieves high CLIP scores, indicating better text-image alignment. Furthermore, it's flexible enough to be used with both full fine-tuning and Low-Rank adaptation strategies. |
The paper mentions limitations in generating images from extremely complex or unrealistic text prompts due to limitations in the pre-trained Stable Diffusion model. Future work could focus on addressing this by using improved diffusion model backbones. Additionally, ethical concerns regarding the potential misuse of the technology for generating privacy-sensitive content are acknowledged, suggesting a need for appropriate content filtering systems. |
diffusion_model, text-to-image, image_generation, multi-concept, personalization, concept fusion, lora, clip |
2404.05717 |
SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing |
Jing Gu, Yilin Wang, Nanxuan Zhao, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Xin Eric Wang |
Effective editing of personal content plays a pivotal role in enabling individuals to express their creativity, weave captivating narratives within their visual stories, and elevate the overall quality and impact of their visual content. Therefore, in this work, we introduce SwapAnything, a novel
framework that can swap any objects in an image with personalized concepts
given by the reference, while keeping the context unchanged. Compared with
existing methods for personalized subject swapping, SwapAnything has three
unique advantages: (1) precise control of arbitrary objects and parts rather
than the main subject, (2) more faithful preservation of context pixels, (3)
better adaptation of the personalized concept to the image. First, we propose
targeted variable swapping to apply region control over latent feature maps and
swap masked variables for faithful context preservation and initial semantic
concept swapping. Then, we introduce appearance adaptation, to seamlessly adapt
the semantic concept into the original image in terms of target location,
shape, style, and content during the image generation process. Extensive
results on both human and automatic evaluation demonstrate significant
improvements of our approach over baseline methods on personalized swapping.
Furthermore, SwapAnything shows its precise and faithful swapping abilities
across single object, multiple objects, partial object, and cross-domain
swapping tasks. SwapAnything also achieves great performance on text-based
swapping and tasks beyond swapping such as object insertion. |
This paper introduces SwapAnything, a novel framework for personalized object swapping in images using pre-trained diffusion models, enabling precise replacement of arbitrary objects with personalized concepts while preserving the background context. |
This paper is important as it addresses limitations in existing personalized image editing techniques, enabling precise and localized swapping of arbitrary objects while maintaining stylistic consistency and preserving background context, with potential applications in e-commerce, entertainment, and professional editing. |
The authors propose the SwapAnything framework, which leverages pre-trained diffusion models. They introduce 'targeted variable swapping' for precise object replacement and 'appearance adaptation' to seamlessly integrate the new object into the source image's style, scale, and content, ensuring a cohesive visual result. |
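A minimal sketch of targeted variable swapping at a single denoising step, assuming a binary object mask at latent resolution:

```python
import torch

def targeted_variable_swap(source_vars, concept_vars, mask):
    """Swap latent variables only inside the target region; keep everything else from the source.

    source_vars / concept_vars: (B, C, H, W) latent feature maps from the source image and the
    personalized-concept branch at the same denoising step.
    mask: (B, 1, H, W) float mask in {0, 1} marking the object to be swapped.
    """
    return mask * concept_vars + (1.0 - mask) * source_vars
```

Because latents outside the mask are copied from the source branch, context pixels are preserved essentially by construction.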
SwapAnything demonstrates superior performance in personalized object swapping tasks, including single-object, multi-object, partial-object, and cross-domain swapping, as evidenced by human and automatic evaluations. It outperforms baselines in preserving background context, accurately swapping object identities, and maintaining overall image quality. Furthermore, SwapAnything exhibits promising results in text-based swapping and object insertion tasks. |
The authors acknowledge limitations in reconstructing intricate details within the masked area and handling objects with high degrees of freedom. Future work will focus on addressing these limitations by incorporating explicit alignment mechanisms and extending the framework to 3D/video object swapping. |
diffusion_model, image_editing, object_swapping, personalized_editing, appearance_adaptation, context_preservation, text-based_editing |
2310.13267 |
On the Language Encoder of Contrastive Cross-modal Models |
Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji |
Contrastive cross-modal models such as CLIP and CLAP aid various
vision-language (VL) and audio-language (AL) tasks. However, there has been
limited investigation of and improvement in their language encoder, which is
the central component of encoding natural language descriptions of image/audio
into vector representations. We extensively evaluate how unsupervised and
supervised sentence embedding training affect language encoder quality and
cross-modal task performance. In VL pretraining, we found that sentence embedding training improves language encoder quality and aids in cross-modal tasks,
improving contrastive VL models such as CyCLIP. In contrast, AL pretraining
benefits less from sentence embedding training, which may result from the
limited amount of pretraining data. We analyze the representation spaces to
understand the strengths of sentence embedding training, and find that it
improves text-space uniformity, at the cost of decreased cross-modal alignment. |
This paper investigates the impact of incorporating sentence embedding training, both unsupervised and supervised, during the pretraining of contrastive cross-modal models like CLIP and CLAP for vision-language (VL) and audio-language (AL) tasks. |
This paper addresses the crucial need to improve the language understanding capabilities of cross-modal models, especially as these models are increasingly being pretrained on massive datasets. By focusing on enhancing the language encoder through sentence embedding training, the authors aim to boost the performance of these models on a variety of tasks. |
The authors pretrain VL and AL models with different combinations of training objectives, including cross-modal contrastive loss, cyclic losses for cross-modal and in-modal consistency, and unsupervised/supervised sentence embedding losses. They evaluate the pretrained models on tasks like zero-shot image/audio classification, image-text/audio-text retrieval, and SentEval benchmark. Additionally, they analyze the representation spaces of the trained models in terms of alignment and uniformity. |
The results show that unsupervised sentence embedding training generally improves both the language encoder quality and the performance on VL tasks, leading to a better CyCLIP model. However, the benefits are less pronounced and noisier in AL pretraining, possibly due to the limited size of AL datasets and the use of pretrained encoders. The analysis of representation spaces reveals that sentence embedding training enhances the uniformity of the text representation space, but at the cost of slightly decreased cross-modal alignment. |
The authors acknowledge limitations in terms of modality scope (excluding music), the use of pretrained encoders for AL pretraining, and the lack of extensive prompt engineering for audio. Future work could address these limitations by incorporating the music modality, exploring pretraining strategies that adapt language encoders to the audio domain, and investigating prompt engineering techniques specifically for audio-language tasks. |
clip, clap, contrastive_learning, sentence_embedding, vision-language, audio-language, representation_learning
2401.17879 |
AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error |
Jonas Ricker, Denis Lukovnikov, Asja Fischer |
With recent text-to-image models, anyone can generate deceptively realistic
images with arbitrary contents, fueling the growing threat of visual
disinformation. A key enabler for generating high-resolution images with low
computational cost has been the development of latent diffusion models (LDMs).
In contrast to conventional diffusion models, LDMs perform the denoising
process in the low-dimensional latent space of a pre-trained autoencoder (AE)
instead of the high-dimensional image space. Despite their relevance, the
forensic analysis of LDMs is still in its infancy. In this work we propose
AEROBLADE, a novel detection method which exploits an inherent component of
LDMs: the AE used to transform images between image and latent space. We find
that generated images can be more accurately reconstructed by the AE than real
images, allowing for a simple detection approach based on the reconstruction
error. Most importantly, our method is easy to implement and does not require
any training, yet nearly matches the performance of detectors that rely on
extensive training. We empirically demonstrate that AEROBLADE is effective
against state-of-the-art LDMs, including Stable Diffusion and Midjourney.
Beyond detection, our approach allows for the qualitative analysis of images,
which can be leveraged for identifying inpainted regions. We release our code
and data at https://github.com/jonasricker/aeroblade . |
This paper introduces AEROBLADE, a novel method for detecting images generated by Latent Diffusion Models (LDMs) by exploiting the reconstruction error of the autoencoder (AE) used in the LDM pipeline. |
The paper addresses the growing threat of visual disinformation fueled by the increasing realism and accessibility of AI-generated images. AEROBLADE provides a simple, training-free, and effective method for detecting these images, which is crucial for combating misinformation. |
The authors leverage the observation that LDMs' AEs reconstruct generated images more accurately than real images. They calculate the reconstruction error between an input image and its reconstruction after passing through the LDM's AE. By comparing the error against a threshold, AEROBLADE can determine if an image is real or generated. The authors evaluate AEROBLADE on a dataset of images generated by various LDMs and compare its performance against existing detection methods. |
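A minimal sketch of the detection rule, assuming a generic autoencoder object exposing `encode`/`decode` (a diffusers VAE would need a thin wrapper); the paper uses a perceptual LPIPS distance, while plain L1 is used here to stay self-contained:

```python
import torch

def aeroblade_score(image, autoencoder, distance=None):
    """Reconstruction-error score: a low error suggests the image was generated by the LDM.

    image: (1, 3, H, W) tensor; autoencoder: any object with encode(x) -> latent and
    decode(latent) -> image.
    """
    with torch.no_grad():
        recon = autoencoder.decode(autoencoder.encode(image))
    if distance is None:
        distance = lambda a, b: (a - b).abs().mean().item()   # L1 stand-in for LPIPS
    return distance(image, recon)

# Classify by comparing against a threshold calibrated on known real images:
# is_generated = aeroblade_score(img, vae) < threshold
```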
AEROBLADE achieves high detection accuracy (average precision of 0.992) on a dataset of images generated by state-of-the-art LDMs, including Stable Diffusion and Midjourney, even without access to the generator's specific AE. The method's performance is comparable to deep learning-based detectors that require extensive training. Additionally, the authors demonstrate that AEROBLADE can be used for qualitative image analysis, such as identifying inpainted regions in real images. |
The authors acknowledge that AEROBLADE's performance is best when the specific AE of the LDM used for generation is known. Future work includes exploring the use of more robust distance metrics and training a classifier on top of the reconstruction errors to enhance robustness against image perturbations. Additionally, they aim to investigate the potential of using reconstruction errors for precise localization of inpainted regions. |
diffusion_model, ldm, analysis, adversarial_attack, interpretability, detection, disinformation, autoencoder |
2404.07993 |
Connecting NeRFs, Images, and Text |
Francesco Ballerini, Pierluigi Zama Ramirez, Roberto Mirabella, Samuele Salti, Luigi Di Stefano |
Neural Radiance Fields (NeRFs) have emerged as a standard framework for
representing 3D scenes and objects, introducing a novel data type for
information exchange and storage. Concurrently, significant progress has been
made in multimodal representation learning for text and image data. This paper
explores a novel research direction that aims to connect the NeRF modality with
other modalities, similar to established methodologies for images and text. To
this end, we propose a simple framework that exploits pre-trained models for
NeRF representations alongside multimodal models for text and image processing.
Our framework learns a bidirectional mapping between NeRF embeddings and those
obtained from corresponding images and text. This mapping unlocks several novel
and useful applications, including NeRF zero-shot classification and NeRF
retrieval from images or text. |
This paper introduces a novel framework connecting Neural Radiance Fields (NeRFs) with other modalities like text and images, enabling applications such as zero-shot NeRF classification and NeRF retrieval from images or text. |
This research is significant as it explores NeRFs as a data format and bridges the gap between NeRFs and existing multimodal representation learning techniques for images and text, opening up new possibilities for 3D scene understanding and interaction. |
The authors propose a framework that leverages pre-trained models like CLIP for multimodal embeddings and NF2Vec for NeRF embeddings. They train two MLPs to learn bidirectional mappings between these embedding spaces, enabling the connection between NeRFs, images, and text. |
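A minimal sketch of one direction of the learned mapping; dimensions, depth, and the cosine regression objective are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingMapper(nn.Module):
    """Small MLP mapping one embedding space to another (e.g. NF2Vec -> CLIP or the reverse)."""

    def __init__(self, in_dim=1024, out_dim=768, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def mapping_loss(mapper, nerf_emb, clip_emb):
    # Regress the mapped NeRF embedding onto the CLIP embedding of a corresponding view/caption.
    return 1.0 - F.cosine_similarity(mapper(nerf_emb), clip_emb, dim=-1).mean()
```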
The framework achieves promising results on tasks like zero-shot NeRF classification, outperforming baselines relying on rendered images. It also demonstrates strong performance in NeRF retrieval from both images and text, highlighting the effectiveness of the learned mappings. Notably, the authors propose an adaptation technique using ControlNet to improve performance on real images when trained solely on synthetic data. |
The paper acknowledges limitations regarding the current focus on synthetic objects due to the NF2Vec encoder's training data and the generation capabilities being restricted by the NF2Vec decoder. Future work aims to extend the framework to real-world scenes and objects, explore larger datasets, and investigate joint training of encoders for a shared latent space. |
nerf, clip, multimodal, retrieval, zero-shot, representation_learning, 3d
2404.05384 |
Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance |
Dazhong Shen, Guanglu Song, Zeyue Xue, Fu-Yun Wang, Yu Liu |
Classifier-Free Guidance (CFG) has been widely used in text-to-image
diffusion models, where the CFG scale is introduced to control the strength of
text guidance on the whole image space. However, we argue that a global CFG
scale results in spatial inconsistency on varying semantic strengths and
suboptimal image quality. To address this problem, we present a novel approach,
Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance
degrees for different semantic units in text-to-image diffusion models.
Specifically, we first design a training-free semantic segmentation method to
partition the latent image into relatively independent semantic regions at each
denoising step. In particular, the cross-attention map in the denoising U-net
backbone is renormalized for assigning each patch to the corresponding token,
while the self-attention map is used to complete the semantic regions. Then, to
balance the amplification of diverse semantic units, we adaptively adjust the
CFG scales across different semantic regions to rescale the text guidance
degrees into a uniform level. Finally, extensive experiments demonstrate the
superiority of S-CFG over the original CFG strategy on various text-to-image
diffusion models, without requiring any extra training cost. Our code is
available at https://github.com/SmilesDZgk/S-CFG. |
This paper introduces Semantic-aware Classifier-Free Guidance (S-CFG), a novel approach for enhancing text-to-image diffusion models by dynamically adjusting guidance degrees for different semantic regions within an image during the denoising process. |
The paper addresses the limitations of the conventional global CFG scale, which often leads to spatial inconsistencies in image quality and varying semantic strengths. By customizing guidance for different semantic units, S-CFG aims to improve the overall image quality and better align the generation with text prompts. |
The authors propose a two-step method: 1) Segmenting the latent image into semantic regions using a training-free method based on cross-attention and self-attention maps from the U-net backbone. 2) Adaptively adjusting CFG scales for each region to unify the classifier score norm, thereby balancing the amplification of various semantic units. |
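A simplified sketch of the per-region rescaling step, assuming the semantic region masks have already been extracted from the attention maps; the normalization rule is an approximation of the paper's:

```python
import torch

def semantic_aware_cfg(eps_uncond, eps_cond, region_masks, base_scale=7.5, ref_region=0, eps=1e-8):
    """Rescale the guidance term per semantic region so every region receives a comparable push.

    eps_uncond / eps_cond: (B, C, H, W) noise predictions.
    region_masks: (R, 1, H, W) float masks in {0, 1} partitioning the latent; region `ref_region`
    (e.g. the foreground) serves as the reference guidance level.
    """
    guidance = eps_cond - eps_uncond
    norms = []
    for m in region_masks:
        area = m.sum().clamp(min=1.0)
        norms.append((guidance * m).pow(2).sum().sqrt() / area.sqrt())  # per-area guidance norm
    ref = norms[ref_region] + eps
    out = torch.zeros_like(guidance)
    for m, n in zip(region_masks, norms):
        out = out + m * guidance * (ref / (n + eps))   # rescale each region to the reference level
    return eps_uncond + base_scale * out
```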
Experiments on different diffusion models (Stable Diffusion v1.5/v2.1, DeepFloyd IF) demonstrate that S-CFG consistently outperforms the original CFG method in terms of FID-30K and CLIP Score. Qualitative results showcase notable improvements in semantic expressiveness, entity portrayal, and fine-grained details. Ablation studies highlight the effectiveness of key components like self-attention-based segmentation completion and foreground region benchmarking. |
The paper acknowledges that the assumption of independence among semantic units might not always hold true. Future work could explore more sophisticated methods for modeling interdependencies between regions. Further investigation into the impact of different benchmark regions and the generalizability of S-CFG to other diffusion models and downstream tasks is also suggested. |
diffusion_model, text-to-image, cfg, semantic_segmentation, attention_map, image_quality |
2310.00426 |
PixArt-$α$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis |
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li |
The most advanced text-to-image (T2I) models require significant training
costs (e.g., millions of GPU hours), seriously hindering the fundamental
innovation for the AIGC community while increasing CO2 emissions. This paper
introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image
generation quality is competitive with state-of-the-art image generators (e.g.,
Imagen, SDXL, and even Midjourney), reaching near-commercial application
standards. Additionally, it supports high-resolution image synthesis up to
1024px resolution with low training cost, as shown in Figure 1 and 2. To
achieve this goal, three core designs are proposed: (1) Training strategy
decomposition: We devise three distinct training steps that separately optimize
pixel dependency, text-image alignment, and image aesthetic quality; (2)
Efficient T2I Transformer: We incorporate cross-attention modules into
Diffusion Transformer (DiT) to inject text conditions and streamline the
computation-intensive class-condition branch; (3) High-informative data: We
emphasize the significance of concept density in text-image pairs and leverage
a large Vision-Language model to auto-label dense pseudo-captions to assist
text-image alignment learning. As a result, PIXART-$\alpha$'s training speed
markedly surpasses existing large-scale T2I models, e.g., PIXART-$\alpha$ only
takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU
days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and reducing 90% CO2
emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training
cost is merely 1%. Extensive experiments demonstrate that PIXART-$\alpha$
excels in image quality, artistry, and semantic control. We hope
PIXART-$\alpha$ will provide new insights to the AIGC community and startups to
accelerate building their own high-quality yet low-cost generative models from
scratch. |
This paper introduces a novel Transformer-based text-to-image diffusion model called PIXART-$\alpha$, which achieves high-quality image generation comparable to state-of-the-art models while significantly reducing training costs and CO2 emissions. |
This work is important because it addresses the high training costs and environmental impact associated with advanced T2I models, hindering innovation and accessibility in the AIGC community. |
The authors propose three core designs: (1) decomposing the training strategy into pixel dependency learning, text-image alignment learning, and high-resolution aesthetic image generation; (2) developing an efficient T2I Transformer based on DiT with cross-attention and a streamlined class-condition branch; and (3) utilizing high-informative data from SAM with dense pseudo-captions generated by LLaVA. |
Key findings include achieving a COCO FID score of 7.32 with only 12% of Stable Diffusion v1.5's training time, outperforming other models in user studies for image quality and alignment, and demonstrating superior performance in compositionality on T2I-CompBench. |
Limitations include challenges in accurately controlling the number of generated targets, handling specific details like human hands, and limited text generation capabilities. Future work involves addressing these limitations and exploring personalized extensions. |
diffusion_model, t2i, transformer, image_generation, efficient_training, llava, sam, controlnet, dreambooth |
2311.13833 |
Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models |
Saman Motamed, Danda Pani Paudel, Luc Van Gool |
Diffusion models have revolutionized generative content creation and
text-to-image (T2I) diffusion models in particular have increased the creative
freedom of users by allowing scene synthesis using natural language. T2I models
excel at synthesizing concepts such as nouns, appearances, and styles. To
enable customized content creation based on a few example images of a concept,
methods such as Textual Inversion and DreamBooth invert the desired concept and
enable synthesizing it in new scenes. However, inverting more general concepts
that go beyond object appearance and style (adjectives and verbs) through
natural language, remains a challenge. Two key characteristics of these
concepts contribute to the limitations of current inversion methods. 1)
Adjectives and verbs are entangled with nouns (subject) and can hinder
appearance-based inversion methods, where the subject appearance leaks into the
concept embedding and 2) describing such concepts often extends beyond single
word embeddings (being frozen in ice, walking on a tightrope, etc.) that
current methods do not handle.
In this study, we introduce Lego, a textual inversion method designed to
invert subject entangled concepts from a few example images. Lego disentangles
concepts from their associated subjects using a simple yet effective Subject
Separation step and employs a Context Loss that guides the inversion of
single/multi-embedding concepts. In a thorough user study, Lego-generated
concepts were preferred over 70% of the time when compared to the baseline.
Additionally, visual question answering using a large language model suggested
Lego-generated concepts are better aligned with the text description of the
concept. |
This paper introduces Lego, a novel textual inversion method designed to disentangle and invert general concepts (adjectives and verbs) from a few example images in text-to-image diffusion models. |
This work addresses the limitations of existing text-to-image models in synthesizing complex concepts that go beyond object appearance. It is significant because it enables greater user control over image generation by allowing the inversion of subject-entangled concepts, such as melting or walking, which were previously challenging for traditional inversion methods. |
Lego builds upon Textual Inversion (TI) by adding two key components: 1) Subject Separation, which uses a dedicated embedding to isolate the subject's appearance from the concept, preventing feature leakage. 2) Contrastive Context Guidance, which utilizes an InfoNCE-based loss to guide the learning of multiple embeddings representing the concept by steering them towards synonyms and away from antonyms of descriptive words. |
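A sketch of a multi-positive InfoNCE-style context loss, with synonyms of the descriptive words as positives and antonyms as negatives; the exact weighting is an assumption:

```python
import torch
import torch.nn.functional as F

def context_loss(concept_emb, synonym_embs, antonym_embs, temperature=0.07):
    """Pull a learned concept embedding toward synonym embeddings and away from antonyms.

    concept_emb: (D,) learnable concept token embedding.
    synonym_embs: (P, D) and antonym_embs: (N, D) frozen text-encoder embeddings.
    """
    anchor = F.normalize(concept_emb, dim=-1)
    pos = F.normalize(synonym_embs, dim=-1)
    neg = F.normalize(antonym_embs, dim=-1)
    logits = torch.cat([pos @ anchor, neg @ anchor]) / temperature   # (P + N,)
    labels = torch.cat([torch.ones(len(pos)), torch.zeros(len(neg))])
    log_probs = F.log_softmax(logits, dim=0)
    # Multi-positive InfoNCE: maximize probability mass on the synonym slots.
    return -(log_probs * labels).sum() / labels.sum()
```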
Lego demonstrates superior performance compared to existing methods, including DreamBooth, Custom Diffusion, and natural language prompts, in accurately representing and synthesizing complex concepts. Human evaluation and Visual Question Answering using a large language model confirm that Lego-generated images better capture and convey the intended concepts. |
The authors acknowledge limitations in inverting concepts that exceed the capabilities of the base diffusion model, such as facial expressions in earlier Stable Diffusion versions. Future work includes exploring the inversion of dynamic concepts from example videos and ensuring ethical application of personalized visual media generation. |
diffusion_model, textual_inversion, concept_learning, image_generation, disentanglement, contrastive_learning |
2309.12314 |
TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance |
Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi Chen, Xinggang Wang, Hongyang Chao, Han Hu |
In this paper, we propose a novel cross-modal distillation method, called
TinyCLIP, for large-scale language-image pre-trained models. The method
introduces two core techniques: affinity mimicking and weight inheritance.
Affinity mimicking explores the interaction between modalities during
distillation, enabling student models to mimic teachers' behavior of learning
cross-modal feature alignment in a visual-linguistic affinity space. Weight
inheritance transmits the pre-trained weights from the teacher models to their
student counterparts to improve distillation efficiency. Moreover, we extend
the method into a multi-stage progressive distillation to mitigate the loss of
informative weights during extreme compression. Comprehensive experiments
demonstrate the efficacy of TinyCLIP, showing that it can reduce the size of
the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot
performance. While aiming for comparable performance, distillation with weight
inheritance can speed up the training by 1.4 - 7.8 $\times$ compared to
training from scratch. Moreover, our TinyCLIP ViT-8M/16, trained on YFCC-15M,
achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet,
surpassing the original CLIP ViT-B/16 by 3.5% while utilizing only 8.9%
parameters. Finally, we demonstrate the good transferability of TinyCLIP in
various downstream tasks. Code and models will be open-sourced at
https://aka.ms/tinyclip. |
This paper introduces TinyCLIP, a novel cross-modal distillation method designed to compress large-scale language-image pre-trained models like CLIP while preserving their zero-shot performance. |
The paper addresses the limitations of large language-image models, such as CLIP, which require significant storage, memory, and computational resources. TinyCLIP offers a solution by compressing these models, making them more practical for real-world applications without compromising performance. |
TinyCLIP utilizes two key techniques: affinity mimicking and weight inheritance. Affinity mimicking enables student models to learn cross-modal feature alignment by mimicking the teacher model's behavior in a visual-linguistic affinity space. Weight inheritance accelerates distillation by transferring pre-trained weights from teacher to student models, either manually or automatically using learnable masks. TinyCLIP employs a multi-stage progressive distillation process for high compression rates, gradually reducing model size while retaining important weights and knowledge. |
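A minimal sketch of an affinity-mimicking loss, matching the student's image-text affinity distributions to the teacher's in both directions; the temperature and exact loss form are approximations of the paper's objective:

```python
import torch
import torch.nn.functional as F

def affinity_mimicking_loss(student_img, student_txt, teacher_img, teacher_txt, tau=0.01):
    """KL between student and teacher softmax affinities over a batch of image-text pairs.

    All inputs: (B, D) L2-normalized embeddings from the respective image/text towers.
    """
    def kl(s_a, s_b, t_a, t_b):
        s = F.log_softmax(s_a @ s_b.t() / tau, dim=-1)   # student affinity (log-probs)
        t = F.softmax(t_a @ t_b.t() / tau, dim=-1)       # teacher affinity (probs)
        return F.kl_div(s, t, reduction="batchmean")
    # Image-to-text and text-to-image directions.
    return kl(student_img, student_txt, teacher_img, teacher_txt) \
         + kl(student_txt, student_img, teacher_txt, teacher_img)
```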
TinyCLIP achieves impressive compression rates while maintaining competitive performance on various benchmarks. For example, TinyCLIP ViT-8M/16 surpasses the original CLIP ViT-B/16 on ImageNet zero-shot top-1 accuracy despite having significantly fewer parameters. Additionally, TinyCLIP demonstrates faster training times compared to training from scratch and shows strong transfer learning capabilities in zero-shot and linear-probe classification tasks. |
The paper acknowledges that further research is needed to enhance cross-modal distillation efficiency for even smaller models. Future work could explore alternative compression techniques or investigate methods to optimize the trade-off between model size, speed, and accuracy. |
clip, distillation, model_compression, affinity_mimicking, weight_inheritance, zero-shot, vision-language |
2403.17804 |
Improving Text-to-Image Consistency via Automatic Prompt Optimization |
Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, Michal Drozdzal |
Impressive advances in text-to-image (T2I) generative models have yielded a
plethora of high performing models which are able to generate aesthetically
appealing, photorealistic images. Despite the progress, these models still
struggle to produce images that are consistent with the input prompt,
oftentimes failing to capture object quantities, relations and attributes
properly. Existing solutions to improve prompt-image consistency suffer from
the following challenges: (1) they oftentimes require model fine-tuning, (2)
they only focus on nearby prompt samples, and (3) they are affected by
unfavorable trade-offs among image quality, representation diversity, and
prompt-image consistency. In this paper, we address these challenges and
introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a
large language model (LLM) to improve prompt-image consistency in T2I models.
Our framework starts from a user prompt and iteratively generates revised
prompts with the goal of maximizing a consistency score. Our extensive
validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost
the initial consistency score by up to 24.9% in terms of DSG score while
preserving the FID and increasing the recall between generated and real data.
Our work paves the way toward building more reliable and robust T2I systems by
harnessing the power of LLMs. |
This paper introduces OPT2I, an optimization-by-prompting framework for text-to-image (T2I) models that improves prompt-image consistency without requiring parameter updates or training data. |
Despite advancements in image quality, T2I models often struggle with accurately representing all elements from the input text prompt in the generated image. This paper is important because it addresses this challenge by leveraging large language models (LLMs) to iteratively refine user prompts and enhance the consistency between the text input and the visual output. |
OPT2I employs an LLM in conjunction with a pre-trained T2I model and a prompt-image consistency metric (either decomposed CLIPScore or Davidsonian Scene Graph score). The LLM receives an initial user prompt and iteratively generates revised prompts, aiming to maximize the consistency score. The framework then uses the best-performing prompts as in-context examples for subsequent iterations, gradually improving the alignment between the generated images and the user's intent. |
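A schematic of the outer optimization-by-prompting loop, with the LLM reviser, the T2I generator, and the consistency scorer passed in as opaque callables. The function names and loop hyperparameters below are illustrative assumptions, not the paper's API.

```python
def optimize_prompt(user_prompt, revise_fn, generate_fn, score_fn,
                    iterations=10, candidates_per_iter=5, context_size=4):
    """Iteratively ask an LLM for revised prompts that maximize a consistency score.

    revise_fn(prompt, history) -> list[str]: LLM proposes revised prompts given
        the user prompt and the best (prompt, score) pairs seen so far.
    generate_fn(prompt) -> image: pre-trained T2I model.
    score_fn(prompt, image) -> float: prompt-image consistency metric
        (e.g. decomposed CLIPScore or DSG score).
    """
    history = [(user_prompt, score_fn(user_prompt, generate_fn(user_prompt)))]
    for _ in range(iterations):
        # Best-scoring prompts so far serve as in-context examples for the LLM.
        context = sorted(history, key=lambda x: -x[1])[:context_size]
        for prompt in revise_fn(user_prompt, context)[:candidates_per_iter]:
            image = generate_fn(prompt)
            history.append((prompt, score_fn(prompt, image)))
    return max(history, key=lambda x: x[1])  # (best_prompt, best_score)

# Toy usage with stub callables (no real models involved).
best_prompt, best_score = optimize_prompt(
    "a red cube to the left of a blue sphere",
    revise_fn=lambda p, ctx: [p + ", every object clearly visible"],
    generate_fn=lambda p: None,
    score_fn=lambda p, img: len(p) / 100.0,
)
print(best_prompt, best_score)
```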
Experimental results demonstrate that OPT2I effectively improves prompt-image consistency across various LLMs, T2I models, and datasets (MSCOCO and PartiPrompts). Notably, OPT2I achieves up to 24.9% improvement in consistency while preserving image quality (FID) and enhancing image diversity (recall). Qualitative analysis suggests that the optimized prompts tend to emphasize initially overlooked visual elements by either providing more detailed descriptions or repositioning them within the prompt. |
The paper acknowledges limitations in existing prompt-image consistency metrics, which might not always accurately capture complex relationships or could be susceptible to adversarial examples. The authors suggest further research on more robust consistency metrics as a direction for future work. Another limitation is the computational cost associated with the iterative optimization process. |
diffusion_model, llm, analysis, text-to-image, interpretability, prompt_engineering, consistency |
2403.07691 |
ORPO: Monolithic Preference Optimization without Reference Model |
Jiwoo Hong, Noah Lee, James Thorne |
While recent preference alignment algorithms for language models have
demonstrated promising results, supervised fine-tuning (SFT) remains imperative
for achieving successful convergence. In this paper, we study the crucial role
of SFT within the context of preference alignment, emphasizing that a minor
penalty for the disfavored generation style is sufficient for
preference-aligned SFT. Building on this foundation, we introduce a
straightforward and innovative reference model-free monolithic odds ratio
preference optimization algorithm, ORPO, eliminating the necessity for an
additional preference alignment phase. We demonstrate, both empirically and
theoretically, that the odds ratio is a sensible choice for contrasting favored
and disfavored styles during SFT across the diverse sizes from 125M to 7B.
Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with
ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art
language models with more than 7B and 13B parameters: achieving up to 12.20% on
$\text{AlpacaEval}_{2.0}$ (Figure 1), 66.19% on IFEval (instruction-level
loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model
checkpoints for Mistral-ORPO-$\alpha$ (7B) and Mistral-ORPO-$\beta$ (7B). |
This paper investigates the crucial role of supervised fine-tuning (SFT) in preference alignment for language models and introduces ORPO, a novel monolithic odds ratio preference optimization algorithm that eliminates the need for a separate preference alignment phase. |
This work is significant as it simplifies preference alignment, improves efficiency, and enhances performance compared to existing multi-stage methods like RLHF and DPO. It sheds light on the understudied role of SFT in preference alignment and offers a more streamlined approach. |
The authors conduct experiments fine-tuning various language models (OPT, Phi-2, Llama-2, Mistral) using ORPO on preference datasets like HH-RLHF and UltraFeedback. They compare ORPO's performance with SFT, RLHF, and DPO across various model sizes and evaluate instruction-following abilities using AlpacaEval and MT-Bench. |
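A hedged sketch of the odds-ratio term on top of the usual SFT loss, operating on length-averaged sequence log-probabilities of the chosen and rejected responses. Tensor names and the λ weighting are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen, logp_rejected, nll_chosen, lam=0.1):
    """ORPO-style objective: SFT loss on the chosen response plus an
    odds-ratio penalty contrasting chosen vs. rejected generations.

    logp_chosen, logp_rejected: per-sequence, length-averaged log-probabilities
        under the model being fine-tuned, shape (B,).
    nll_chosen: standard cross-entropy (SFT) loss on the chosen responses.
    """
    # log odds(y|x) = log p - log(1 - p), computed stably from log p.
    def log_odds(logp):
        return logp - torch.log1p(-torch.exp(logp).clamp(max=1 - 1e-6))

    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -F.logsigmoid(ratio).mean()     # mild penalty on the disfavored style
    return nll_chosen + lam * l_or

# Toy usage with made-up log-probabilities.
logp_w = torch.tensor([-0.8, -1.1])
logp_l = torch.tensor([-1.5, -1.9])
print(orpo_loss(logp_w, logp_l, nll_chosen=torch.tensor(1.0)))
```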
Key findings include that a minor penalty for disfavored generation styles during SFT is sufficient for preference alignment. ORPO outperforms SFT, RLHF, and DPO in reward model win rates and achieves state-of-the-art results on AlpacaEval and MT-Bench, exceeding even larger language models. |
Limitations include the need for comparison with a wider range of preference alignment algorithms and scaling beyond 7B models. Future work involves exploring diverse datasets, analyzing ORPO's impact on pre-trained models, and expanding to other NLP tasks. |
diffusion_model, llm, analysis, preference_alignment, sft, instruction_following, rlhf, dpo |
2405.06535 |
Controllable Image Generation With Composed Parallel Token Prediction |
Jamie Stirling, Noura Al-Moubayed |
Compositional image generation requires models to generalise well in
situations where two or more input concepts do not necessarily appear together
in training (compositional generalisation). Despite recent progress in
compositional image generation via composing continuous sampling processes such
as diffusion and energy-based models, composing discrete generative processes
has remained an open challenge, with the promise of providing improvements in
efficiency, interpretability and simplicity. To this end, we propose a
formulation for controllable conditional generation of images via composing the
log-probability outputs of discrete generative models of the latent space. Our
approach, when applied alongside VQ-VAE and VQ-GAN, achieves state-of-the-art
generation accuracy in three distinct settings (FFHQ, Positional CLEVR and
Relational CLEVR) while attaining competitive Fr\'echet Inception Distance
(FID) scores. Our method attains an average generation accuracy of $80.71\%$
across the studied settings. Our method also outperforms the next-best approach
(ranked by accuracy) in terms of FID in seven out of nine experiments, with an
average FID of $24.23$ (an average improvement of $-9.58$). Furthermore, our
method offers a $2.3\times$ to $12\times$ speedup over comparable continuous
compositional methods on our hardware. We find that our method can generalise
to combinations of input conditions that lie outside the training data (e.g.
more objects per image) in addition to offering an interpretable dimension of
controllability via concept weighting. We further demonstrate that our approach
can be readily applied to an open pre-trained discrete text-to-image model
without any fine-tuning, allowing for fine-grained control of text-to-image
generation. |
This paper presents a novel method for controllable compositional image generation using discrete generative models, achieving state-of-the-art accuracy by composing log-probability outputs from models like VQ-VAE and VQ-GAN. |
This paper is important as it enables the composition of discrete generative processes for image generation, unlike previous methods focused on continuous models. This allows for benefits such as improved efficiency, interpretability, and controllability, which are demonstrated through state-of-the-art results on multiple datasets. |
The authors derive a formulation for composing discrete generation processes by leveraging the product of conditional probabilities of individual concepts, assuming their independence. They apply this to parallel token prediction, generating images by iteratively unmasking discrete image representations conditioned on multiple input attributes using VQ-VAE/VQ-GAN. They further introduce concept weighting to control the relative importance of different conditions. |
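A minimal sketch of the composition rule under the independence assumption described above: per-condition token logits are combined against an unconditional baseline with per-concept weights. Names and shapes are illustrative.

```python
import torch

def compose_logits(logits_uncond, logits_per_concept, weights):
    """Compose discrete token distributions from several conditions.

    Under independence of the conditions c_1..c_K,
        log p(x | c_1..c_K) ∝ log p(x) + sum_k w_k * (log p(x | c_k) - log p(x)).

    logits_uncond:      (..., vocab) unconditional logits over codebook tokens.
    logits_per_concept: list of K tensors, same shape, one per condition.
    weights:            list of K floats (concept weights; negative values
                        act as concept negation).
    """
    composed = logits_uncond.clone()
    for w, logits_c in zip(weights, logits_per_concept):
        composed = composed + w * (logits_c - logits_uncond)
    return composed

# Toy example: two concepts over a 16-token codebook for a 4x4 latent grid.
uncond = torch.randn(4, 4, 16)
cond_a, cond_b = torch.randn(4, 4, 16), torch.randn(4, 4, 16)
tokens = compose_logits(uncond, [cond_a, cond_b], [1.0, 1.5]).argmax(dim=-1)
print(tokens.shape)  # torch.Size([4, 4])
```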
The proposed method achieves state-of-the-art generation accuracy on FFHQ, Positional CLEVR, and Relational CLEVR datasets, surpassing previous methods while maintaining competitive FID scores. It also demonstrates strong generalization ability, including out-of-distribution generation and concept negation, while being significantly faster than comparable continuous compositional methods. |
The authors acknowledge the limitations of assuming independence between input conditions and the increased computational cost compared to non-compositional approaches. Future work could explore methods for handling condition dependencies and optimizing concept weighting. |
diffusion_model, gan, vq-vae, vq-gan, analysis, image_generation, compositionality, discrete_models, parallel_token_prediction, controllable_generation |
2404.03620 |
LCM-Lookahead for Encoder-based Text-to-Image Personalization |
Rinon Gal, Or Lichter, Elad Richardson, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or |
Recent advancements in diffusion models have introduced fast sampling methods
that can effectively produce high-quality images in just one or a few denoising
steps. Interestingly, when these are distilled from existing diffusion models,
they often maintain alignment with the original model, retaining similar
outputs for similar prompts and seeds. These properties present opportunities
to leverage fast sampling methods as a shortcut-mechanism, using them to create
a preview of denoised outputs through which we can backpropagate image-space
losses. In this work, we explore the potential of using such
shortcut-mechanisms to guide the personalization of text-to-image models to
specific facial identities. We focus on encoder-based personalization
approaches, and demonstrate that by tuning them with a lookahead identity loss,
we can achieve higher identity fidelity, without sacrificing layout diversity
or prompt alignment. We further explore the use of attention sharing mechanisms
and consistent data generation for the task of personalization, and find that
encoder training can benefit from both. |
This paper introduces a novel approach called LCM-Lookahead for enhancing encoder-based text-to-image personalization, specifically focusing on improving identity preservation and prompt alignment in generated facial images. |
This paper addresses the limitations of existing encoder-based personalization methods, which often struggle to maintain identity fidelity and prompt alignment, particularly in stylized images, by proposing a novel training scheme and a shortcut mechanism that incorporates image-space losses during training. |
The authors leverage a fast-sampling Latent Consistency Model (LCM) as a 'shortcut' to preview the final denoised image during training. This preview is used to calculate an identity loss, providing a better training signal for identity preservation. They also introduce an attention-sharing mechanism to transfer visual features from the conditioning image and generate a consistent synthetic dataset using SDXL-Turbo to improve prompt alignment. |
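A hedged sketch of the lookahead shortcut: a one-step LCM-style prediction previews the clean latent, the preview is decoded, and an image-space identity loss is backpropagated through it. Every callable below is a placeholder standing in for the real models, not the authors' code.

```python
import torch

def lookahead_identity_loss(z_t, t, prompt_emb, lcm_predict_x0, vae_decode,
                            id_embed, id_reference):
    """Compute an identity loss on a fast 'preview' of the final image.

    z_t:            current noisy latent inside the training loop.
    lcm_predict_x0: one-step LCM-style predictor, (z_t, t, prompt_emb) -> z_0.
    vae_decode:     latent -> image decoder.
    id_embed:       face-recognition embedder, image -> (B, D).
    id_reference:   target identity embedding, (B, D).
    """
    z0_preview = lcm_predict_x0(z_t, t, prompt_emb)    # shortcut denoising
    image = vae_decode(z0_preview)                     # image-space preview
    id_pred = id_embed(image)
    # Cosine-distance identity loss; gradients flow back through the preview.
    return 1.0 - torch.nn.functional.cosine_similarity(id_pred, id_reference).mean()

# Stub usage with identity functions in place of the real models.
z = torch.randn(2, 4, 8, 8, requires_grad=True)
loss = lookahead_identity_loss(
    z, t=500, prompt_emb=None,
    lcm_predict_x0=lambda z_t, t, e: z_t,
    vae_decode=lambda z0: z0,
    id_embed=lambda img: img.flatten(1),
    id_reference=torch.randn(2, 4 * 8 * 8),
)
loss.backward()
print(z.grad.shape)
```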
The proposed method demonstrates superior performance in preserving facial identity and aligning with textual prompts, even in stylized images, compared to existing state-of-the-art encoder-based methods. Both quantitative and qualitative evaluations, including a user study, confirm the effectiveness of their approach. |
The authors acknowledge limitations in handling out-of-domain images and potential biases inherited from the backbone model. Future work involves exploring optimization-based methods on top of their approach to further enhance quality and address potential ethical concerns related to facial editing technology. |
diffusion_model, personalization, face_generation, lcm, analysis, attention_mechanism, image_generation |
2310.04687 |
Improving Adversarial Attacks on Latent Diffusion Model |
Boyang Zheng, Chumeng Liang, Xiaoyu Wu, Yan Liu |
Adversarial attacks on Latent Diffusion Model (LDM), the state-of-the-art
image generative model, have been adopted as effective protection against
malicious finetuning of LDM on unauthorized images. We show that these attacks
add an extra error to the score function of adversarial examples predicted by
LDM. LDM finetuned on these adversarial examples learns to lower the error by a
bias, from which the model is attacked and predicts the score function with
biases.
Based on the dynamics, we propose to improve the adversarial attack on LDM by
Attacking with Consistent score-function Errors (ACE). ACE unifies the pattern
of the extra error added to the predicted score function. This induces the
finetuned LDM to learn the same pattern as a bias in predicting the score
function. We then introduce a well-crafted pattern to improve the attack. Our
method outperforms state-of-the-art methods in adversarial attacks on LDM. |
This paper investigates adversarial attacks on Latent Diffusion Models (LDMs) and proposes a new method, Attacking with Consistent Errors (ACE), to improve their effectiveness in disrupting LDM finetuning for few-shot generation. |
This paper is important because it reveals a novel dynamic of adversarial attacks on LDMs, explaining how these attacks disrupt finetuning, and proposes a more effective attack method (ACE) to protect images from unauthorized copying or malicious use in LDM-based few-shot generation. |
The authors analyze the score-function errors of adversarial examples and identify a "reverse bias" in LDMs finetuned on such examples. They then propose ACE, which manipulates adversarial examples to induce a consistent error pattern, leading to predictable and optimizable sampling biases in the finetuned LDM. Experiments on SDEdit and LoRA pipelines, using CelebA-HQ and WikiArt datasets, demonstrate ACE's superior performance over existing methods. |
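A loose, schematic PGD-style loop in the spirit of attacking with a consistent error pattern: the perturbation is optimized so that the LDM's noise prediction on the protected image is steered toward one fixed target. The callable, the target choice, and the budget are assumptions, not the paper's exact algorithm.

```python
import torch

def ace_style_attack(x, predict_noise, target_pattern, steps=40,
                     eps=8 / 255, alpha=1 / 255):
    """Schematic PGD loop: craft a small perturbation so the LDM's noise
    prediction on the protected image carries one consistent error pattern.

    predict_noise(x, t) -> the LDM's noise/score prediction for image x at
        timestep t (encoding, noising, and the UNet call are assumed to live
        inside this placeholder callable and to be differentiable in x).
    target_pattern: fixed tensor the prediction is steered toward.
    """
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        t = torch.randint(0, 1000, (x.shape[0],))
        pred = predict_noise(x + delta, t)
        # Consistency of the injected error: prediction close to the chosen pattern.
        loss = ((pred - target_pattern) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # gradient descent on the loss
            delta.clamp_(-eps, eps)              # keep the perturbation imperceptible
        delta.grad.zero_()
    return (x + delta).detach()
```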
The proposed ACE method outperforms existing adversarial attacks on LDMs in disrupting both SDEdit and LoRA, two leading few-shot generation pipelines. ACE achieves this by inducing a consistent, optimizable pattern of errors in the finetuned LDM, leading to significant degradation in the quality of generated images. The paper also provides insights into the dynamics of adversarial attacks on LDMs, particularly the role of "reverse bias" in amplifying the impact of adversarial examples during finetuning. |
The authors acknowledge that the optimal target for maximizing the impact of ACE is still an open question and suggest exploring different target options in future work. Additionally, they plan to investigate the generalization of ACE to other LDM-based generative models and explore its robustness against potential defense mechanisms. |
diffusion_model, adversarial_attack, interpretability, ldm, few-shot generation, image_generation |
2404.05961 |
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders |
Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy |
Large decoder-only language models (LLMs) are the state-of-the-art models on
most of today's NLP tasks and benchmarks. Yet, the community is only slowly
adopting these models for text embedding tasks, which require rich
contextualized representations. In this work, we introduce LLM2Vec, a simple
unsupervised approach that can transform any decoder-only LLM into a strong
text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional
attention, 2) masked next token prediction, and 3) unsupervised contrastive
learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3
popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed
models on English word- and sequence-level tasks. We outperform encoder-only
models by a large margin on word-level tasks and reach a new unsupervised
state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB).
Moreover, when combining LLM2Vec with supervised contrastive learning, we
achieve state-of-the-art performance on MTEB among models that train only on
publicly available data. Our strong empirical results and extensive analysis
demonstrate that LLMs can be effectively transformed into universal text
encoders in a parameter-efficient manner without the need for expensive
adaptation or synthetic GPT-4 generated data. |
This paper introduces LLM2Vec, an unsupervised approach for converting large decoder-only language models (LLMs) into effective text encoders by enabling bidirectional attention, incorporating masked next token prediction training, and applying unsupervised contrastive learning. |
This paper is important because it addresses the limitations of causal attention in decoder-only LLMs for text embedding tasks, offering a simple and efficient method to enhance their performance and compete with or surpass encoder-only models. |
The authors develop LLM2Vec, a three-step approach consisting of: 1) enabling bidirectional attention in decoder-only LLMs, 2) adapting the models using masked next token prediction (MNTP) training, and 3) enhancing sequence representation learning through unsupervised contrastive learning with SimCSE. They apply LLM2Vec to three LLMs (Sheared-LLaMA-1.3B, Llama-2-7B-chat, and Mistral-7B-Instruct-v0.2) and evaluate their performance on word- and sequence-level tasks using benchmarks such as CoNLL-2003 and MTEB. |
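A compact sketch of step 3 (unsupervised SimCSE): the same batch is encoded twice with independent dropout masks, and the two views of each sequence are pulled together against in-batch negatives. The pooled embeddings below are random stand-ins for the bidirectionally-patched LLM's outputs.

```python
import torch
import torch.nn.functional as F

def simcse_loss(emb_a, emb_b, tau=0.05):
    """Unsupervised SimCSE loss over two dropout-augmented views.

    emb_a, emb_b: (B, D) pooled sequence embeddings of the *same* batch,
    produced by two forward passes with independent dropout masks.
    """
    emb_a, emb_b = F.normalize(emb_a, dim=-1), F.normalize(emb_b, dim=-1)
    sim = emb_a @ emb_b.t() / tau                 # (B, B) similarity matrix
    labels = torch.arange(sim.size(0))            # positives sit on the diagonal
    return F.cross_entropy(sim, labels)

# Toy usage: pretend pooled embeddings from two stochastic forward passes.
view_a, view_b = torch.randn(16, 4096), torch.randn(16, 4096)
print(simcse_loss(view_a, view_b))
```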
LLM2Vec-transformed models demonstrate substantial improvements on both word- and sequence-level tasks. Notably, they outperform strong encoder-only baselines on word-level tasks and achieve state-of-the-art results among unsupervised models on the MTEB benchmark. The authors also find that Mistral models inherently possess a degree of bidirectional attention, contributing to their strong performance. |
The authors acknowledge limitations regarding the computational demands of large LLMs and potential data contamination from pre-training. Future work could focus on mitigating these limitations by exploring techniques for efficient training and inference of large models, and evaluating on novel benchmarks to address data contamination concerns. Additionally, extending LLM2Vec to other languages beyond English presents a promising research direction. |
diffusion_model, llm, analysis, text_embedding, contrastive_learning |
2310.07204 |
State of the Art on Diffusion Models for Visual Computing |
Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T. Barron, Amit H. Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, C. Karen Liu, Lingjie Liu, Ben Mildenhall, Matthias Nießner, Björn Ommer, Christian Theobalt, Peter Wonka, Gordon Wetzstein |
The field of visual computing is rapidly advancing due to the emergence of
generative artificial intelligence (AI), which unlocks unprecedented
capabilities for the generation, editing, and reconstruction of images, videos,
and 3D scenes. In these domains, diffusion models are the generative AI
architecture of choice. Within the last year alone, the literature on
diffusion-based tools and applications has seen exponential growth and relevant
papers are published across the computer graphics, computer vision, and AI
communities with new works appearing daily on arXiv. This rapid growth of the
field makes it difficult to keep up with all recent developments. The goal of
this state-of-the-art report (STAR) is to introduce the basic mathematical
concepts of diffusion models, implementation details and design choices of the
popular Stable Diffusion model, as well as overview important aspects of these
generative AI tools, including personalization, conditioning, inversion, among
others. Moreover, we give a comprehensive overview of the rapidly growing
literature on diffusion-based generation and editing, categorized by the type
of generated medium, including 2D images, videos, 3D objects, locomotion, and
4D scenes. Finally, we discuss available datasets, metrics, open challenges,
and social implications. This STAR provides an intuitive starting point to
explore this exciting topic for researchers, artists, and practitioners alike. |
This state-of-the-art report provides a comprehensive overview of diffusion models for visual computing, focusing on their applications in generating and editing images, videos, 3D objects, and 4D scenes. |
Diffusion models have revolutionized visual computing by enabling unprecedented capabilities for content creation and editing. This report is crucial for researchers, artists, and practitioners to understand the fundamentals, advancements, and open challenges in this rapidly evolving field. |
The report presents the mathematical foundations of diffusion models, discusses practical implementations using the Stable Diffusion model, and explores conditioning, guidance, inversion, editing, and customization techniques. It then categorizes and summarizes recent advancements in diffusion models for video, 3D, and 4D content generation, highlighting key methodologies and applications. |
The report highlights the significant advancements in diffusion models, showcasing their ability to generate realistic and creative content across various modalities. Key findings include the effectiveness of latent diffusion models, score distillation sampling for 3D generation, and the emergence of 4D spatio-temporal diffusion for dynamic scenes. |
The report outlines open challenges including: the need for better evaluation metrics, the scarcity of high-quality training data for 3D, video, and 4D content, the computational inefficiency of diffusion models, and the need for improved controllability and user interfaces. Future work may focus on addressing these challenges, exploring new applications, and improving robustness, reproducibility, and ethical considerations. |
diffusion_model, gan, analysis, literature_review, 2d, 3d, motion, video, 4d, text-to-image, text-to-video |
2401.06209 |
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs |
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie |
Is vision good enough for language? Recent advancements in multimodal models
primarily stem from the powerful reasoning abilities of large language models
(LLMs). However, the visual component typically depends only on the
instance-level contrastive language-image pre-training (CLIP). Our research
reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still
exhibit systematic shortcomings. To understand the roots of these errors, we
explore the gap between the visual embedding space of CLIP and vision-only
self-supervised learning. We identify ''CLIP-blind pairs'' - images that CLIP
perceives as similar despite their clear visual differences. With these pairs,
we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes
areas where state-of-the-art systems, including GPT-4V, struggle with
straightforward questions across nine basic visual patterns, often providing
incorrect answers and hallucinated explanations. We further evaluate various
CLIP-based vision-and-language models and found a notable correlation between
visual patterns that challenge CLIP models and those problematic for multimodal
LLMs. As an initial effort to address these issues, we propose a Mixture of
Features (MoF) approach, demonstrating that integrating vision self-supervised
learning features with MLLMs can significantly enhance their visual grounding
capabilities. Together, our research suggests visual representation learning
remains an open challenge, and accurate visual grounding is crucial for future
successful multimodal systems. |
This paper explores the limitations of visual capabilities in Multimodal Large Language Models (MLLMs) that stem from the visual encoder, particularly CLIP, and proposes a Mixture of Features (MoF) approach to enhance visual grounding by integrating features from CLIP and vision-only self-supervised learning models. |
This paper is important because it sheds light on a crucial weakness in current state-of-the-art MLLMs, despite their impressive language capabilities, and proposes a potential solution to improve their visual grounding for more robust and reliable performance. |
The authors first identify "CLIP-blind pairs" - images perceived as similar by CLIP despite visual differences - and construct the Multimodal Visual Patterns (MMVP) benchmark to evaluate MLLMs' visual grounding. Then, they analyze systematic visual patterns in CLIP-blind pairs and propose MoF, experimenting with Additive MoF (linearly mixing features) and Interleaved MoF (spatially mixing visual tokens) to enhance visual grounding in MLLMs. |
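A minimal sketch of the Interleaved MoF idea: spatially aligned visual tokens from CLIP and a vision-only SSL encoder (e.g. DINOv2) are interleaved before entering the LLM. The encoders and projections are assumed to have already produced same-width token sequences.

```python
import torch

def interleave_visual_tokens(clip_tokens, ssl_tokens):
    """Spatially interleave two aligned visual token sequences.

    clip_tokens, ssl_tokens: (B, N, D) tokens from the two vision encoders,
    already projected to the LLM's embedding width and spatially aligned.
    Returns (B, 2N, D) with tokens alternating CLIP, SSL, CLIP, SSL, ...
    """
    B, N, D = clip_tokens.shape
    mixed = torch.stack([clip_tokens, ssl_tokens], dim=2)  # (B, N, 2, D)
    return mixed.reshape(B, 2 * N, D)

# Toy usage.
clip_t, ssl_t = torch.randn(1, 576, 4096), torch.randn(1, 576, 4096)
print(interleave_visual_tokens(clip_t, ssl_t).shape)  # torch.Size([1, 1152, 4096])
```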
Key findings include: (1) MLLMs, even the most advanced ones, struggle with seemingly simple visual questions in the MMVP benchmark. (2) Scaling up CLIP's training data and model size alone doesn't resolve challenges related to certain visual patterns. (3) A strong correlation exists between CLIP's failure patterns and MLLMs' visual incapability. (4) Integrating vision-only SSL features using MoF, particularly Interleaved MoF, significantly improves MLLMs' visual grounding without compromising instruction-following abilities. |
The authors acknowledge that MoF is an initial step and more sophisticated approaches are needed to fully address the visual limitations. Future work includes exploring advanced fusion techniques beyond linear and spatial mixing, designing more comprehensive benchmarks to evaluate diverse visual patterns and grounding abilities, and investigating new visual representation learning algorithms that better capture fine-grained visual details and relationships. |
diffusion_model, llm, analysis, benchmark, visual grounding, clip, self-supervised learning, multimodal, vision-and-language, representation learning |
2403.13187 |
Evolutionary Optimization of Model Merging Recipes |
Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, David Ha |
We present a novel application of evolutionary algorithms to automate the
creation of powerful foundation models. While model merging has emerged as a
promising approach for LLM development due to its cost-effectiveness, it
currently relies on human intuition and domain knowledge, limiting its
potential. Here, we propose an evolutionary approach that overcomes this
limitation by automatically discovering effective combinations of diverse
open-source models, harnessing their collective intelligence without requiring
extensive additional training data or compute. Our approach operates in both
parameter space and data flow space, allowing for optimization beyond just the
weights of the individual models. This approach even facilitates cross-domain
merging, generating models like a Japanese LLM with Math reasoning
capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art
performance on a variety of established Japanese LLM benchmarks, even
surpassing models with significantly more parameters, despite not being
explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM
generated through our approach demonstrates its effectiveness in describing
Japanese culture-specific content, outperforming previous Japanese VLMs. This
work not only contributes new state-of-the-art models back to the open-source
community, but also introduces a new paradigm for automated model composition,
paving the way for exploring alternative, efficient approaches to foundation
model development. |
This paper introduces a novel approach, Evolutionary Model Merge, which utilizes evolutionary algorithms to automate the merging of open-source foundation models, enabling the creation of new models with combined capabilities without the need for extensive training. |
This paper is important because it presents a more efficient and accessible method for developing foundation models, particularly for specialized domains and non-English languages, by leveraging the collective intelligence of existing open-source models. |
The authors employ evolutionary algorithms to optimize model merging in two spaces: parameter space (PS) for combining model weights and data flow space (DFS) for optimizing token inference paths through model layers. They demonstrate their method by evolving a Japanese LLM with Math reasoning capabilities and a culturally-aware Japanese VLM. |
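A toy stand-in for the parameter-space half of the search: simple evolutionary hill-climbing over per-tensor interpolation weights between two source models, scored by an arbitrary fitness callable. The paper uses more sophisticated evolutionary optimizers and merge operators; this is only a sketch of the search structure.

```python
import random
import torch

def evolve_merge(model_a_state, model_b_state, fitness_fn,
                 generations=20, population=8, sigma=0.1):
    """Search per-tensor interpolation weights w in [0, 1] such that
    merged[k] = (1 - w_k) * A[k] + w_k * B[k] maximizes fitness_fn(merged)."""
    keys = list(model_a_state.keys())

    def merge(ws):
        return {k: (1 - w) * model_a_state[k] + w * model_b_state[k]
                for k, w in zip(keys, ws)}

    best_ws = [0.5] * len(keys)
    best_fit = fitness_fn(merge(best_ws))
    for _ in range(generations):
        for _ in range(population):
            # Gaussian mutation of the current best merging recipe.
            cand = [min(1.0, max(0.0, w + random.gauss(0, sigma))) for w in best_ws]
            fit = fitness_fn(merge(cand))
            if fit > best_fit:
                best_ws, best_fit = cand, fit
    return merge(best_ws), best_ws, best_fit

# Toy usage on two random "models" with a dummy fitness function.
a = {"layer1": torch.randn(4, 4), "layer2": torch.randn(4)}
b = {k: torch.randn_like(v) for k, v in a.items()}
merged, ws, fit = evolve_merge(a, b, fitness_fn=lambda m: -m["layer1"].abs().mean().item())
print(ws, fit)
```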
The evolved Japanese LLM achieves state-of-the-art performance on Japanese LLM benchmarks, surpassing some 70B parameter models despite having only 7B parameters. Similarly, the evolved Japanese VLM excels in handling culturally-specific content, outperforming existing Japanese VLMs on a newly created benchmark. |
Limitations include potential for illogical outputs and lack of instruction fine-tuning. Future work involves applying the method to image generation, evolving source model selection, and developing a self-improving swarm of models. |
diffusion_model, llm, analysis, evolutionary_algorithm, model_merging, japanese, multi-modal, vlm |
2311.17082 |
DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling |
Linqi Zhou, Andy Shih, Chenlin Meng, Stefano Ermon |
Recent methods such as Score Distillation Sampling (SDS) and Variational
Score Distillation (VSD) using 2D diffusion models for text-to-3D generation
have demonstrated impressive generation quality. However, the long generation
time of such algorithms significantly degrades the user experience. To tackle
this problem, we propose DreamPropeller, a drop-in acceleration algorithm that
can be wrapped around any existing text-to-3D generation pipeline based on
score distillation. Our framework generalizes Picard iterations, a classical
algorithm for parallel sampling an ODE path, and can account for non-ODE paths
such as momentum-based gradient updates and changes in dimensions during the
optimization process as in many cases of 3D generation. We show that our
algorithm trades parallel compute for wallclock time and empirically achieves
up to 4.7x speedup with a negligible drop in generation quality for all tested
frameworks. |
This paper introduces DreamPropeller, a method for accelerating text-to-3D generation using score distillation by generalizing Picard iterations to handle complex computation graphs and leveraging parallel compute. |
The paper addresses the slow generation time of existing text-to-3D methods that utilize score distillation, which hinders their practical use despite high generation quality. |
The authors generalize Picard iterations, a parallel ODE solving technique, to handle the intricacies of 3D generation, such as momentum-based updates and varying dimensionality, and apply this generalized framework to accelerate existing score distillation methods. |
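A small numerical sketch of vanilla Picard iteration for parallel ODE solving, the classical scheme the framework generalizes. The toy ODE and plain Euler drift below are illustrative, not the paper's optimizer-aware, dimension-changing variant.

```python
import numpy as np

def picard_parallel_solve(f, x0, ts, iters=50, tol=1e-6):
    """Solve x'(t) = f(x, t) over the grid ts by fixed-point (Picard) iteration:
        x_i <- x_0 + sum_{j < i} f(x_j, t_j) * (t_{j+1} - t_j),
    where all f(x_j, t_j) in the window can be evaluated in parallel per sweep."""
    n = len(ts)
    xs = np.tile(x0, (n, 1)).astype(float)
    dts = np.diff(ts)
    for _ in range(iters):
        drifts = np.stack([f(xs[j], ts[j]) for j in range(n - 1)])  # parallelizable
        new_xs = xs.copy()
        new_xs[1:] = x0 + np.cumsum(drifts * dts[:, None], axis=0)
        converged = np.max(np.abs(new_xs - xs)) < tol
        xs = new_xs
        if converged:   # fixed point matches the sequential Euler path
            break
    return xs

# Toy ODE x' = -x: the fixed point equals sequential Euler integration.
ts = np.linspace(0.0, 1.0, 11)
xs = picard_parallel_solve(lambda x, t: -x, x0=np.array([1.0]), ts=ts)
print(xs[-1])  # ~ 0.9**10 ≈ 0.3487
```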
DreamPropeller achieves up to 4.7x speedup across various 3D representations and score distillation techniques, including NeRF, DMTet, SDF, and 3D Gaussian Splatting, with negligible drop in generation quality measured by CLIP R-Precision and FID. |
The paper acknowledges limitations in perfectly matching baseline quality due to the fixed-point error and suggests exploring alternative distance metrics or adaptive optimization strategies for further improvement. Future work may also involve investigating the application of DreamPropeller to other domains beyond 3D generation. |
diffusion_model, 3d, acceleration, score_distillation, nerf, gaussian_splatting, sds, vsd |
2405.01008 |
On Mechanistic Knowledge Localization in Text-to-Image Generative Models |
Samyadeep Basu, Keivan Rezaei, Priyatham Kattakinda, Ryan Rossi, Cherry Zhao, Vlad Morariu, Varun Manjunatha, Soheil Feizi |
Identifying layers within text-to-image models which control visual
attributes can facilitate efficient model editing through closed-form updates.
Recent work leveraging causal tracing shows that early Stable-Diffusion
variants confine knowledge primarily to the first layer of the CLIP
text-encoder, while it diffuses throughout the UNet. Extending this framework,
we observe that for recent models (e.g., SD-XL, DeepFloyd), causal tracing
fails in pinpointing localized knowledge, highlighting challenges in model
editing. To address this issue, we introduce the concept of Mechanistic
Localization in text-to-image models, where knowledge about various visual
attributes (e.g., "style", "objects", "facts") can be mechanistically localized
to a small fraction of layers in the UNet, thus facilitating efficient model
editing. We localize knowledge using our method LocoGen which measures the
direct effect of intermediate layers to output generation by performing
interventions in the cross-attention layers of the UNet. We then employ
LocoEdit, a fast closed-form editing method across popular open-source
text-to-image models (including the latest SD-XL) and explore the possibilities
of neuron-level model editing. Using Mechanistic Localization, our work offers
a better view of successes and failures in localization-based text-to-image
model editing. Code will be available at
https://github.com/samyadeepbasu/LocoGen. |
This paper investigates the localization of knowledge within text-to-image generative models, particularly focusing on identifying specific layers responsible for controlling visual attributes like "style", "objects", and "facts". |
This work is crucial as it offers a deeper understanding of how knowledge is represented within these complex models, facilitating efficient model editing techniques for tasks like removing specific styles, modifying objects, or updating factual information. |
The authors first analyze the effectiveness of causal tracing in localizing knowledge across various text-to-image models, including SD-XL and DeepFloyd. They then introduce LocoGen, a novel method to pinpoint control regions for visual attributes by intervening in the cross-attention layers of the UNet. Subsequently, they employ LocoEdit, a closed-form editing method, to manipulate the identified locations and evaluate its effectiveness. |
The research demonstrates that LocoGen successfully identifies unique locations controlling visual attributes across different text-to-image models. Moreover, LocoEdit effectively implements edits at these locations for most models, except DeepFloyd, which exhibits limitations due to its bi-directional attention mechanism in the T5 text encoder. Notably, the study reveals that knowledge about specific styles can be localized to even a small subset of neurons, highlighting the potential for neuron-level model editing. |
The authors acknowledge limitations in applying closed-form edits to DeepFloyd and suggest exploring fast editing methods for models utilizing bi-directional attention as future work. Further research directions include investigating the generalizability of neuron-level editing beyond "style" to other attributes like "objects" and "facts". |
diffusion_model, analysis, interpretability, text-to-image, model_editing |
2402.10491 |
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation |
Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen |
Diffusion models have proven to be highly effective in image and video
generation; however, they still face composition challenges when generating
images of varying sizes due to single-scale training data. Adapting large
pre-trained diffusion models for higher resolution demands substantial
computational and optimization resources, yet achieving a generation capability
comparable to low-resolution models remains elusive. This paper proposes a
novel self-cascade diffusion model that leverages the rich knowledge gained
from a well-trained low-resolution model for rapid adaptation to
higher-resolution image and video generation, employing either tuning-free or
cheap upsampler tuning paradigms. Integrating a sequence of multi-scale
upsampler modules, the self-cascade diffusion model can efficiently adapt to a
higher resolution, preserving the original composition and generation
capabilities. We further propose a pivot-guided noise re-schedule strategy to
speed up the inference process and improve local structural details. Compared
to full fine-tuning, our approach achieves a 5X training speed-up and requires
only an additional 0.002M tuning parameters. Extensive experiments demonstrate
that our approach can quickly adapt to higher resolution image and video
synthesis by fine-tuning for just 10k steps, with virtually no additional
inference time. |
This paper introduces a novel self-cascade diffusion model that leverages a pre-trained low-resolution model to efficiently adapt to higher-resolution image and video generation tasks. |
This paper addresses the challenge of computationally expensive fine-tuning required to adapt pre-trained diffusion models for higher-resolution generation. It proposes an efficient method that achieves significant training speed-up while maintaining generation quality, enabling wider application of diffusion models in high-resolution settings. |
The authors propose two versions of their self-cascade diffusion model: a tuning-free version that utilizes a pivot-guided noise re-scheduling strategy to leverage the low-resolution model's knowledge, and a tuning version that incorporates learnable time-aware feature upsampler modules for improved detail with minimal fine-tuning on a small high-resolution dataset. They evaluate their method on both image and video generation tasks, comparing it to full fine-tuning and other adaptation techniques. |
The self-cascade diffusion model demonstrates significant training speed-up (5x) compared to full fine-tuning, requiring minimal additional trainable parameters (0.002M) and negligible extra inference time. Experiments on image and video generation tasks show that it achieves state-of-the-art performance in both tuning-free and tuning settings, effectively adapting to higher resolutions while preserving the original model's generation capabilities and outperforming competing methods in terms of quality and efficiency. |
The authors acknowledge that the limited capacity of the lightweight upsampler modules may pose limitations, especially for very large scale gaps. Future work may involve exploring the trade-off between adaptation efficiency and generalization ability, potentially by incorporating more sophisticated upsampling mechanisms or investigating alternative methods for knowledge transfer from the low-resolution model. |
diffusion_model, image_generation, video_generation, high_resolution, adaptation, efficiency, self-cascade |
2312.02133 |
Style Aligned Image Generation via Shared Attention |
Amir Hertz, Andrey Voynov, Shlomi Fruchter, Daniel Cohen-Or |
Large-scale Text-to-Image (T2I) models have rapidly gained prominence across
creative fields, generating visually compelling outputs from textual prompts.
However, controlling these models to ensure consistent style remains
challenging, with existing methods necessitating fine-tuning and manual
intervention to disentangle content and style. In this paper, we introduce
StyleAligned, a novel technique designed to establish style alignment among a
series of generated images. By employing minimal `attention sharing' during the
diffusion process, our method maintains style consistency across images within
T2I models. This approach allows for the creation of style-consistent images
using a reference style through a straightforward inversion operation. Our
method's evaluation across diverse styles and text prompts demonstrates
high-quality synthesis and fidelity, underscoring its efficacy in achieving
consistent style across various inputs. |
This paper introduces StyleAligned, a method for generating sets of images with consistent styles from text prompts using pre-trained text-to-image diffusion models. |
This paper is important because it offers a new approach to controlling the style of generated images in text-to-image synthesis, which has been a challenging problem. Existing methods often require expensive fine-tuning or struggle to maintain consistency across different prompts, while StyleAligned achieves this without any training or optimization. |
StyleAligned works by introducing a shared attention mechanism into the diffusion process. When generating a set of images, each image attends to the features of a reference image, typically the first in the batch, during specific layers in the diffusion process. This attention sharing is further enhanced by using Adaptive Instance Normalization (AdaIN) to balance attention flow and improve style alignment. |
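A condensed sketch of the shared-attention step: the generated image's queries attend over the union of its own and the reference image's keys/values, with the target queries/keys AdaIN-normalized toward the reference statistics. Shapes, the normalization axis, and names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def adain(x, ref, eps=1e-5):
    """Shift/scale x's token statistics to match the reference's."""
    mu_x, std_x = x.mean(dim=1, keepdim=True), x.std(dim=1, keepdim=True) + eps
    mu_r, std_r = ref.mean(dim=1, keepdim=True), ref.std(dim=1, keepdim=True) + eps
    return (x - mu_x) / std_x * std_r + mu_r

def shared_attention(q_tgt, k_tgt, v_tgt, q_ref, k_ref, v_ref):
    """Target tokens attend to the union of target and reference keys/values.

    q_tgt, k_tgt, v_tgt: (B, N, D) self-attention inputs of the generated image.
    q_ref, k_ref, v_ref: (B, N, D) the same quantities from the reference image.
    """
    q = adain(q_tgt, q_ref)                               # align query statistics
    k = torch.cat([adain(k_tgt, k_ref), k_ref], dim=1)    # shared keys
    v = torch.cat([v_tgt, v_ref], dim=1)                  # shared values
    attn = F.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

# Toy usage.
B, N, D = 1, 64, 320
out = shared_attention(*(torch.randn(B, N, D) for _ in range(6)))
print(out.shape)  # torch.Size([1, 64, 320])
```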
The paper shows that StyleAligned outperforms existing T2I personalization methods, such as StyleDrop and DreamBooth, in terms of style consistency while maintaining good alignment with text prompts. Notably, it generates more coherent sets of images with shared stylistic elements, as evidenced by both qualitative examples and quantitative metrics using CLIP and DINO embeddings. Furthermore, the method is flexible and can be integrated with other diffusion-based techniques like ControlNet and MultiDiffusion, demonstrating its potential for various applications. |
The paper acknowledges limitations in controlling the degree of shape and appearance similarity between generated images and highlights the need for improved diffusion inversion techniques. Future work could focus on these aspects and explore the use of StyleAligned for creating large, style-aligned datasets to train novel text-to-image models. |
diffusion_model, style_transfer, image_generation, attention_mechanism, text-to-image, zero-shot, consistency, adain, controlnet, multidiffusion |
2310.05916 |
Interpreting CLIP's Image Representation via Text-Based Decomposition |
Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt |
We investigate the CLIP image encoder by analyzing how individual model
components affect the final representation. We decompose the image
representation as a sum across individual image patches, model layers, and
attention heads, and use CLIP's text representation to interpret the summands.
Interpreting the attention heads, we characterize each head's role by
automatically finding text representations that span its output space, which
reveals property-specific roles for many heads (e.g. location or shape). Next,
interpreting the image patches, we uncover an emergent spatial localization
within CLIP. Finally, we use this understanding to remove spurious features
from CLIP and to create a strong zero-shot image segmenter. Our results
indicate that a scalable understanding of transformer models is attainable and
can be used to repair and improve models. |
This paper investigates the internal structure of CLIP's image encoder, particularly the ViT-based variant, to understand how individual components like layers, attention heads, and image patches contribute to the final image representation. |
This work is important because it provides a deeper understanding of how CLIP encodes information, which can be used to improve its performance on downstream tasks. By decomposing CLIP's representations and linking them to specific components and image regions, the authors offer insights into the model's decision-making process and pave the way for more interpretable and robust vision-language models. |
The authors decompose CLIP's image representation into contributions from individual layers, attention heads, and image tokens. They leverage the residual structure of ViT to analyze direct contributions and develop an algorithm called TextSpan to associate text descriptions with the latent directions of each attention head. By analyzing these text descriptions and visualizing the contributions of different image regions, they uncover specific roles for many attention heads and reveal an emergent spatial localization within CLIP. |
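A hedged sketch of a TextSpan-style greedy selection: text directions that explain the most variance of one head's outputs are picked one at a time, and each chosen direction is projected out before the next pick. The exact scoring and candidate-pool construction in the paper may differ.

```python
import torch

def textspan(head_outputs, text_embeds, m=5):
    """Greedily pick m text directions that best span a head's output space.

    head_outputs: (S, D) contributions of one attention head over S images.
    text_embeds:  (T, D) CLIP text embeddings of a candidate description pool.
    Returns the indices of the selected descriptions.
    """
    A = head_outputs - head_outputs.mean(dim=0, keepdim=True)
    texts = torch.nn.functional.normalize(text_embeds, dim=-1)
    chosen = []
    for _ in range(m):
        scores = (A @ texts.t()).pow(2).sum(dim=0)          # variance explained
        j = int(scores.argmax())
        chosen.append(j)
        d = texts[j] / (texts[j].norm() + 1e-8)
        A = A - (A @ d)[:, None] * d[None, :]               # project out direction
        texts = texts - (texts @ d)[:, None] * d[None, :]   # keep the pool orthogonal
    return chosen

# Toy usage with random stand-ins for head contributions and text embeddings.
heads = torch.randn(100, 512)     # one head's contributions over 100 images
pool = torch.randn(1000, 512)     # candidate text description embeddings
print(textspan(heads, pool, m=3))
```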
The paper demonstrates that the last few attention layers in CLIP-ViT have the most significant direct effect on the image representation. The authors also find that many attention heads specialize in capturing specific image properties like shape, color, or location. They leverage this finding to reduce spurious correlations in downstream classification tasks and achieve state-of-the-art performance on zero-shot semantic image segmentation. |
The authors acknowledge limitations in addressing indirect effects between layers and the lack of clear roles for all attention heads. Future work could explore these indirect effects, analyze the collaborative roles of multiple heads, and extend the analysis to other CLIP architectures like ResNet. |
diffusion_model, llm, analysis, interpretability, attention, clip, vit, zero-shot learning, segmentation, spurious correlations |
2310.12103 |
Quality Diversity through Human Feedback |
Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, Joel Lehman |
Reinforcement Learning from Human Feedback (RLHF) has shown potential in
qualitative tasks where clear objectives are lacking. However, its
effectiveness is not fully realized when it is conceptualized merely as a tool
to optimize average human preferences, especially in generative tasks that
demand diverse model responses. Meanwhile, Quality Diversity (QD) algorithms
excel at identifying diverse and high-quality solutions but often rely on
manually crafted diversity metrics. This paper introduces Quality Diversity
through Human Feedback (QDHF), a novel approach integrating human feedback into
the QD framework. QDHF infers diversity metrics from human judgments of
similarity among solutions, thereby enhancing the applicability and
effectiveness of QD algorithms. Our empirical studies show that QDHF
significantly outperforms state-of-the-art methods in automatic diversity
discovery and matches the efficacy of using manually crafted metrics for QD on
standard benchmarks in robotics and reinforcement learning. Notably, in a
latent space illumination task, QDHF substantially enhances the diversity in
images generated by a diffusion model and was more favorably received in user
studies. We conclude by analyzing QDHF's scalability and the quality of its
derived diversity metrics, emphasizing its potential to improve exploration and
diversity in complex, open-ended optimization tasks. Source code is available
on GitHub: https://github.com/ld-ing/qdhf. |
This paper introduces Quality Diversity through Human Feedback (QDHF), a novel approach that integrates human feedback into Quality Diversity (QD) algorithms to automatically learn diversity metrics for optimizing the generation of diverse and high-quality solutions. |
This paper is important because it addresses the limitations of existing QD algorithms that rely on manually crafted diversity metrics, which restricts their applicability in complex and open-ended tasks where defining such metrics is challenging. QDHF offers a more flexible and adaptable approach by leveraging human feedback to learn diversity metrics, potentially leading to improved exploration and diversity in various domains. |
The authors propose an implementation of QDHF using latent space projection and contrastive learning. They first train a latent projection model to map solutions into a latent space, where each dimension represents a learned diversity metric. Then, they use human judgments on the similarity of solutions to fine-tune the latent projection model via contrastive learning, ensuring the learned diversity metrics align with human perception. They evaluate QDHF on three benchmark tasks: robotic arm control, maze navigation, and latent space illumination for image generation, comparing it against existing QD algorithms with unsupervised diversity discovery and ground truth metrics. |
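A small sketch of the diversity-metric learning step: a linear latent projection is trained with a triplet-style contrastive loss derived from human judgments of which of two solutions is closer to an anchor. The module and feature dimensions are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentProjection(nn.Module):
    """Project solution features into a low-dimensional space whose axes act
    as learned diversity metrics for the QD archive."""
    def __init__(self, feat_dim, n_measures=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, n_measures)

    def forward(self, x):
        return self.proj(x)

def human_triplet_loss(model, anchor, preferred, other, margin=1.0):
    """Humans judged `preferred` to be more similar to `anchor` than `other`;
    enforce the same ordering in the learned diversity space."""
    za, zp, zo = model(anchor), model(preferred), model(other)
    return F.triplet_margin_loss(za, zp, zo, margin=margin)

# Toy training step on random features standing in for solution descriptors.
model = LatentProjection(feat_dim=128)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
a, p, o = (torch.randn(32, 128) for _ in range(3))
loss = human_triplet_loss(model, a, p, o)
loss.backward()
opt.step()
print(float(loss))
```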
Experimental results demonstrate that QDHF significantly outperforms unsupervised diversity discovery methods in QD, achieving both higher quality and diversity in the generated solutions. Notably, in the latent space illumination task, QDHF successfully generates more diverse images while maintaining high quality compared to baseline methods. User studies further confirm that QDHF-generated images are perceived as more diverse and preferred by humans. |
The authors acknowledge that the performance of QDHF relies on the accuracy of the learned latent projection model and the quality of human feedback. They suggest future work focusing on improving the generalization of the preference model used to collect human feedback, exploring strategies for efficient and diverse data collection, and applying QDHF to more complex and open-ended tasks in robotics, reinforcement learning, and generative modeling. |
diffusion_model, analysis, 3d, motion, interpretability, quality_diversity, human_feedback, contrastive_learning, latent_space, image_generation |
2312.07409 |
DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing |
Kaiwen Zhang, Yifan Zhou, Xudong Xu, Xingang Pan, Bo Dai |
Diffusion models have achieved remarkable image generation quality surpassing
previous generative models. However, a notable limitation of diffusion models,
in comparison to GANs, is their difficulty in smoothly interpolating between
two image samples, due to their highly unstructured latent space. Such a smooth
interpolation is intriguing as it naturally serves as a solution for the image
morphing task with many applications. In this work, we present DiffMorpher, the
first approach enabling smooth and natural image interpolation using diffusion
models. Our key idea is to capture the semantics of the two images by fitting
two LoRAs to them respectively, and interpolate between both the LoRA
parameters and the latent noises to ensure a smooth semantic transition, where
correspondence automatically emerges without the need for annotation. In
addition, we propose an attention interpolation and injection technique and a
new sampling schedule to further enhance the smoothness between consecutive
images. Extensive experiments demonstrate that DiffMorpher achieves starkly
better image morphing effects than previous methods across a variety of object
categories, bridging a critical functional gap that distinguished diffusion
models from GANs. |
This paper introduces DiffMorpher, a novel approach leveraging pre-trained diffusion models like Stable Diffusion to generate smooth and natural image morphing sequences. |
This paper is significant as it addresses a key limitation of diffusion models compared to GANs: their difficulty in smooth image interpolation, essential for realistic image morphing with various applications in animation, entertainment, and data augmentation. |
DiffMorpher works by first fine-tuning two LoRAs to capture the semantics of two input images. Then, it interpolates between both the LoRA parameters and the latent noises obtained by DDIM inversion, ensuring smooth semantic and spatial transitions. It further incorporates attention interpolation and replacement for texture consistency, AdaIN adjustment for color coherence, and a new sampling schedule for uniform transition speed. |
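A minimal sketch of the two interpolations at the core of the method: linear interpolation of the two fitted LoRA weight sets and spherical interpolation (slerp) of the two inverted latent noises. State-dict handling and shapes are simplified placeholders.

```python
import torch

def lerp_lora(lora_a, lora_b, alpha):
    """Linearly interpolate two LoRA state dicts with identical keys/shapes."""
    return {k: (1 - alpha) * lora_a[k] + alpha * lora_b[k] for k in lora_a}

def slerp(z_a, z_b, alpha, eps=1e-7):
    """Spherical interpolation between two latent noise tensors."""
    a, b = z_a.flatten(), z_b.flatten()
    omega = torch.acos((a @ b / (a.norm() * b.norm())).clamp(-1 + eps, 1 - eps))
    out = (torch.sin((1 - alpha) * omega) * a + torch.sin(alpha * omega) * b) / torch.sin(omega)
    return out.reshape(z_a.shape)

# Toy morphing schedule: interpolate both LoRA weights and latent noises.
lora_a = {"unet.lora.weight": torch.randn(8, 8)}
lora_b = {"unet.lora.weight": torch.randn(8, 8)}
z_a, z_b = torch.randn(4, 64, 64), torch.randn(4, 64, 64)
for alpha in torch.linspace(0, 1, 5):
    lora_t = lerp_lora(lora_a, lora_b, float(alpha))
    z_t = slerp(z_a, z_b, float(alpha))
    # lora_t and z_t would parameterize the diffusion sampler for this frame.
print(z_t.shape, list(lora_t))
```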
DiffMorpher demonstrates superior performance over existing image morphing methods, evidenced by lower FID, PPL, and a newly proposed PDV metric on their MorphBench dataset. The approach produces high-quality, semantically consistent, and smooth image morphing sequences for diverse objects and styles, confirmed by both qualitative and quantitative evaluations, including a user study. |
Limitations include the need for LoRA training time for each image pair and reliance on text prompts. Future work could explore faster adaptation methods and incorporate correspondence information for challenging cases with unclear object alignment. |
diffusion_model, image_morphing, lora, attention_mechanism, smooth_interpolation, stable diffusion, ddim, adain |
2404.05331 |
Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt |
Zhiqi Huang, Huixin Xiong, Haoyu Wang, Longguang Wang, Zhiheng Li |
Text-to-image generation has witnessed great progress, especially with the
recent advancements in diffusion models. Since texts cannot provide detailed
conditions like object appearance, reference images are usually leveraged for
the control of objects in the generated images. However, existing methods still
suffer limited accuracy when the relationship between the foreground and
background is complicated. To address this issue, we develop a framework termed
Mask-ControlNet by introducing an additional mask prompt. Specifically, we
first employ large vision models to obtain masks to segment the objects of
interest in the reference image. Then, the object images are employed as
additional prompts to facilitate the diffusion model to better understand the
relationship between foreground and background regions during image generation.
Experiments show that the mask prompts enhance the controllability of the
diffusion model to maintain higher fidelity to the reference image while
achieving better image quality. Comparison with previous text-to-image
generation methods demonstrates our method's superior quantitative and
qualitative performance on the benchmark datasets. |
This paper introduces Mask-ControlNet, a novel framework for enhancing text-to-image generation quality using an additional mask prompt, aiming to improve object fidelity and foreground-background harmony. |
This research is important because it addresses the limitations of existing text-to-image generation models in accurately replicating objects from reference images, particularly in complex compositions, and proposes a solution to enhance image quality and controllability. |
The authors propose a two-stage framework: 1) Training phase: They train a diffusion model with a combination of text prompts, reference images, and object masks extracted using SAM. The model learns to generate images conditioned on these inputs. 2) Inference phase: Given a reference image and a text prompt, SAM segments the object, and the model generates an image adhering to the text prompt while maintaining fidelity to the segmented object. |
The paper shows that using mask prompts leads to: (1) improved object fidelity, preserving details and reducing distortions; (2) better handling of complex foreground-background relationships, resulting in more harmonious compositions; (3) quantitatively, Mask-ControlNet outperforming existing methods in FID, PSNR, SSIM, LPIPS, CLIP, and DINO scores; and (4) qualitatively, higher visual quality and realism in the generated images, as confirmed by user studies. |
The paper does not explicitly mention limitations or future work. However, potential areas for improvement include: - Exploring different mask generation techniques beyond SAM to handle more complex scenes and object boundaries. - Investigating the generalization ability of the model to unseen object categories and diverse datasets. - Extending the framework to allow for more fine-grained control over object placement and relationships within the generated image. |
diffusion_model, image_generation, object_reconstruction, mask, controllability, foreground-background, fidelity |
2402.12004 |
Direct Consistency Optimization for Compositional Text-to-Image Personalization |
Kyungmin Lee, Sangkyung Kwak, Kihyuk Sohn, Jinwoo Shin |
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal
images, are able to generate visuals with a high degree of consistency.
However, they still lack in synthesizing images of different scenarios or
styles that are possible in the original pretrained models. To address this, we
propose to fine-tune the T2I model by maximizing consistency to reference
images, while penalizing the deviation from the pretrained model. We devise a
novel training objective for T2I diffusion models that minimally fine-tunes the
pretrained model to achieve consistency. Our method, dubbed \emph{Direct
Consistency Optimization}, is as simple as regular diffusion loss, while
significantly enhancing the compositionality of personalized T2I models. Also,
our approach induces a new sampling method that controls the tradeoff between
image fidelity and prompt fidelity. Lastly, we emphasize the necessity of using
a comprehensive caption for reference images to further enhance the image-text
alignment. We show the efficacy of the proposed method on the T2I
personalization for subject, style, or both. In particular, our method results
in a superior Pareto frontier to the baselines. Generated examples and codes
are in our project page( https://dco-t2i.github.io/). |
This paper introduces Direct Consistency Optimization (DCO), a novel fine-tuning objective for Text-to-Image (T2I) diffusion models that improves personalized image generation by maximizing consistency to reference images while minimizing deviation from the pretrained model. |
This paper is important because it addresses the limitations of current personalized T2I models, which often struggle to balance subject consistency with the ability to generate diverse images in different scenarios or styles. DCO offers a more principled approach to fine-tuning, resulting in more compositional and controllable image generation. |
The authors formulate fine-tuning as a constrained policy optimization problem, encouraging the model to learn minimal information from reference images while retaining knowledge from the pretrained model. They derive an upper bound to this objective, leading to the DCO loss, which is as easy to implement as the standard diffusion loss. They also introduce a 'reward guidance' sampling method to control the trade-off between subject fidelity and text prompt fidelity and emphasize the importance of using comprehensive captions for reference images. |
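A minimal sketch of a consistency-versus-deviation objective in the spirit described above, written as a sigmoid coupling between the fine-tuned and frozen pretrained noise-prediction losses; the exact form of the paper's DCO loss, the beta value, and the call signatures are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def dco_style_loss(eps_pred_ft, eps_pred_ref, eps_true, beta=1.0):
    """Reward the fine-tuned (e.g. LoRA) model for fitting the reference image
    better than the frozen pretrained model, while discouraging large deviation."""
    loss_ft = F.mse_loss(eps_pred_ft, eps_true, reduction="none").mean(dim=(1, 2, 3))
    loss_ref = F.mse_loss(eps_pred_ref, eps_true, reduction="none").mean(dim=(1, 2, 3))
    return -F.logsigmoid(-beta * (loss_ft - loss_ref)).mean()

# Toy shapes: a batch of 2 latents with 4 channels at 64x64.
shape = (2, 4, 64, 64)
print(dco_style_loss(torch.randn(shape), torch.randn(shape), torch.randn(shape)))
```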
DCO outperforms baselines like DreamBooth and its variants in subject and style personalization tasks. Notably, DCO generates images with higher fidelity to both subjects and input text prompts, as evidenced by quantitative metrics and qualitative examples. It also enables the seamless composition of independently fine-tuned subject and style models without requiring additional post-processing steps like ZipLoRA. |
The authors acknowledge the increased computational burden of DCO during both training and inference due to the additional forward passes through the pretrained model. They suggest exploring efficient fine-tuning methods to enhance scalability. Additionally, while cosine similarity was used to assess LoRA compatibility, the authors acknowledge the need for further investigation into metrics that accurately capture interference between LoRA models. |
diffusion_model, t2i, personalization, fine-tuning, compositionality, image_generation, dreambooth, lora, consistency, reward_guidance |
2403.14602 |
ReNoise: Real Image Inversion Through Iterative Noising |
Daniel Garibi, Or Patashnik, Andrey Voynov, Hadar Averbuch-Elor, Daniel Cohen-Or |
Recent advancements in text-guided diffusion models have unlocked powerful
image manipulation capabilities. However, applying these methods to real images
necessitates the inversion of the images into the domain of the pretrained
diffusion model. Achieving faithful inversion remains a challenge, particularly
for more recent models trained to generate images with a small number of
denoising steps. In this work, we introduce an inversion method with a high
quality-to-operation ratio, enhancing reconstruction accuracy without
increasing the number of operations. Building on reversing the diffusion
sampling process, our method employs an iterative renoising mechanism at each
inversion sampling step. This mechanism refines the approximation of a
predicted point along the forward diffusion trajectory, by iteratively applying
the pretrained diffusion model, and averaging these predictions. We evaluate
the performance of our ReNoise technique using various sampling algorithms and
models, including recent accelerated diffusion models. Through comprehensive
evaluations and comparisons, we show its effectiveness in terms of both
accuracy and speed. Furthermore, we confirm that our method preserves
editability by demonstrating text-driven image editing on real images. |
This paper proposes ReNoise, a new diffusion model inversion method that enhances reconstruction accuracy and editability, especially for recent few-step models, without increasing computational cost. |
This research is important because it addresses the limitations of existing inversion methods for real image editing with diffusion models, particularly in the context of few-step models which are essential for interactive editing workflows. |
The authors developed ReNoise, a technique based on fixed-point iteration that refines the approximation of points along the forward diffusion trajectory during the inversion process. This is achieved by iteratively renoising the latent representation using the pre-trained diffusion model and averaging the resulting predictions. They also introduce techniques to enhance editability and correct noise in non-deterministic samplers. |
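A schematic of the iterative renoising idea at a single inversion step, assuming hypothetical `predict_noise(latent, t)` and `ddim_step(latent, noise, t, t_next)` callables; it illustrates the refine-and-average mechanism rather than reproducing the authors' code.

```python
import torch

def renoise_inversion_step(z_t, t, t_next, predict_noise, ddim_step, k=4):
    """One ReNoise-style inversion step: repeatedly re-estimate the noise at the
    approximated next point, average the estimates, and redo the reversed DDIM step."""
    eps = predict_noise(z_t, t)                       # initial estimate at the current point
    z_next = ddim_step(z_t, eps, t, t_next)           # first approximation of z_{t_next}
    estimates = []
    for _ in range(k):
        estimates.append(predict_noise(z_next, t_next))   # renoise at the approximation
        eps_avg = torch.stack(estimates).mean(dim=0)      # average all estimates so far
        z_next = ddim_step(z_t, eps_avg, t, t_next)       # refine the approximation
    return z_next

# Toy stand-ins so the sketch runs end to end.
toy_predict = lambda z, t: 0.1 * torch.randn_like(z)
toy_ddim = lambda z, eps, t, t_next: z + (t_next - t) * eps
print(renoise_inversion_step(torch.randn(1, 4, 64, 64), 0.2, 0.4, toy_predict, toy_ddim).shape)
```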
ReNoise demonstrates superior reconstruction quality compared to existing sampler reversing methods, including DDIM inversion, for a fixed number of UNet operations. It also shows improved editability, enabling successful text-driven manipulations on real images, even with few-step models like SDXL Turbo and LCM LoRA. ReNoise is numerically stable, converges consistently, and outperforms other null-prompt inversion methods in terms of speed and accuracy. |
The authors acknowledge the limitation of model-specific hyperparameter tuning for edit enhancement and noise correction in ReNoise. Future work includes more extensive testing with advanced editing methods and adapting ReNoise to video diffusion models. |
diffusion_model, image_editing, inversion, few-step_models, analysis, ddim, sdxl turbo, lcm |
2312.13558 |
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction |
Pratyusha Sharma, Jordan T. Ash, Dipendra Misra |
Transformer-based Large Language Models (LLMs) have become a fixture in
modern machine learning. Correspondingly, significant resources are allocated
towards research that aims to further advance this technology, typically
resulting in models of increasing size that are trained on increasing amounts
of data. This work, however, demonstrates the surprising result that it is
often possible to significantly improve the performance of LLMs by selectively
removing higher-order components of their weight matrices. This simple
intervention, which we call LAyer-SElective Rank reduction (LASER), can be done
on a model after training has completed, and requires no additional parameters
or data. We show extensive experiments demonstrating the generality of this
finding across language models and datasets, and provide in-depth analyses
offering insights into both when LASER is effective and the mechanism by which
it operates. |
This paper introduces LAyer-SElective Rank reduction (LASER), a technique for improving the performance of Large Language Models (LLMs) by selectively removing higher-order components from weight matrices in specific layers. |
The paper is important because it challenges the conventional belief that larger models always perform better. It demonstrates a simple yet effective method to enhance LLM accuracy on various NLP and even reinforcement learning tasks without requiring additional training data or parameters. |
The authors apply LASER by using Singular Value Decomposition (SVD) to identify and remove higher-order components from specific weight matrices of pre-trained LLMs. They experiment with different layers and reduction percentages, evaluating the impact on accuracy and other metrics across various datasets and LLM architectures. |
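The intervention itself is compact: take a chosen weight matrix, compute its SVD, and keep only the low-rank part. The layer choice and the fraction of rank retained are the searched-over hyperparameters; the PyTorch sketch below uses an arbitrary fraction for illustration.

```python
import torch

def laser_reduce(weight: torch.Tensor, keep_fraction: float = 0.05) -> torch.Tensor:
    """Replace a weight matrix by its rank-k approximation, discarding the
    higher-order (small singular value) components."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    k = max(1, int(keep_fraction * S.numel()))
    return (U[:, :k] * S[:k]) @ Vh[:k, :]

# Example: reduce a stand-in for one MLP projection matrix.
with torch.no_grad():
    W = torch.randn(2048, 512)
    W_reduced = laser_reduce(W, keep_fraction=0.05)
    print(torch.linalg.matrix_rank(W_reduced))   # ~5% of the original rank
```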
LASER significantly improves accuracy on several NLP tasks, especially those involving less frequent information in the training data. For instance, GPT-J's accuracy on the CounterFact dataset increased from 13.3% to 24.1%. The technique also enhances robustness to paraphrases. Notably, LASER even benefits a Decision Transformer agent in a Sokoban environment, hinting at broader applicability beyond NLP. |
The authors acknowledge limitations and propose future work on: (1) understanding why higher-order components accumulate noisy answers during training, (2) investigating the effect of model architecture on LASER's effectiveness, and (3) explaining the specific benefit of pruning later MLP layers. Further research is needed to explore alternative pruning methods and analyze the impact of LASER on language modeling and fluency in detail. |
llm, analysis, svd, pruning, rank_reduction, question_answering, factuality, decision_transformer, reinforcement_learning |
2402.06196 |
Large Language Models: A Survey |
Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao |
Large Language Models (LLMs) have drawn a lot of attention due to their
strong performance on a wide range of natural language tasks, since the release
of ChatGPT in November 2022. LLMs' ability of general-purpose language
understanding and generation is acquired by training billions of model's
parameters on massive amounts of text data, as predicted by scaling laws
\cite{kaplan2020scaling,hoffmann2022training}. The research area of LLMs, while
very recent, is evolving rapidly in many different ways. In this paper, we
review some of the most prominent LLMs, including three popular LLM families
(GPT, LLaMA, PaLM), and discuss their characteristics, contributions and
limitations. We also give an overview of techniques developed to build, and
augment LLMs. We then survey popular datasets prepared for LLM training,
fine-tuning, and evaluation, review widely used LLM evaluation metrics, and
compare the performance of several popular LLMs on a set of representative
benchmarks. Finally, we conclude the paper by discussing open challenges and
future research directions. |
This paper presents a survey of Large Language Models (LLMs), covering their evolution from early neural language models, prominent LLM families (GPT, LLaMA, PaLM), techniques for building and augmenting LLMs, popular datasets and benchmarks, and an overview of performance comparisons. |
This paper is important due to the rapid evolution and increasing influence of LLMs in various domains. It provides a comprehensive overview of LLM advancements, techniques, and challenges, serving as a valuable resource for researchers and practitioners seeking to understand and utilize LLMs effectively. |
The paper conducts a literature review, summarizing key findings and advancements in the field of LLMs. It analyzes prominent LLM architectures, pre-training methods, fine-tuning and alignment techniques, and prompt engineering strategies. Additionally, it reviews popular datasets and benchmarks used for LLM evaluation, comparing the performance of notable models. |
The survey highlights the impressive performance and capabilities of LLMs across various NLP tasks, including commonsense reasoning, code generation, and question answering. It showcases the benefits of prompt engineering techniques like Chain of Thought (CoT), Retrieval Augmented Generation (RAG), and the use of external tools to augment LLM functionality. The paper also emphasizes the importance of addressing challenges like hallucination, ethical concerns, and the need for smaller and more efficient LLM models. |
The paper identifies several challenges and future research directions for LLMs, including the development of smaller and more efficient models, exploring new post-attention architectural paradigms, enhancing multi-modal capabilities, improving LLM usage and augmentation techniques, and addressing security and ethical concerns. It emphasizes the need for continued research in these areas to unlock the full potential of LLMs while mitigating their limitations. |
llm, survey, gpt, llama, palm, transformer, pre-training, fine-tuning, alignment, prompt_engineering, rag, hallucination, ethical_ai, multi-modal, analysis, literature_review, code_generation, reasoning |
2311.15657 |
Enhancing Diffusion Models with Text-Encoder Reinforcement Learning |
Chaofeng Chen, Annan Wang, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, Weisi Lin |
Text-to-image diffusion models are typically trained to optimize the
log-likelihood objective, which presents challenges in meeting specific
requirements for downstream tasks, such as image aesthetics and image-text
alignment. Recent research addresses this issue by refining the diffusion U-Net
using human rewards through reinforcement learning or direct backpropagation.
However, many of them overlook the importance of the text encoder, which is
typically pretrained and fixed during training. In this paper, we demonstrate
that by finetuning the text encoder through reinforcement learning, we can
enhance the text-image alignment of the results, thereby improving the visual
quality. Our primary motivation comes from the observation that the current
text encoder is suboptimal, often requiring careful prompt adjustment. While
fine-tuning the U-Net can partially improve performance, it remains suffering
from the suboptimal text encoder. Therefore, we propose to use reinforcement
learning with low-rank adaptation to finetune the text encoder based on
task-specific rewards, referred as \textbf{TexForce}. We first show that
finetuning the text encoder can improve the performance of diffusion models.
Then, we illustrate that TexForce can be simply combined with existing U-Net
finetuned models to get much better results without additional training.
Finally, we showcase the adaptability of our method in diverse applications,
including the generation of high-quality face and hand images. |
This paper introduces TexForce, a novel method to improve text-to-image diffusion models by fine-tuning the text encoder using reinforcement learning with low-rank adaptation (LoRA) and task-specific rewards, leading to better text-image alignment and higher visual quality. |
This paper addresses the limitation of previous diffusion model fine-tuning methods that solely focus on the U-Net, neglecting the importance of the text encoder. It demonstrates that fine-tuning the text encoder is crucial for aligning generated images with text prompts, especially with limited training data, and shows its efficacy across different tasks and backbones. |
The authors propose TexForce, which employs reinforcement learning, particularly the DDPO algorithm, to update the text encoder by maximizing task-specific rewards for generated images. They utilize LoRA for efficient fine-tuning and demonstrate its flexibility by combining LoRA weights from different tasks. Experiments are conducted with various prompt datasets, reward functions (ImageReward, HPSv2, face quality, hand detection confidence), and diffusion model backbones (SDv1.4, SDv1.5, SDv2.1). |
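A rough sketch of what fusing task-specific LoRA updates into a frozen linear weight can look like; the additive merge, the scales, and the task names are assumptions for illustration rather than the paper's exact procedure.

```python
import torch

def merge_lora_deltas(base_weight, lora_pairs, scales):
    """Fuse several LoRA updates into one weight matrix.
    lora_pairs: list of (A, B) with A of shape (r, d_in) and B of shape (d_out, r),
    so each update adds scale * B @ A onto the frozen base weight."""
    merged = base_weight.clone()
    for (A, B), s in zip(lora_pairs, scales):
        merged += s * (B @ A)
    return merged

# Example: combine two hypothetical reward-specific LoRAs for a text-encoder projection.
W0 = torch.randn(768, 768)
face_lora = (torch.randn(4, 768), torch.randn(768, 4))
aesthetics_lora = (torch.randn(4, 768), torch.randn(768, 4))
W_fused = merge_lora_deltas(W0, [face_lora, aesthetics_lora], scales=[0.7, 0.7])
```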
TexForce significantly enhances text-image alignment and visual quality across various tasks, outperforming existing methods like DPOK, ReFL, and AlignProp. It shows robust performance on different backbones and the capability to combine with U-Net fine-tuning for further improvement. GPT-4V evaluation confirms its effectiveness in both aesthetics and text-coherence. Furthermore, the fusion of LoRA weights enables enhancement of specific objects within generated images. |
The authors acknowledge limitations regarding sample efficiency and complexity of reward function engineering inherent to RL-based methods. They also raise concerns about potential misuse for misinformation and intellectual property infringement. Future work could address these limitations and explore broader applications of TexForce. |
diffusion_model, text-to-image, reinforcement_learning, lora, text-image_alignment, image_quality, gpt-4v, face_generation, hand_generation |
2309.03904 |
Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis |
Jiapeng Zhu, Ceyuan Yang, Kecheng Zheng, Yinghao Xu, Zifan Shi, Yujun Shen |
Due to the difficulty in scaling up, generative adversarial networks (GANs)
seem to be falling from grace on the task of text-conditioned image synthesis.
Sparsely-activated mixture-of-experts (MoE) has recently been demonstrated as a
valid solution to training large-scale models with limited computational
resources. Inspired by such a philosophy, we present Aurora, a GAN-based
text-to-image generator that employs a collection of experts to learn feature
processing, together with a sparse router to help select the most suitable
expert for each feature point. To faithfully decode the sampling stochasticity
and the text condition to the final synthesis, our router adaptively makes its
decision by taking into account the text-integrated global latent code. At
64x64 image resolution, our model trained on LAION2B-en and COYO-700M achieves
6.2 zero-shot FID on MS COCO. We release the code and checkpoints to facilitate
the community for further development. |
This paper introduces Aurora, a text-to-image GAN model that leverages Sparse Mixture of Experts (MoE) to enhance model capacity and generate high-quality images from text descriptions. |
This paper is important because it addresses the limitations of GANs in text-to-image synthesis, particularly their difficulty in scaling up to handle complex datasets and open-vocabulary text prompts. By incorporating Sparse MoE, Aurora achieves comparable performance to diffusion models while maintaining faster generation speeds. The release of their code and checkpoints also provides a valuable resource for the research community to further explore and advance text-to-image generation with GANs. |
The authors developed Aurora, a GAN-based text-to-image generator, incorporating a Sparse Mixture of Experts (MoE) approach. The generator uses CLIP to encode the input text and a mapping network to process both the text and a latent code. A series of generative blocks, each with a convolution block and an attention block, progressively increase the resolution of the generated image. The attention block employs MoE, utilizing a sparse router to select the most appropriate expert for each feature point based on both the input feature and text information. The model is trained progressively on LAION2B-en and COYO-700M datasets using a combination of adversarial loss, matching-aware loss, multi-level CLIP loss, and MoE loss. The authors use reference FID scores as an indicator to transition between training stages at different image resolutions. |
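A minimal sketch of a top-1 sparse router over expert MLPs, conditioned on both the feature point and a text-integrated global code as described above; the dimensions, the concatenation scheme, and the expert architecture are assumptions of this illustration.

```python
import torch
import torch.nn as nn

class SparseMoEBlock(nn.Module):
    """Per-point top-1 routing among expert MLPs, conditioned on a global latent code."""
    def __init__(self, dim=256, cond_dim=128, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)])
        self.router = nn.Linear(dim + cond_dim, num_experts)

    def forward(self, feats, global_code):
        # feats: (B, N, dim) feature points; global_code: (B, cond_dim)
        cond = global_code.unsqueeze(1).expand(-1, feats.shape[1], -1)
        idx = self.router(torch.cat([feats, cond], dim=-1)).argmax(dim=-1)  # hard top-1
        out = torch.zeros_like(feats)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = expert(feats[mask])   # each point handled by its chosen expert
        return out

block = SparseMoEBlock()
y = block(torch.randn(2, 64, 256), torch.randn(2, 128))
```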
Aurora achieves a 6.2 zero-shot FID score on MS COCO at 64x64 resolution, demonstrating its capability for open-vocabulary text-to-image synthesis. The authors also found that their sparse router effectively clusters pixels with similar visual concepts. Interestingly, they observed unexpected behavior during latent space interpolation, suggesting a potential research direction in disentangling text conditions and sampling stochasticity. |
The paper acknowledges limitations in latent space interpolation, attributing them to the absence of perceptual path length regularization and potential dominance of text tokens over the global latent code. Future work includes investigating these issues, exploring better text information injection methods, and improving the model's performance and functionality using cleaner, higher-quality datasets. |
diffusion_model, gan, text-to-image, image_synthesis, sparse_moe, attention, latent_space, open-vocabulary |
2404.14367 |
Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data |
Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar |
Learning from preference labels plays a crucial role in fine-tuning large
language models. There are several distinct approaches for preference
fine-tuning, including supervised learning, on-policy reinforcement learning
(RL), and contrastive learning. Different methods come with different
implementation tradeoffs and performance differences, and existing empirical
findings present different conclusions, for instance, some results show that
online RL is quite important to attain good fine-tuning results, while others
find (offline) contrastive or even purely supervised methods sufficient. This
raises a natural question: what kind of approaches are important for
fine-tuning with preference data and why? In this paper, we answer this
question by performing a rigorous analysis of a number of fine-tuning
techniques on didactic and full-scale LLM problems. Our main finding is that,
in general, approaches that use on-policy sampling or attempt to push down the
likelihood on certain responses (i.e., employ a "negative gradient") outperform
offline and maximum likelihood objectives. We conceptualize our insights and
unify methods that use on-policy sampling or negative gradient under a notion
of mode-seeking objectives for categorical distributions. Mode-seeking
objectives are able to alter probability mass on specific bins of a categorical
distribution at a fast rate compared to maximum likelihood, allowing them to
relocate masses across bins more effectively. Our analysis prescribes
actionable insights for preference fine-tuning of LLMs and informs how data
should be collected for maximal improvement. |
This paper investigates the effectiveness of different fine-tuning methods for large language models (LLMs) on tasks involving binary preferences, particularly focusing on the roles of on-policy sampling and negative gradients. |
This paper provides clarity on the effectiveness and trade-offs of various LLM fine-tuning methods, guiding practitioners in selecting the best approach for their specific preference optimization problem. It unifies seemingly distinct notions of on-policy sampling and negative gradients under the concept of mode-seeking objectives, which helps in understanding the behavior of different algorithms. |
The authors conduct a rigorous empirical study using a variety of tasks, including didactic bandit problems, synthetic LLM problems with hand-crafted reward functions, and full-scale LLM fine-tuning problems with real human preference data from AlpacaFarm and UltraFeedback. They analyze the performance of different algorithms (PPO, REINFORCE, DPO, IPO, RWR, Pref-FT, Best-of-N) by varying the degree of on-policy sampling and use of negative gradients. |
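For concreteness, the "negative gradient" the analysis refers to can be seen in the standard DPO objective, whose loss explicitly pushes down the likelihood of the dispreferred response rather than only raising the preferred one; the sketch below is the usual DPO loss on per-sequence log-probabilities, not code from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on summed token log-probs of chosen/rejected responses.
    The rejected-response term supplies the 'negative gradient': it actively moves
    probability mass off dispreferred completions, unlike pure maximum likelihood."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy per-sequence log-probabilities for a batch of 3 preference pairs.
lp_c, lp_r = torch.tensor([-12.0, -9.5, -20.0]), torch.tensor([-11.0, -14.0, -18.0])
ref_c, ref_r = torch.tensor([-13.0, -10.0, -19.0]), torch.tensor([-11.5, -13.0, -18.5])
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
```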
The key findings are that on-policy sampling significantly improves performance and efficiency, especially when the reward peak is far from the reference policy. Negative gradients are also beneficial, leading to faster convergence, and complement on-policy sampling. The study finds that both techniques are unified by the concept of mode-seeking divergences, which prioritize sharpening probability mass on high-reward regions, as opposed to mode-covering objectives like maximum likelihood. |
The paper acknowledges limitations in terms of lacking rigorous statistical guarantees for the observed benefits of on-policy sampling and negative gradients. Future work could involve formalizing these benefits statistically. Further exploration could incorporate the role of pre-training distribution coverage, reward model quality, and recent minimax formulations in preference optimization. |
llm, analysis, fine-tuning, preference_learning, reinforcement_learning, contrastive_learning, on-policy, negative_gradient, mode-seeking |
2308.12605 |
APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency |
Yupu Yao, Shangqi Deng, Zihan Cao, Harry Zhang, Liang-Jian Deng |
Diffusion models have exhibited promising progress in video generation.
However, they often struggle to retain consistent details within local regions
across frames. One underlying cause is that traditional diffusion models
approximate Gaussian noise distribution by utilizing predictive noise, without
fully accounting for the impact of inherent information within the input
itself. Additionally, these models emphasize the distinction between
predictions and references, neglecting information intrinsic to the videos. To
address this limitation, inspired by the self-attention mechanism, we propose a
novel text-to-video (T2V) generation network structure based on diffusion
models, dubbed Additional Perturbation for Latent noise with Adversarial
training (APLA). Our approach only necessitates a single video as input and
builds upon pre-trained stable diffusion networks. Notably, we introduce an
additional compact network, known as the Video Generation Transformer (VGT).
This auxiliary component is designed to extract perturbations from the inherent
information contained within the input, thereby refining inconsistent pixels
during temporal predictions. We leverage a hybrid architecture of transformers
and convolutions to compensate for temporal intricacies, enhancing consistency
between different frames within the video. Experiments demonstrate a noticeable
improvement in the consistency of the generated videos both qualitatively and
quantitatively. |
This paper introduces APLA, a novel text-to-video generation network structure based on diffusion models, which leverages an additional compact network called Video Generation Transformer (VGT) to enhance the consistency of generated videos by extracting and utilizing inherent information from the input video. |
This paper addresses the limitations of existing video generation diffusion models in maintaining consistency across frames, particularly in retaining local details. It proposes a novel approach using VGT and adversarial training to improve the temporal coherence and overall quality of generated videos, marking a significant step towards high-fidelity video generation. |
The authors propose APLA, which adds VGT on top of pre-trained diffusion models. VGT, designed in two variants (pure Transformer decoder and a hybrid with 3D convolution), extracts inherent information from the input video. The authors introduce a hyper-loss function combining MSE, L1, and perceptual loss for better latent noise fitting. Furthermore, they incorporate adversarial training with a 1x1 convolutional discriminator to enhance the robustness and quality of the generated videos. Experiments were conducted on the DAVIS dataset, comparing APLA with existing methods using CLIP score and FCI metrics. Ablation studies were also performed to evaluate the impact of each component in APLA. |
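A sketch of a combined MSE + L1 + perceptual objective of the kind described, shown on image-shaped tensors for readability; the weighting coefficients and the choice of an (untrained, offline) VGG16 feature slice as the perceptual backbone are assumptions — in practice a pretrained feature extractor would be used.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen feature extractor for the perceptual term (weights=None keeps the sketch offline).
_feats = vgg16(weights=None).features[:9].eval()
for p in _feats.parameters():
    p.requires_grad_(False)

def hyper_loss(pred, target, w_mse=1.0, w_l1=1.0, w_perc=0.1):
    """Combine pixel-space MSE and L1 with a feature-space (perceptual) distance."""
    perceptual = F.mse_loss(_feats(pred), _feats(target))
    return w_mse * F.mse_loss(pred, target) + w_l1 * F.l1_loss(pred, target) + w_perc * perceptual

# Toy 3-channel frames in [0, 1].
print(hyper_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)))
```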
APLA demonstrates superior performance in generating consistent and high-quality videos compared to existing methods. Notably, it shows significant improvement in retaining local details across frames, addressing a key limitation of previous diffusion models. Quantitative evaluations using CLIP score and FCI confirm APLA's enhanced content and frame consistency, achieving state-of-the-art results. Ablation studies confirm that each component of APLA contributes to the overall performance, with the full model achieving the best results, showcasing the effectiveness of combining VGT, hyper-loss, and adversarial training. |
The authors acknowledge limitations regarding the computational cost of APLA, which requires more time for inference compared to some existing methods. For future work, exploring more efficient architectures for VGT to reduce computational complexity is suggested. Additionally, investigating the generalization capabilities of APLA on a wider range of datasets and exploring its application to other video generation tasks, such as video prediction or video editing, could be promising directions. |
diffusion_model, video, generation, t2v, consistency, transformer, adversarial_training |
2403.13807 |
Editing Massive Concepts in Text-to-Image Diffusion Models |
Tianwei Xiong, Yue Wu, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu |
Text-to-image diffusion models suffer from the risk of generating outdated,
copyrighted, incorrect, and biased content. While previous methods have
mitigated the issues on a small scale, it is essential to handle them
simultaneously in larger-scale real-world scenarios. We propose a two-stage
method, Editing Massive Concepts In Diffusion Models (EMCID). The first stage
performs memory optimization for each individual concept with dual
self-distillation from text alignment loss and diffusion noise prediction loss.
The second stage conducts massive concept editing with multi-layer, closed form
model editing. We further propose a comprehensive benchmark, named ImageNet
Concept Editing Benchmark (ICEB), for evaluating massive concept editing for
T2I models with two subtasks, free-form prompts, massive concept categories,
and extensive evaluation metrics. Extensive experiments conducted on our
proposed benchmark and previous benchmarks demonstrate the superior scalability
of EMCID for editing up to 1,000 concepts, providing a practical approach for
fast adjustment and re-deployment of T2I diffusion models in real-world
applications. |
This paper introduces EMCID, a two-stage method for editing large numbers of concepts in text-to-image diffusion models, addressing issues like outdated information, biases, and copyright infringement. |
The paper is important because it offers a practical solution to mitigate problematic content generation in large diffusion models, which is crucial for their safe and responsible deployment in real-world applications. |
EMCID first optimizes individual concept representations in the text encoder using dual self-distillation from text alignment and noise prediction losses. The second stage then aggregates these optimized representations and edits multiple layers of the model using a closed-form solution. |
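Closed-form editing of this kind is usually posed as a least-squares update of a linear layer so that selected keys map to new target values while staying close to the original weights; the sketch below is a generic ridge-regression version of that idea, not EMCID's exact multi-layer update.

```python
import torch

def closed_form_edit(W0, K, V, lam=1e-2):
    """Solve min_W ||W K - V||^2 + lam * ||W - W0||^2 in closed form.
    W0: (d_out, d_in) original weights; K: (d_in, n) edit keys; V: (d_out, n) target values."""
    A = K @ K.T + lam * torch.eye(W0.shape[1])   # (d_in, d_in)
    B = V @ K.T + lam * W0                       # (d_out, d_in)
    return B @ torch.linalg.inv(A)

# Toy edit: remap 5 concept keys in a 64 -> 32 linear layer.
W0, K, V = torch.randn(32, 64), torch.randn(64, 5), torch.randn(32, 5)
W_edited = closed_form_edit(W0, K, V, lam=1e-3)
print(torch.dist(W_edited @ K, V))   # the edited layer now (approximately) maps K to V
```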
EMCID demonstrates superior scalability compared to previous methods, successfully editing up to 1,000 concepts while preserving the model's generation quality. It excels in updating, erasing, and rectifying concepts, as evidenced by extensive evaluations on the proposed ImageNet Concept Editing Benchmark (ICEB) and other benchmarks. |
The authors acknowledge that EMCID might not effectively eliminate NSFW content generation, particularly from prompts with low toxicity. Future work could focus on addressing this limitation, potentially by combining EMCID with methods targeting other parts of the diffusion model. |
diffusion_model, concept_editing, text-to-image, model_editing, large_scale, interpretability |
2403.02580 |
What do we learn from inverting CLIP models? |
Hamid Kazemi, Atoosa Chegini, Jonas Geiping, Soheil Feizi, Tom Goldstein |
We employ an inversion-based approach to examine CLIP models. Our examination
reveals that inverting CLIP models results in the generation of images that
exhibit semantic alignment with the specified target prompts. We leverage these
inverted images to gain insights into various aspects of CLIP models, such as
their ability to blend concepts and inclusion of gender biases. We notably
observe instances of NSFW (Not Safe For Work) images during model inversion.
This phenomenon occurs even for semantically innocuous prompts, like "a
beautiful landscape," as well as for prompts involving the names of
celebrities. |
This paper investigates the inner workings and potential biases of CLIP models by employing an inversion-based approach, generating images from text prompts to analyze CLIP's understanding of concepts, gender, and its proclivity to produce NSFW content. |
This research is crucial as it provides insights into the often opaque training data and potential biases of widely used CLIP models, particularly highlighting the risk of generating NSFW content even from innocuous prompts, which has significant implications for downstream applications like text-to-image generation. |
The authors invert CLIP models by optimizing images to closely align with given text prompts, utilizing techniques like random augmentations, ensembling, and regularization. They analyze the generated images for their ability to blend concepts, the presence of NSFW content, gender biases, and the impact of training data scale. |
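The inversion loop can be sketched as direct pixel optimization against CLIP similarity under random augmentations; the augmentation choices, step count, learning rate, and the `encode_image`/`encode_text` interface (as in the original OpenAI CLIP release) are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

def invert_clip(clip_model, text_tokens, steps=500, n_aug=8, lr=0.05, size=224):
    """Optimize an image so that its augmented CLIP embeddings match a text prompt."""
    img = torch.rand(1, 3, size, size, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    augment = T.Compose([T.RandomResizedCrop(size, scale=(0.7, 1.0)),
                         T.RandomHorizontalFlip()])
    with torch.no_grad():
        txt = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
    for _ in range(steps):
        views = torch.cat([augment(img) for _ in range(n_aug)])   # ensembled random views
        emb = F.normalize(clip_model.encode_image(views), dim=-1)
        loss = -(emb @ txt.T).mean()                              # maximize text-image similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            img.clamp_(0, 1)                                      # keep pixels in a valid range
    return img.detach()
```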
The study reveals that CLIP models can blend concepts effectively, often producing recognizable images from celebrity names. However, it also uncovers a concerning tendency to generate NSFW imagery, even from seemingly harmless prompts, including those related to landscapes and certain celebrities. This suggests the presence of a significant amount of NSFW content in the training data. Additionally, the research exposes gender biases within CLIP, as it associates specific professions and social statuses with particular genders. Lastly, it demonstrates that the scale of the training data directly influences the quality of the generated images, with larger datasets yielding better results. |
The authors acknowledge the limitation of using generative methods to analyze a model not typically used for generation. Future work could involve exploring alternative methods to confirm these findings. Furthermore, the study emphasizes the need for better data filtering and curation during CLIP training to mitigate the generation of NSFW content and address inherent biases. Investigating methods to address the proximity of specific prompts to NSFW words in the embedding space is also crucial. |
clip, analysis, nsfw, gender bias, model inversion, interpretability |
2401.10020 |
Self-Rewarding Language Models |
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston |
We posit that to achieve superhuman agents, future models require superhuman
feedback in order to provide an adequate training signal. Current approaches
commonly train reward models from human preferences, which may then be
bottlenecked by human performance level, and secondly these separate frozen
reward models cannot then learn to improve during LLM training. In this work,
we study Self-Rewarding Language Models, where the language model itself is
used via LLM-as-a-Judge prompting to provide its own rewards during training.
We show that during Iterative DPO training that not only does instruction
following ability improve, but also the ability to provide high-quality rewards
to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a
model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard,
including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still
to explore, this work opens the door to the possibility of models that can
continually improve in both axes. |
This paper introduces Self-Rewarding Language Models, in which the language model itself supplies its own rewards via LLM-as-a-Judge prompting and is trained with Iterative DPO, so that both its instruction-following ability and its ability to reward itself improve over successive iterations. |
This work is important because reward models trained on human preferences can be bottlenecked by human performance and, once frozen, cannot improve during LLM training; letting the model generate its own training signal opens the door to models that continually improve on both the policy and the reward axes. |
The authors prompt the model to act as an LLM-as-a-Judge over its own candidate responses, use the resulting self-assigned rewards to build preference data, and apply Iterative DPO, repeating the generate-judge-train cycle over several iterations starting from Llama 2 70B. |
Three iterations of this procedure on Llama 2 70B yield a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613, and the model's ability to provide high-quality rewards to itself also improves across iterations. |
The authors acknowledge that much remains to be explored; the work is framed as a first step toward models that can continually improve in both instruction following and self-rewarding, and the limits of this iterative self-improvement are left to future study. |
llm, self-rewarding, llm-as-a-judge, dpo, iterative_dpo, instruction_following, preference_learning, alignment, llama |
2309.07906 |
Generative Image Dynamics |
Zhengqi Li, Richard Tucker, Noah Snavely, Aleksander Holynski |
We present an approach to modeling an image-space prior on scene motion. Our
prior is learned from a collection of motion trajectories extracted from real
video sequences depicting natural, oscillatory dynamics such as trees, flowers,
candles, and clothes swaying in the wind. We model this dense, long-term motion
prior in the Fourier domain: given a single image, our trained model uses a
frequency-coordinated diffusion sampling process to predict a spectral volume,
which can be converted into a motion texture that spans an entire video. Along
with an image-based rendering module, these trajectories can be used for a
number of downstream applications, such as turning still images into seamlessly
looping videos, or allowing users to realistically interact with objects in
real pictures by interpreting the spectral volumes as image-space modal bases,
which approximate object dynamics. |
This paper introduces a novel method for animating still images by predicting realistic, oscillatory motion using a learned image-space prior on scene dynamics. |
This work is significant because it addresses the challenge of synthesizing realistic and temporally coherent motion in videos generated from single images, which is crucial for creating believable visual content. |
The authors leverage spectral volumes, a frequency-domain representation of motion, and train a latent diffusion model to predict these volumes from single images. They then use an image-based rendering module to animate the input image according to the predicted motion. |
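To make the spectral-volume idea concrete, the sketch below turns per-pixel Fourier coefficients for a few low-frequency bands into dense displacements over time with an inverse FFT; the tensor layout and the number of bands are assumptions of this illustration.

```python
import torch

def spectral_volume_to_motion(spectral, num_frames):
    """spectral: complex tensor (H, W, 2, K) of per-pixel Fourier coefficients for the
    x/y displacement over K low-frequency bands. Returns (num_frames, H, W, 2) displacements."""
    H, W, _, K = spectral.shape
    full = torch.zeros(H, W, 2, num_frames // 2 + 1, dtype=torch.complex64)
    full[..., :K] = spectral                                 # zero-pad the higher frequencies
    motion = torch.fft.irfft(full, n=num_frames, dim=-1)     # back to the time domain
    return motion.permute(3, 0, 1, 2)

# Toy example: 16 frequency bands on a 64x64 image, expanded to 60 frames of motion.
spec = torch.randn(64, 64, 2, 16, dtype=torch.complex64)
print(spectral_volume_to_motion(spec, num_frames=60).shape)  # torch.Size([60, 64, 64, 2])
```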
The paper demonstrates superior quantitative and qualitative results compared to existing single-image animation methods, showing more realistic and temporally consistent video generation. The authors also showcase applications like seamless looping video generation and creating interactive dynamic images from single pictures. |
The authors acknowledge limitations in modeling non-oscillatory or high-frequency motions, and potential issues with thin objects or large displacements. Future work could explore learned motion bases, handle complex motion patterns, and address challenges in generating unseen content. |
diffusion_model, motion, video, analysis, 3d |
2312.12148 |
Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment |
Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, Fu Lee Wang |
With the continuous growth in the number of parameters of transformer-based
pretrained language models (PLMs), particularly the emergence of large language
models (LLMs) with billions of parameters, many natural language processing
(NLP) tasks have demonstrated remarkable success. However, the enormous size
and computational demands of these models pose significant challenges for
adapting them to specific downstream tasks, especially in environments with
limited computational resources. Parameter Efficient Fine-Tuning (PEFT) offers
an effective solution by reducing the number of fine-tuning parameters and
memory usage while achieving comparable performance to full fine-tuning. The
demands for fine-tuning PLMs, especially LLMs, have led to a surge in the
development of PEFT methods, as depicted in Fig. 1. In this paper, we present a
comprehensive and systematic review of PEFT methods for PLMs. We summarize
these PEFT methods, discuss their applications, and outline future directions.
Furthermore, we conduct experiments using several representative PEFT methods
to better understand their effectiveness in parameter efficiency and memory
efficiency. By offering insights into the latest advancements and practical
applications, this survey serves as an invaluable resource for researchers and
practitioners seeking to navigate the challenges and opportunities presented by
PEFT in the context of PLMs. |
This paper presents a comprehensive review and assessment of Parameter-Efficient Fine-Tuning (PEFT) methods for Pretrained Language Models (PLMs), focusing on their effectiveness in reducing trainable parameters and memory usage while maintaining comparable performance to full fine-tuning. |
This paper is important because it addresses the challenges of adapting large language models (LLMs) with billions of parameters to specific downstream tasks, especially given limited computational resources, by providing a systematic overview of PEFT methods and evaluating their performance across different tasks and models. |
The authors conducted their research by categorizing PEFT methods into five groups: additive fine-tuning, partial fine-tuning, reparameterized fine-tuning, hybrid fine-tuning, and unified fine-tuning. They then conducted experiments using eleven representative PEFT methods on three different types of PLMs (RoBERTa, T5, and LLaMA) across NLU, MT, and NLG tasks, evaluating their performance and memory usage. |
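As one concrete instance of the reparameterized fine-tuning family covered in the review, a LoRA layer adds a trainable low-rank update to a frozen linear weight; the rank and scaling below are illustrative defaults, not values from the paper's experiments.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W0 x + (alpha/r) * B A x."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)               # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # only the low-rank factors A and B are trainable
```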
Key findings include: (1) Most PEFT methods achieve comparable or better performance than full fine-tuning on the GLUE benchmark while significantly reducing the number of trainable parameters. (2) ProPELT adapter achieves the best average performance with only 1.5% of trainable parameters compared to full fine-tuning. (3) QLoRA significantly reduces GPU memory consumption, enabling fine-tuning of LLaMA with limited resources. (4) The effectiveness of PEFT methods in reducing memory usage increases with larger model sizes. |
The paper highlights several limitations and future directions, including: (1) Exploring lightweight hybrid PEFT methods that combine multiple PEFT methods for better performance with minimal parameter increase. (2) Developing more LoRA-derived PEFT methods, focusing on pruning and weight quantization to optimize storage and computation. (3) Expanding the PEFT library by integrating additional PEFT methods for wider application. (4) Conducting further theoretical studies to understand the underlying mechanisms of PEFT methods. (5) Exploring the application of PEFT methods in computer vision and multimodal learning. |
peft, llm, fine-tuning, parameter_efficiency, memory_efficiency, adapter, lora, prompt-tuning, prefix-tuning, analysis, literature_review, nlu, mt, nlg |
2311.03648 |
Instruct Me More! Random Prompting for Visual In-Context Learning |
Jiahao Zhang, Bowen Wang, Liangzhi Li, Yuta Nakashima, Hajime Nagahara |
Large-scale models trained on extensive datasets, have emerged as the
preferred approach due to their high generalizability across various tasks.
In-context learning (ICL), a popular strategy in natural language processing,
uses such models for different tasks by providing instructive prompts but
without updating model parameters. This idea is now being explored in computer
vision, where an input-output image pair (called an in-context pair) is
supplied to the model with a query image as a prompt to exemplify the desired
output. The efficacy of visual ICL often depends on the quality of the prompts.
We thus introduce a method coined Instruct Me More (InMeMo), which augments
in-context pairs with a learnable perturbation (prompt), to explore its
potential. Our experiments on mainstream tasks reveal that InMeMo surpasses the
current state-of-the-art performance. Specifically, compared to the baseline
without learnable prompt, InMeMo boosts mIoU scores by 7.35 and 15.13 for
foreground segmentation and single object detection tasks, respectively. Our
findings suggest that InMeMo offers a versatile and efficient way to enhance
the performance of visual ICL with lightweight training. Code is available at
https://github.com/Jackieam/InMeMo. |
This paper introduces Instruct Me More (InMeMo), a novel visual in-context learning method that enhances the performance of large-scale vision models by adding a learnable perturbation to in-context image pairs, thereby improving their instructive quality for downstream tasks like segmentation and object detection. |
This paper is important because it addresses the limitations of existing visual in-context learning approaches that heavily rely on the quality and similarity of in-context pairs to query images. By introducing a learnable prompt, InMeMo improves the performance of visual in-context learning in a lightweight and efficient manner, achieving state-of-the-art results on benchmark tasks. |
InMeMo first retrieves an in-context image pair similar to the query image. It then amends the pair with a learnable prompt enhancer module, which is trained to optimize the in-context pair for the specific downstream task. The enhanced pair and the query image are then fed into a frozen pre-trained large-scale vision model (MAE-VQGAN) to generate a prediction for the given task. The prompt enhancer is trained in a supervised manner using cross-entropy loss on visual tokens, aiming to minimize the difference between predicted and ground-truth labels. |
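A minimal sketch of a learnable border perturbation applied to the in-context pair while the large vision model stays frozen; the canvas size, pad width, and where the perturbation is applied are assumptions of this illustration.

```python
import torch
import torch.nn as nn

class PromptEnhancer(nn.Module):
    """Learnable additive perturbation restricted to the image border."""
    def __init__(self, size=112, pad=16):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1, 3, size, size))
        mask = torch.ones(1, 1, size, size)
        mask[:, :, pad:-pad, pad:-pad] = 0                 # only the border is perturbed
        self.register_buffer("mask", mask)

    def forward(self, in_context_img):
        return (in_context_img + self.mask * self.delta).clamp(0, 1)

enhancer = PromptEnhancer()
pair_img = torch.rand(1, 3, 112, 112)                      # in-context input or label image
enhanced = enhancer(pair_img)                              # fed, with the query, to the frozen model
```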
InMeMo achieves state-of-the-art results on foreground segmentation and single object detection tasks, surpassing previous visual in-context learning methods. It demonstrates robustness to domain shift and significant performance improvement even with limited training data. The paper provides extensive qualitative and quantitative results, demonstrating the efficacy of InMeMo in capturing fine-grained details and handling variations in image characteristics. |
The paper acknowledges that InMeMo requires a minimum amount of training data per class to outperform the baseline. Additionally, the learnable prompt's generalizability to unseen classes is limited, necessitating task-specific training. Future work could focus on improving the generalizability of the learnable prompt and exploring its application in other downstream tasks. |
diffusion_model, in-context learning, visual prompting, foreground segmentation, object detection, parameter-efficient transfer learning, domain shift, mae-vqgan |
2404.07984 |
View Selection for 3D Captioning via Diffusion Ranking |
Tiange Luo, Justin Johnson, Honglak Lee |
Scalable annotation approaches are crucial for constructing extensive 3D-text
datasets, facilitating a broader range of applications. However, existing
methods sometimes lead to the generation of hallucinated captions, compromising
caption quality. This paper explores the issue of hallucination in 3D object
captioning, with a focus on Cap3D method, which renders 3D objects into 2D
views for captioning using pre-trained models. We pinpoint a major challenge:
certain rendered views of 3D objects are atypical, deviating from the training
data of standard image captioning models and causing hallucinations. To tackle
this, we present DiffuRank, a method that leverages a pre-trained text-to-3D
model to assess the alignment between 3D objects and their 2D rendered views,
where the view with high alignment closely represent the object's
characteristics. By ranking all rendered views and feeding the top-ranked ones
into GPT4-Vision, we enhance the accuracy and detail of captions, enabling the
correction of 200k captions in the Cap3D dataset and extending it to 1 million
captions across Objaverse and Objaverse-XL datasets. Additionally, we showcase
the adaptability of DiffuRank by applying it to pre-trained text-to-image
models for a Visual Question Answering task, where it outperforms the CLIP
model. |
This paper tackles the issue of hallucination in 3D object captioning, particularly in the Cap3D method, by introducing DiffuRank, a technique that uses a pre-trained text-to-3D model to rank rendered 2D views of 3D objects based on their alignment with the object's characteristics, resulting in more accurate and detailed captions. |
This work is important because it addresses a key challenge in building large-scale 3D-text datasets: the generation of inaccurate captions due to the limitations of existing captioning models when presented with atypical or challenging views of 3D objects. By improving the accuracy and richness of 3D captions, this work can significantly benefit various 3D-related applications, including text-to-3D generation, image-to-3D conversion, robot learning, and 3D language model pre-training. |
The authors developed DiffuRank, an algorithm that leverages a pre-trained text-to-3D diffusion model to assess the alignment between different rendered 2D views of a 3D object and the object itself. They generated multiple captions for each view using an image captioning model and fed them into the diffusion model alongside the 3D object's features. By ranking the views based on their average score (loss) in the diffusion model, they identified the views that best represent the object's 3D information. These top-ranked views were then passed to GPT4-Vision for generating the final captions. |
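The ranking step can be sketched as scoring each rendered view by the average denoising loss a pretrained text-to-3D diffusion model assigns to that view's captions, then keeping the best-aligned views; the `diffusion_loss` callable, the caption count, and `top_k` are assumptions of this illustration.

```python
import torch

def diffurank_select(views, captions_per_view, diffusion_loss, top_k=6):
    """Rank rendered views of one object by average diffusion loss over their captions
    (lower average loss = better alignment with the underlying 3D object)."""
    scores = []
    for view, captions in zip(views, captions_per_view):
        losses = torch.tensor([diffusion_loss(view, c) for c in captions])
        scores.append(losses.mean())
    order = torch.argsort(torch.stack(scores))             # ascending: best views first
    return [views[int(i)] for i in order[:top_k]]

# Toy usage with a random stand-in for the diffusion loss.
best = diffurank_select(
    views=[f"view_{i}.png" for i in range(8)],
    captions_per_view=[["a chair", "a wooden chair"]] * 8,
    diffusion_loss=lambda v, c: float(torch.rand(())),
    top_k=3)
```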
The authors demonstrate that DiffuRank, in conjunction with GPT4-Vision, significantly improves the quality of captions for 3D objects. Key findings include: (1) DiffuRank effectively reduces hallucinations in captions, as evidenced by human studies and automated metrics. (2) Captions generated using DiffuRank are richer in detail and more accurate compared to those produced using all rendered views or a fixed set of horizontally placed views. (3) Using fewer but more informative views selected by DiffuRank can lead to better captions than using a large number of views indiscriminately. (4) DiffuRank can be extended to 2D domains and has shown promising results in Visual Question Answering tasks, outperforming CLIP on a challenging benchmark. |
The authors acknowledge the limitations of DiffuRank, particularly its computational cost due to the need for rendering multiple views, generating captions for each view, and running inference through a diffusion model. The speed of DiffuRank is a bottleneck, especially for tasks involving numerous options, such as classification or image-text retrieval. Future work could focus on improving the efficiency of DiffuRank to make it more scalable for such tasks. Additionally, the authors suggest exploring the use of even more powerful text-to-3D and captioning models to further enhance the accuracy and detail of the generated captions. Expanding the dataset to encompass all of Objaverse-XL is another avenue for future work. |
diffusion_model, llm, 3d, captioning, hallucination, view_selection, dataset, objaverse, gpt4-vision, visual_question_answering |
2404.02285 |
LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP |
Yunshi Huang, Fereshteh Shakeri, Jose Dolz, Malik Boudiaf, Houda Bahig, Ismail Ben Ayed |
In a recent, strongly emergent literature on few-shot CLIP adaptation, Linear
Probe (LP) has been often reported as a weak baseline. This has motivated
intensive research building convoluted prompt learning or feature adaptation
strategies. In this work, we propose and examine from convex-optimization
perspectives a generalization of the standard LP baseline, in which the linear
classifier weights are learnable functions of the text embedding, with
class-wise multipliers blending image and text knowledge. As our objective
function depends on two types of variables, i.e., the class visual prototypes
and the learnable blending parameters, we propose a computationally efficient
block coordinate Majorize-Minimize (MM) descent algorithm. In our full-batch MM
optimizer, which we coin LP++, step sizes are implicit, unlike standard
gradient descent practices where learning rates are intensively searched over
validation sets. By examining the mathematical properties of our loss (e.g.,
Lipschitz gradient continuity), we build majorizing functions yielding
data-driven learning rates and derive approximations of the loss's minima,
which provide data-informed initialization of the variables. Our image-language
objective function, along with these non-trivial optimization insights and
ingredients, yields, surprisingly, highly competitive few-shot CLIP
performances. Furthermore, LP++ operates in black-box, relaxes intensive
validation searches for the optimization hyper-parameters, and runs
orders-of-magnitudes faster than state-of-the-art few-shot CLIP adaptation
methods. Our code is available at:
\url{https://github.com/FereshteShakeri/FewShot-CLIP-Strong-Baseline.git}. |
The paper introduces LP++, a novel method for few-shot CLIP adaptation that significantly improves upon the standard linear probe (LP) baseline by incorporating text embeddings via learnable class-wise blending parameters, leading to a surprising improvement in performance. |
This paper is important as it challenges the established notion that LP is a weak baseline in few-shot CLIP adaptation. LP++ demonstrates that a simple, efficient, and black-box approach can achieve state-of-the-art results, outperforming more complex methods like prompt learning and adapters while being computationally efficient and not requiring access to internal representations of pre-trained models. |
The authors propose a block coordinate Majorize-Minimize (MM) descent algorithm for optimizing a cross-entropy objective function, with data-driven learning rates derived from approximate Lipschitz constants, eliminating the need for extensive hyper-parameter search. Furthermore, they leverage insights from convex optimization to derive approximations of the loss function's minima, leading to data-informed initialization of the variables. |
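The classifier itself is compact: class weights are the visual prototypes plus class-wise multipliers times the text embeddings, trained with a full-batch cross-entropy. The sketch below uses plain fixed-step descent as a stand-in for the paper's Majorize-Minimize updates and data-informed initialization, which are not reproduced here.

```python
import torch
import torch.nn.functional as F

def lp_pp_logits(img_feats, prototypes, text_embeds, alphas):
    """Logits from blended class weights w_c = v_c + alpha_c * t_c.
    img_feats: (N, d), prototypes: (C, d), text_embeds: (C, d), alphas: (C,)"""
    return img_feats @ (prototypes + alphas.unsqueeze(1) * text_embeds).T

# Toy few-shot setup: 16 support images, 10 classes, 512-d CLIP features.
N, C, d = 16, 10, 512
feats = F.normalize(torch.randn(N, d), dim=-1)
labels = torch.randint(0, C, (N,))
text = F.normalize(torch.randn(C, d), dim=-1)
protos = torch.zeros(C, d, requires_grad=True)    # visual prototypes (first variable block)
alphas = torch.ones(C, requires_grad=True)        # blending multipliers (second variable block)
for _ in range(100):                              # full-batch descent with fixed steps
    loss = F.cross_entropy(lp_pp_logits(feats, protos, text, alphas), labels)
    g_p, g_a = torch.autograd.grad(loss, [protos, alphas])
    with torch.no_grad():
        protos -= 0.1 * g_p
        alphas -= 0.1 * g_a
```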
LP++ consistently outperforms the standard LP baseline and achieves competitive performance compared to state-of-the-art few-shot CLIP adaptation methods, particularly in low-shot scenarios. It runs orders of magnitude faster than prompt learning methods and avoids the need for intensive hyper-parameter tuning characteristic of adapter-based approaches. Furthermore, LP++ enables black-box adaptation, making it suitable for real-world, privacy-preserving situations where access to model internals is restricted. |
The paper does not explicitly mention limitations or future work. However, potential future work could explore: (1) Applying LP++ to other vision-language tasks beyond image classification. (2) Investigating the impact of different text prompt designs and how to learn them in a data-driven manner. (3) Exploring different block-cycling strategies within the BMM procedure to further improve efficiency. (4) Investigating theoretical guarantees of convergence for LP++ under specific conditions. |
diffusion_model, llm, analysis, few-shot learning, clip, optimization, black-box, linear_probe, image_classification |
2312.09323 |
Perspectives on the State and Future of Deep Learning - 2023 |
Micah Goldblum, Anima Anandkumar, Richard Baraniuk, Tom Goldstein, Kyunghyun Cho, Zachary C Lipton, Melanie Mitchell, Preetum Nakkiran, Max Welling, Andrew Gordon Wilson |
The goal of this series is to chronicle opinions and issues in the field of
machine learning as they stand today and as they change over time. The plan is
to host this survey periodically until the AI singularity
paperclip-frenzy-driven doomsday, keeping an updated list of topical questions
and interviewing new community members for each edition. In this issue, we
probed people's opinions on interpretable AI, the value of benchmarking in
modern NLP, the state of progress towards understanding deep learning, and the
future of academia. |
This paper presents a collection of opinions from prominent machine learning researchers on the current state and future directions of the field, covering topics like interpretability, benchmarking, the limitations of current paradigms, and the role of academia. |
This paper offers valuable insights into the minds of leading experts in machine learning, highlighting key challenges and opportunities that are shaping the field's trajectory. It provides a glimpse into the future of AI research and its potential impact. |
The authors conducted a survey, presenting a series of open-ended questions to prominent figures in the machine learning community. The interviewees provided their individual perspectives and insights on each topic. |
Some key findings include a consensus that current benchmarking practices are inadequate for capturing complex model behaviors like common sense. There's also debate on the interpretability of deep learning models, with some believing in its eventual achievement and others expressing skepticism. Additionally, researchers emphasize the need to move beyond scaling existing models and focus on developing new learning paradigms with stronger inductive biases. |
The paper acknowledges the limitations of current deep learning approaches, particularly concerning data efficiency and the lack of robust theoretical understanding. It suggests exploring alternative architectures, integrating planning into learning algorithms, and emphasizing multimodal learning as promising future directions. |
analysis, llm, interpretability, benchmarking, deep_learning, transformers, future_of_ai |
2404.05014 |
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators |
Shenghai Yuan, Jinfa Huang, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo |
Recent advances in Text-to-Video generation (T2V) have achieved remarkable
success in synthesizing high-quality general videos from textual descriptions.
A largely overlooked problem in T2V is that existing models have not adequately
encoded physical knowledge of the real world, thus generated videos tend to
have limited motion and poor variations. In this paper, we propose
\textbf{MagicTime}, a metamorphic time-lapse video generation model, which
learns real-world physics knowledge from time-lapse videos and implements
metamorphic generation. First, we design a MagicAdapter scheme to decouple
spatial and temporal training, encode more physical knowledge from metamorphic
videos, and transform pre-trained T2V models to generate metamorphic videos.
Second, we introduce a Dynamic Frames Extraction strategy to adapt to
metamorphic time-lapse videos, which have a wider variation range and cover
dramatic object metamorphic processes, thus embodying more physical knowledge
than general videos. Finally, we introduce a Magic Text-Encoder to improve the
understanding of metamorphic video prompts. Furthermore, we create a time-lapse
video-text dataset called \textbf{ChronoMagic}, specifically curated to unlock
the metamorphic video generation ability. Extensive experiments demonstrate the
superiority and effectiveness of MagicTime for generating high-quality and
dynamic metamorphic videos, suggesting time-lapse video generation is a
promising path toward building metamorphic simulators of the physical world. |
This paper introduces MagicTime, a novel approach for generating metamorphic time-lapse videos by incorporating physical knowledge into text-to-video generation models. It leverages time-lapse videos, which capture complete object transformations, to enhance the model's understanding of real-world physics and enable the generation of videos depicting complex phenomena like melting, blooming, or construction. |
This paper is important because it addresses a significant limitation in current text-to-video generation models: the lack of encoding of real-world physical knowledge. This limitation restricts these models to generating videos with simple motions and limits their ability to depict complex, transformative processes. MagicTime tackles this issue by incorporating time-lapse video data and specialized training strategies, paving the way for more realistic and dynamic video generation. |
The authors propose MagicTime, a framework that modifies pre-trained text-to-video diffusion models to generate metamorphic time-lapse videos. Key components include: 1) MagicAdapter: decouples spatial and temporal training to encode physical knowledge from metamorphic videos, 2) Dynamic Frames Extraction: adapts to the characteristics of time-lapse videos and prioritizes metamorphic features, and 3) Magic Text-Encoder: refines prompt understanding for metamorphic videos. Additionally, the authors create ChronoMagic, a new dataset of time-lapse videos with detailed captions, to train and evaluate MagicTime. |
MagicTime generates high-quality metamorphic videos that capture complex transformations and align with textual prompts. It outperforms existing text-to-video generation methods in both qualitative and quantitative evaluations, demonstrating superior visual quality, frame consistency, and text alignment. The authors also conduct ablation studies to validate the contribution of each component in MagicTime. |
The authors acknowledge limitations in evaluating generative models for metamorphic videos due to the lack of established metrics beyond FID, FVD, and CLIP Similarity. They plan to investigate more comprehensive evaluation metrics in future work. Additionally, the authors are exploring the integration of MagicTime with DiT-based architectures, such as Open-Sora-Plan, to further enhance metamorphic video generation capabilities. |
diffusion_model, video, generation, time-lapse, metamorphic, physics, dataset, magictime, chronomagic |
2403.18103 |
Tutorial on Diffusion Models for Imaging and Vision |
Stanley H. Chan |
The astonishing growth of generative tools in recent years has empowered many
exciting applications in text-to-image generation and text-to-video generation.
The underlying principle behind these generative tools is the concept of
diffusion, a particular sampling mechanism that has overcome some shortcomings
that were deemed difficult in the previous approaches. The goal of this
tutorial is to discuss the essential ideas underlying the diffusion models. The
target audience of this tutorial includes undergraduate and graduate students
who are interested in doing research on diffusion models or applying these
models to solve other problems. |
This tutorial provides a comprehensive overview of diffusion models for imaging and vision, focusing on the core concepts and mathematical foundations behind these models, such as Variational Autoencoders (VAEs), Denoising Diffusion Probabilistic Models (DDPMs), Score-Matching Langevin Dynamics (SMLDs), and Stochastic Differential Equations (SDEs). |
Diffusion models have revolutionized generative AI, enabling remarkable applications in text-to-image and text-to-video generation. This tutorial is crucial for understanding the inner workings of these models and for researchers and students aiming to contribute to this burgeoning field or apply diffusion models in various domains. |
The paper employs a step-by-step approach, beginning with the fundamentals of VAEs and progressively introducing more sophisticated concepts like DDPMs, SMLDs, and SDEs. Each section offers clear explanations, illustrative examples, mathematical derivations, and connections between different perspectives. The paper also discusses training and inference procedures for each model, highlighting the role of denoisers, score functions, and noise schedules. |
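As a concrete companion to the DDPM portion of the tutorial, the sketch below shows the closed-form forward noising step and the epsilon-prediction (denoising score matching) training loss. The linear beta schedule and the tiny MLP denoiser are illustrative choices, not anything prescribed by the tutorial.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                  # illustrative linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)         # \bar{alpha}_t

denoiser = nn.Sequential(nn.Linear(2 + 1, 64), nn.ReLU(), nn.Linear(64, 2))  # toy MLP

def ddpm_loss(x0: torch.Tensor) -> torch.Tensor:
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].unsqueeze(-1)
    # Forward process in closed form: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    eps_hat = denoiser(torch.cat([x_t, t.float().unsqueeze(-1) / T], dim=-1))
    # Epsilon prediction corresponds to denoising score matching:
    # the score of x_t given x_0 equals -eps / sqrt(1 - abar_t).
    return ((eps_hat - eps) ** 2).mean()

loss = ddpm_loss(torch.randn(8, 2))
loss.backward()
```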
The tutorial effectively elucidates that diffusion models achieve their remarkable performance through incremental updates, gradually transforming noise into coherent data samples. The equivalence between denoising score matching and explicit score matching is a key result, justifying the use of denoisers in diffusion models. The connection between discrete-time diffusion iterations and continuous-time SDEs provides a unifying framework for analyzing and comparing different diffusion models. |
The tutorial points out that while iterative denoising is currently dominant, it may not be the definitive solution for image generation. Future research could explore more biologically plausible generative processes and address the computational cost associated with diffusion models. The justification for using non-Gaussian noise distributions is also a potential area for investigation. |
diffusion_model, vae, ddpm, smld, sde, analysis, tutorial, image_generation, denoising |
2311.13127 |
MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning |
Yixin Liu, Chenrui Fan, Yutong Dai, Xun Chen, Pan Zhou, Lichao Sun |
Text-to-image diffusion models allow seamless generation of personalized
images from scant reference photos. Yet, these tools, in the wrong hands, can
fabricate misleading or harmful content, endangering individuals. To address
this problem, existing poisoning-based approaches perturb user images in an
imperceptible way to render them "unlearnable" from malicious uses. We identify
two limitations of these defending approaches: i) sub-optimal due to the
hand-crafted heuristics for solving the intractable bilevel optimization and
ii) lack of robustness against simple data transformations like Gaussian
filtering. To solve these challenges, we propose MetaCloak, which solves the
bi-level poisoning problem with a meta-learning framework with an additional
transformation sampling process to craft transferable and robust perturbation.
Specifically, we employ a pool of surrogate diffusion models to craft
transferable and model-agnostic perturbation. Furthermore, by incorporating an
additional transformation process, we design a simple denoising-error
maximization loss that is sufficient for causing transformation-robust semantic
distortion and degradation in a personalized generation. Extensive experiments
on the VGGFace2 and CelebA-HQ datasets show that MetaCloak outperforms existing
approaches. Notably, MetaCloak can successfully fool online training services
like Replicate, in a black-box manner, demonstrating the effectiveness of
MetaCloak in real-world scenarios. Our code is available at
https://github.com/liuyixin-louis/MetaCloak. |
This paper presents MetaCloak, a novel method for protecting user images from unauthorized personalized image generation using DreamBooth by crafting robust perturbations that can withstand data transformations. |
The paper addresses the growing privacy concern of unauthorized use of personal images for AI-generated content, specifically targeting the vulnerabilities of personalized diffusion models like DreamBooth. |
The authors propose a meta-learning framework to craft transferable and model-agnostic perturbations by training over a pool of surrogate diffusion models. To enhance robustness against data transformations, they incorporate a transformation sampling process during perturbation crafting and utilize a denoising-error maximization loss to introduce semantic distortion. |
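The following is an illustrative PGD-style sketch of the crafting loop described above, not the authors' implementation: a perturbation is updated to maximize the denoising error of randomly chosen surrogate models under randomly sampled transformations, within an L-infinity budget. ToySurrogate, the blur transformation, and all hyperparameters are stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySurrogate(nn.Module):
    """Stand-in for a diffusion model's noise predictor."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x_noisy):
        return self.net(x_noisy)

def random_transform(x):
    # Stand-in for transformation sampling (e.g., Gaussian filtering / resizing).
    return F.avg_pool2d(x, 3, stride=1, padding=1) if torch.rand(()) < 0.5 else x

def craft_perturbation(image, surrogates, steps=20, eps=8 / 255, step_size=1 / 255):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        model = surrogates[torch.randint(len(surrogates), (1,)).item()]
        noise = torch.randn_like(image)
        x = random_transform((image + delta).clamp(0, 1)) + noise
        loss = F.mse_loss(model(x), noise)          # denoising error to be maximized
        loss.backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()  # gradient ascent on the error
            delta.clamp_(-eps, eps)                 # keep the perturbation imperceptible
            delta.grad.zero_()
    return delta.detach()

image = torch.rand(1, 3, 64, 64)
delta = craft_perturbation(image, [ToySurrogate() for _ in range(3)])
```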
MetaCloak outperforms existing methods in protecting images under both standard training and training with data transformations, as evidenced by quantitative metrics and qualitative visualizations. It effectively degrades subject detection scores, semantic similarity, and generated image quality. Notably, MetaCloak demonstrates effectiveness in real-world scenarios by successfully fooling online training services like Replicate. |
The paper acknowledges limitations in terms of potential vulnerability to advanced adversarial purification techniques and reduced effectiveness under low poisoning ratios. Future work suggestions include investigating mechanisms to further improve stealthiness, particularly under large perturbation radii, and exploring methods for effective protection under low poisoning rates. |
diffusion_model, adversarial_attack, data_protection, privacy, dreambooth, poisoning_attack |
2402.18956 |
WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts |
Yong Hyun Ahn, Hyeon Bae Kim, Seong Tae Kim |
Recent advancements in neural networks have showcased their remarkable
capabilities across various domains. Despite these successes, the "black box"
problem still remains. Addressing this, we propose a novel framework, WWW, that
offers the 'what', 'where', and 'why' of the neural network decisions in
human-understandable terms. Specifically, WWW utilizes adaptive selection for
concept discovery, employing adaptive cosine similarity and thresholding
techniques to effectively explain 'what'. To address the 'where' and 'why', we
proposed a novel combination of neuron activation maps (NAMs) with Shapley
values, generating localized concept maps and heatmaps for individual inputs.
Furthermore, WWW introduces a method for predicting uncertainty, leveraging
heatmap similarities to estimate 'how' reliable the prediction is. Experimental
evaluations of WWW demonstrate superior performance in both quantitative and
qualitative metrics, outperforming existing methods in interpretability. WWW
provides a unified solution for explaining 'what', 'where', and 'why',
introducing a method for localized explanations from global interpretations and
offering a plug-and-play solution adaptable to various architectures. |
This paper introduces WWW, a novel framework designed to explain neural network decisions by revealing 'what' concept a neuron represents, 'where' in the input image the concept is located, and 'why' the concept contributes to the prediction. |
The paper addresses the "black box" problem in neural networks, aiming to make their decision-making process more transparent and understandable to humans. This is crucial for building trust and reliability in AI systems, especially given the increasing demand for explainable AI in various domains. |
WWW comprises three modules: 1) Concept Discovery identifies concepts represented by each neuron using adaptive cosine similarity and adaptive selection. 2) Localization identifies relevant input regions for each concept by combining neuron activation maps with Shapley values. 3) Reasoning identifies important neurons for both the predicted class and the specific input sample, highlighting differences to understand prediction reliability. |
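A toy sketch of the first module's selection rule, under the assumption that both the neuron and the candidate concepts are represented as embedding vectors: concepts are ranked by cosine similarity and kept only if they clear an adaptive mean-plus-k-standard-deviations threshold. The embedding construction and the exact thresholding in the paper differ in detail.

```python
import torch
import torch.nn.functional as F

def discover_concepts(neuron_embedding, concept_embeddings, concept_names, k=2.0):
    # Cosine similarity between one neuron's embedding and every candidate concept.
    sims = F.cosine_similarity(neuron_embedding.unsqueeze(0), concept_embeddings, dim=-1)
    threshold = sims.mean() + k * sims.std()            # adaptive selection threshold
    keep = (sims >= threshold).nonzero(as_tuple=True)[0]
    return [(concept_names[i], sims[i].item()) for i in keep]

names = [f"concept_{i}" for i in range(100)]
print(discover_concepts(torch.randn(256), torch.randn(100, 256), names))
```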
WWW demonstrates superior performance in both qualitative and quantitative evaluations. It outperforms existing methods in accurately identifying neuron concepts, particularly with larger concept sets. The paper also shows that heatmap similarity, derived from the framework, can be a more effective measure of prediction uncertainty compared to maximum softmax probability. |
The paper acknowledges limitations in accurately identifying neuron concepts when only a few example images are available. Future work will focus on improving concept discovery by exploring different example selection strategies and concept representations. Another direction is exploring the use of heatmap similarity for misprediction detection and model improvement. |
interpretability, explanation, neural network, concept discovery, shapley value, neuron activation map, heatmap, uncertainty, analysis |
2308.08947 |
Watch Your Steps: Local Image and Scene Editing by Text Instructions |
Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski |
Denoising diffusion models have enabled high-quality image generation and
editing. We present a method to localize the desired edit region implicit in a
text instruction. We leverage InstructPix2Pix (IP2P) and identify the
discrepancy between IP2P predictions with and without the instruction. This
discrepancy is referred to as the relevance map. The relevance map conveys the
importance of changing each pixel to achieve the edits, and is used to guide
the modifications. This guidance ensures that the irrelevant pixels remain
unchanged. Relevance maps are further used to enhance the quality of
text-guided editing of 3D scenes in the form of neural radiance fields. A field
is trained on relevance maps of training views, denoted as the relevance field,
defining the 3D region within which modifications should be made. We perform
iterative updates on the training views guided by rendered relevance maps from
the relevance field. Our method achieves state-of-the-art performance on both
image and NeRF editing tasks. Project page:
https://ashmrz.github.io/WatchYourSteps/ |
This paper presents a method for localizing image and scene edits by leveraging the discrepancy between noise predictions of a diffusion-based image editor with and without text instructions, resulting in a relevance map to guide the editing process. |
This paper addresses the limitations of existing diffusion-based image editors, particularly their tendency to over-edit. By introducing relevance maps, the method allows for precise control over the editing process, preserving irrelevant regions while ensuring the desired changes are applied effectively to both images and 3D scenes represented as neural radiance fields. |
The authors propose a relevance map calculation by measuring the difference between noise predictions from InstructPix2Pix (IP2P) with and without the edit instruction. This map, after binarization, guides the IP2P denoising process to confine edits within the relevant region. For 3D scene editing, a relevance field is trained on relevance maps of training views to maintain 3D consistency, guiding iterative updates on the scene. |
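A minimal sketch of the relevance-map computation described above, assuming access to an InstructPix2Pix-style noise predictor (here a placeholder argument `predict_noise`): the per-pixel discrepancy between conditional and unconditional noise predictions is normalized and thresholded into an edit mask. The normalization and threshold are simplifications of the paper's procedure.

```python
import torch

def relevance_map(predict_noise, x_t, t, image_cond, instruction, null_instruction, tau=0.5):
    eps_edit = predict_noise(x_t, t, image_cond, instruction)
    eps_null = predict_noise(x_t, t, image_cond, null_instruction)
    rel = (eps_edit - eps_null).abs().mean(dim=1, keepdim=True)   # per-pixel discrepancy
    rel = (rel - rel.amin()) / (rel.amax() - rel.amin() + 1e-8)   # normalize to [0, 1]
    mask = (rel > tau).float()                                    # binarized edit region
    return rel, mask

# During guided denoising the mask confines the edit:
# x_pred = mask * x_edited + (1 - mask) * x_original
```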
The method demonstrates state-of-the-art performance in both image and NeRF editing tasks. It outperforms baselines in preserving image consistency while achieving comparable edit quality. The relevance maps effectively guide the editing process, preventing over-editing and ensuring the edits are applied to the desired regions. The method produces sharper and higher-quality results compared to previous approaches, particularly in the context of NeRF editing. |
The authors acknowledge the method's reliance on IP2P, inheriting its limitations. Cases where IP2P fails to interpret the instruction or localize the edit properly pose challenges. Future work could explore better instruction-conditioned diffusion models and address ambiguities in localizing edits for broader applications. |
diffusion_model, image_editing, 3d, nerf, relevance_map, text-guided, scene_editing, localization |
2403.11027 |
Reward Guided Latent Consistency Distillation |
Jiachen Li, Weixi Feng, Wenhu Chen, William Yang Wang |
Latent Consistency Distillation (LCD) has emerged as a promising paradigm for
efficient text-to-image synthesis. By distilling a latent consistency model
(LCM) from a pre-trained teacher latent diffusion model (LDM), LCD facilitates
the generation of high-fidelity images within merely 2 to 4 inference steps.
However, the LCM's efficient inference is obtained at the cost of the sample
quality. In this paper, we propose compensating the quality loss by aligning
LCM's output with human preference during training. Specifically, we introduce
Reward Guided LCD (RG-LCD), which integrates feedback from a reward model (RM)
into the LCD process by augmenting the original LCD loss with the objective of
maximizing the reward associated with LCM's single-step generation. As
validated through human evaluation, when trained with the feedback of a good
RM, the 2-step generations from our RG-LCM are favored by humans over the
50-step DDIM samples from the teacher LDM, representing a 25 times inference
acceleration without quality loss.
As directly optimizing towards differentiable RMs can suffer from
over-optimization, we overcome this difficulty by proposing the use of a latent
proxy RM (LRM). This novel component serves as an intermediary, connecting our
LCM with the RM. Empirically, we demonstrate that incorporating the LRM into
our RG-LCD successfully avoids high-frequency noise in the generated images,
contributing to both improved FID on MS-COCO and a higher HPSv2.1 score on
HPSv2's test set, surpassing those achieved by the baseline LCM. |
This paper introduces Reward Guided Latent Consistency Distillation (RG-LCD), a method for enhancing the efficiency and quality of text-to-image synthesis by incorporating feedback from a reward model (RM) into the Latent Consistency Distillation (LCD) process. |
This paper is important because it addresses the limitations of current Latent Consistency Models (LCMs) for text-to-image synthesis, which prioritize inference speed over sample quality. By integrating human preference through RMs, RG-LCD improves LCMs' generated image quality without sacrificing inference speed. |
The authors propose RG-LCD, which integrates feedback from a differentiable RM into the LCD process by augmenting the original LCD loss with a reward maximization objective. To avoid reward over-optimization, they introduce a latent proxy RM (LRM) that connects the LCM to the RM, enabling indirect optimization of the expert RM and allowing learning from non-differentiable RMs. They conduct experiments using different RMs (CLIPScore, HPSv2.1, PickScore, ImageReward) and evaluate the generated images with human evaluation and automatic metrics like HPSv2.1 score and FID. |
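Schematically, the augmented objective can be written as below; this is a sketch with placeholder callables (`lcd_loss`, `decode`, `reward_model`), not the released training code, and it omits the latent proxy RM that the paper adds to curb over-optimization.

```python
def rg_lcd_loss(lcd_loss, reward_model, decode, student_latent, target_latent, beta=1.0):
    # Standard consistency-distillation term.
    distill = lcd_loss(student_latent, target_latent)
    # Reward of the student's single-step generation; maximizing the reward
    # is expressed by subtracting it from the loss.
    reward = reward_model(decode(student_latent)).mean()
    return distill - beta * reward
```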
Human evaluation shows that the 2-step generations from RG-LCM (HPS) are preferred over the 50-step DDIM generations from the teacher LDM, indicating a 25x speedup without quality loss. RG-LCM (CLIP), despite using a non-preference-trained RM, also outperforms the teacher LDM in 4-step generations. The study found that using an LRM effectively mitigates reward over-optimization, leading to more visually appealing images and addressing the high-frequency noise issue observed when directly optimizing for certain RMs like ImageReward. Interestingly, the results also reveal discrepancies between human preferences and automatic metric scores, suggesting current metrics like HPSv2.1 may not fully capture human preferences, particularly concerning high-frequency noise due to the use of image resizing during evaluation. |
The authors acknowledge limitations in existing automatic metrics for evaluating image quality and call for the development of more robust metrics that eliminate image resizing in their evaluation process. They also suggest exploring the use of LRMs to learn human preferences directly in the latent space as a potential solution. Future work could involve investigating alternative LRM architectures, exploring different reward models and datasets, and applying RG-LCD to other generative modeling tasks beyond text-to-image synthesis. |
diffusion_model, consistency_distillation, text-to-image, reward_model, image_generation, inference_acceleration, latent_space |
2311.10770 |
Exponentially Faster Language Modelling |
Peter Belcak, Roger Wattenhofer |
Language models only really need to use an exponential fraction of their
neurons for individual inferences. As proof, we present UltraFastBERT, a BERT
variant that uses 0.3% of its neurons during inference while performing on par
with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095
neurons for each layer inference. This is achieved by replacing feedforward
networks with fast feedforward networks (FFFs). While no truly efficient
implementation currently exists to unlock the full acceleration potential of
conditional neural execution, we provide high-level CPU code achieving 78x
speedup over the optimized baseline feedforward implementation, and a PyTorch
implementation delivering 40x speedup over the equivalent batched feedforward
inference. We publish our training code, benchmarking setup, and model weights. |
This paper introduces UltraFastBERT, a variant of the BERT language model that replaces standard feedforward networks with fast feedforward networks (FFFs). UltraFastBERT achieves comparable performance to BERT on downstream tasks while using only a small fraction (0.3%) of its neurons for each inference. |
This work is significant because it demonstrates the potential of conditional neural execution for significant speed improvements in large language models. By showing that only a small portion of neurons are necessary for individual inferences, it challenges the current paradigm of dense computation in these models and opens the door for more efficient implementations. |
The authors developed UltraFastBERT by replacing the feedforward layers in crammedBERT with FFFs, organizing neurons into a binary tree and conditionally activating only one branch per inference. They trained various UltraFastBERT configurations on the GLUE benchmark, comparing their performance against BERT-base and crammedBERT. They also implemented and evaluated different CPU and GPU inference implementations to assess the speedup from using FFFs. |
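To illustrate the conditional computation at the heart of FFFs, here is a simplified per-example sketch: only the neurons along one root-to-leaf path of a binary tree are evaluated, with routing decided by the sign of each node's activation. The activation function, output mixing, and initialization are simplifications relative to the paper's fast feedforward networks.

```python
import torch

class TinyFFF:
    def __init__(self, dim, depth=3):
        n_nodes = 2 ** depth - 1
        self.w_in = torch.randn(n_nodes, dim) / dim ** 0.5   # one input vector per node
        self.w_out = torch.randn(n_nodes, dim) / dim ** 0.5  # one output vector per node
        self.depth = depth

    def forward(self, x):                      # x: (dim,) single example
        y = torch.zeros_like(x)
        node = 0
        for _ in range(self.depth):            # evaluate depth neurons, not all 2^depth - 1
            act = torch.dot(self.w_in[node], x)
            y = y + torch.relu(act) * self.w_out[node]
            node = 2 * node + (1 if act > 0 else 2)  # route to left or right child
        return y

layer = TinyFFF(dim=16)
out = layer.forward(torch.randn(16))
```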
UltraFastBERT achieved comparable performance to BERT-base on the GLUE benchmark, retaining at least 96% of its performance while using only 0.3% of the neurons for inference. The naive implementation of conditional matrix multiplication (CMM) in FFFs resulted in a speedup of up to 78x on CPUs over standard feedforward layers. While a fully optimized CMM implementation is not yet available, the results highlight the potential for significant speed improvements in language modeling. |
The authors acknowledge the limitations in the current implementation of CMM, which relies on high-level linear algebra routines and lacks support for efficient vector-level sparsity. Future work includes developing native and optimized implementations of CMM for both CPUs and GPUs, potentially by introducing hybrid vector-level sparse tensors in deep learning libraries and dedicated device programming interfaces. This would enable fully realizing the potential speedup demonstrated by UltraFastBERT. |
llm, bert, analysis, performance, optimization, conditional_computation |
2404.15653 |
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data |
Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari |
Contrastive learning has emerged as a transformative method for learning
effective visual representations through the alignment of image and text
embeddings. However, pairwise similarity computation in contrastive loss
between image and text pairs poses computational challenges. This paper
presents a novel weakly supervised pre-training of vision models on web-scale
image-text data. The proposed method reframes pre-training on image-text data
as a classification task. Consequently, it eliminates the need for pairwise
similarity computations in contrastive loss, achieving a remarkable $2.7\times$
acceleration in training speed compared to contrastive learning on web-scale
data. Through extensive experiments spanning diverse vision tasks, including
detection and segmentation, we demonstrate that the proposed method maintains
high representation quality. Our source code along with pre-trained model
weights and training recipes is available at
\url{https://github.com/apple/corenet}. |
This paper introduces CatLIP, a novel weakly supervised approach for pre-training vision models on web-scale image-text data by reframing it as a classification task, achieving a 2.7x speedup over contrastive learning methods like CLIP while maintaining comparable downstream performance. |
This paper is important because it addresses the computational bottleneck of contrastive learning in image-text pre-training, making it significantly faster and more efficient without compromising accuracy. This is crucial for enabling wider access to and faster research in large-scale pre-training. |
The authors extracted nouns from text captions, mapped them to WordNet synsets, and trained vision models using binary cross-entropy loss, essentially treating pre-training as a multi-label classification problem. They experimented with various ViT backbones, scaling data and models, and compared their method to CLIP on downstream tasks like image classification, multi-label classification, semantic segmentation, and object detection. |
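A rough sketch of the caption-to-label step under the stated assumptions: nouns in the caption are mapped to WordNet synsets to form a multi-hot target, and the image encoder is trained with binary cross-entropy. It relies on NLTK's WordNet interface (the corpus must be downloaded first), skips proper part-of-speech tagging by querying only noun synsets, and the tiny vocabulary and encoder are placeholders rather than the paper's setup.

```python
import torch
import torch.nn as nn
from nltk.corpus import wordnet  # requires nltk.download("wordnet") beforehand

def caption_to_target(caption, synset_to_index):
    target = torch.zeros(len(synset_to_index))
    for word in caption.lower().split():
        # Approximate noun extraction by querying only noun senses of each word.
        for syn in wordnet.synsets(word, pos=wordnet.NOUN):
            if syn.name() in synset_to_index:
                target[synset_to_index[syn.name()]] = 1.0
    return target

vocab = {name: i for i, name in enumerate(["dog.n.01", "beach.n.01", "ball.n.01"])}
target = caption_to_target("a dog chasing a ball on the beach", vocab)

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, len(vocab)))  # toy "vision model"
logits = encoder(torch.rand(1, 3, 32, 32))
loss = nn.functional.binary_cross_entropy_with_logits(logits, target.unsqueeze(0))
```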
Key findings include: (1) CatLIP is 2.7x faster than CLIP while achieving comparable accuracy. (2) Scaling data and model size in CatLIP improves downstream performance. (3) CatLIP enables data-efficient transfer learning by leveraging the pre-trained classifier for initialization. (4) CatLIP generalizes well to complex visual tasks like multi-label classification, semantic segmentation, and object detection, demonstrating the quality of learned representations. |
The paper acknowledges that while CatLIP achieves promising results, the performance of the largest ViT model starts to saturate on larger datasets, suggesting potential limitations in scaling. Future work could explore longer training, leveraging even larger datasets, or incorporating techniques from contrastive learning to further improve CatLIP's performance. |
analysis, image_classification, multi-label_classification, semantic_segmentation, object_detection, weakly_supervised_learning, pre-training, vision_transformer, data_efficiency, web-scale_data |
2404.05595 |
UniFL: Improve Stable Diffusion via Unified Feedback Learning |
Jiacheng Zhang, Jie Wu, Yuxi Ren, Xin Xia, Huafeng Kuang, Pan Xie, Jiashi Li, Xuefeng Xiao, Weilin Huang, Min Zheng, Lean Fu, Guanbin Li |
Diffusion models have revolutionized the field of image generation, leading
to the proliferation of high-quality models and diverse downstream
applications. However, despite these significant advancements, the current
competitive solutions still suffer from several limitations, including inferior
visual quality, a lack of aesthetic appeal, and inefficient inference, without
a comprehensive solution in sight. To address these challenges, we present
UniFL, a unified framework that leverages feedback learning to enhance
diffusion models comprehensively. UniFL stands out as a universal, effective,
and generalizable solution applicable to various diffusion models, such as
SD1.5 and SDXL. Notably, UniFL incorporates three key components: perceptual
feedback learning, which enhances visual quality; decoupled feedback learning,
which improves aesthetic appeal; and adversarial feedback learning, which
optimizes inference speed. In-depth experiments and extensive user studies
validate the superior performance of our proposed method in enhancing both the
quality of generated models and their acceleration. For instance, UniFL
surpasses ImageReward by 17% user preference in terms of generation quality and
outperforms LCM and SDXL Turbo by 57% and 20% in 4-step inference. Moreover, we
have verified the efficacy of our approach in downstream tasks, including Lora,
ControlNet, and AnimateDiff. |
This paper introduces UniFL, a novel unified feedback learning framework for improving text-to-image diffusion models. UniFL aims to address limitations in existing models, such as inferior visual quality, lack of aesthetic appeal, and inefficient inference. |
This paper is important because it presents a comprehensive solution to improve text-to-image diffusion models in multiple aspects. By leveraging various feedback learning techniques, UniFL enhances the visual quality, aesthetic appeal, and inference speed of diffusion models, which are crucial for broader applications and user satisfaction. |
UniFL achieves its goals through three key components: (1) Perceptual Feedback Learning (PeFL) leverages existing visual perception models (e.g., VGG, instance segmentation models) to enhance specific visual aspects like style and structure. (2) Decoupled Feedback Learning utilizes separate reward models for different aesthetic dimensions (e.g., color, layout, lighting, detail) and incorporates an active prompt selection strategy to mitigate overfitting. (3) Adversarial Feedback Learning treats the reward model as a discriminator in adversarial training, enabling optimization for faster inference without sacrificing quality. |
UniFL demonstrates superior performance in both quantitative and qualitative evaluations. It outperforms competitive methods like ImageReward, DreamShaper, and DPO in terms of FID, CLIP Score, and aesthetic scores on SD1.5 and SDXL architectures. User studies confirm UniFL's superiority in generation quality and acceleration, surpassing LCM, SDXL-Turbo, and SDXL-Lightning. Notably, UniFL shows promising generalization capabilities, effectively transferring its improvements to downstream tasks like LoRA, ControlNet, and AnimateDiff. |
The authors identify several limitations and future work directions: exploring larger and more advanced visual perception models for enhanced supervision, further improving acceleration towards one-step inference, and streamlining the current two-stage optimization process into a single-stage approach. |
diffusion_model, feedback_learning, acceleration, aesthetic, quality, inference, text-to-image, perceptual_loss, adversarial_training |
2403.12963 |
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis |
Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, Hongsheng Li |
In this study, we delve into the generation of high-resolution images from
pre-trained diffusion models, addressing persistent challenges, such as
repetitive patterns and structural distortions, that emerge when models are
applied beyond their trained resolutions. To address this issue, we introduce
an innovative, training-free approach FouriScale from the perspective of
frequency domain analysis. We replace the original convolutional layers in
pre-trained diffusion models by incorporating a dilation technique along with a
low-pass operation, intending to achieve structural consistency and scale
consistency across resolutions, respectively. Further enhanced by a
padding-then-crop strategy, our method can flexibly handle text-to-image
generation of various aspect ratios. By using the FouriScale as guidance, our
method successfully balances the structural integrity and fidelity of generated
images, achieving an astonishing capacity of arbitrary-size, high-resolution,
and high-quality generation. With its simplicity and compatibility, our method
can provide valuable insights for future explorations into the synthesis of
ultra-high-resolution images. The code will be released at
https://github.com/LeonHLJ/FouriScale. |
This paper introduces FouriScale, a training-free method to enable pre-trained diffusion models to synthesize high-resolution images without repetitive patterns or structural distortions. |
The paper addresses a critical limitation of diffusion models, which are typically trained at fixed resolutions, hindering their ability to generate high-quality images at arbitrary sizes. FouriScale offers a simple yet effective solution to this problem, making it highly relevant for various applications requiring high-resolution image generation. |
FouriScale modifies the convolutional layers within the diffusion model's UNet architecture. It replaces standard convolutions with a combination of dilated convolutions and low-pass filtering to achieve structural and scale consistency across resolutions. It utilizes a padding-then-cropping strategy to generate images with arbitrary aspect ratios and introduces FouriScale guidance for enhanced image quality. |
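The sketch below illustrates, under simplifying assumptions, the two operations named above: an ideal low-pass filter applied via FFT (for scale consistency) followed by a dilated convolution that reuses the pre-trained kernel (for structural consistency). The ideal cutoff, the dilation-equals-scale rule, and the padding choice are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def low_pass(x, keep_fraction=0.5):
    # Ideal low-pass filter in the 2D frequency domain.
    X = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    h, w = x.shape[-2:]
    mask = torch.zeros(h, w)
    ch, cw = int(h * keep_fraction / 2), int(w * keep_fraction / 2)
    mask[h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw] = 1.0
    return torch.fft.ifft2(torch.fft.ifftshift(X * mask, dim=(-2, -1))).real

def fouriscale_conv(x, weight, bias, scale=2):
    x = low_pass(x, keep_fraction=1.0 / scale)                       # drop aliasing-prone frequencies
    return F.conv2d(x, weight, bias, padding=scale, dilation=scale)  # dilate the pre-trained kernel

x = torch.randn(1, 4, 128, 128)
w, b = torch.randn(4, 4, 3, 3), torch.randn(4)
y = fouriscale_conv(x, w, b, scale=2)
```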
FouriScale effectively mitigates pattern repetition and distortions in high-resolution image synthesis, outperforming other training-free methods like Attn-Entro and ScaleCrafter. It exhibits consistent performance across different pre-trained models like SD 1.5, SD 2.1, and SDXL, demonstrating its robustness and generalizability. Quantitative evaluations using FID and KID demonstrate its superior performance over baselines. |
The authors acknowledge that FouriScale encounters limitations in generating ultra-high-resolution images (e.g., 4096x4096) where artifacts may arise. Additionally, its reliance on convolutional operations restricts its application to purely transformer-based diffusion models. Future work may explore extending FouriScale for ultra-high resolution and adapting it for transformer architectures. |
diffusion_model, image_synthesis, high_resolution, training-free, frequency_domain, convolutional_neural_networks, generative_models |
2310.16834 |
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution |
Aaron Lou, Chenlin Meng, Stefano Ermon |
Despite their groundbreaking performance for many generative modeling tasks,
diffusion models have fallen short on discrete data domains such as natural
language. Crucially, standard diffusion models rely on the well-established
theory of score matching, but efforts to generalize this to discrete structures
have not yielded the same empirical gains. In this work, we bridge this gap by
proposing score entropy, a novel loss that naturally extends score matching to
discrete spaces, integrates seamlessly to build discrete diffusion models, and
significantly boosts performance. Experimentally, we test our Score Entropy
Discrete Diffusion models (SEDD) on standard language modeling tasks. For
comparable model sizes, SEDD beats existing language diffusion paradigms
(reducing perplexity by $25$-$75$\%) and is competitive with autoregressive
models, in particular outperforming GPT-2. Furthermore, compared to
autoregressive models, SEDD generates faithful text without requiring
distribution annealing techniques like temperature scaling (around
$6$-$8\times$ better generative perplexity than un-annealed GPT-2), can trade
compute and quality (similar quality with $32\times$ fewer network
evaluations), and enables controllable infilling (matching nucleus sampling
quality while enabling other strategies besides left to right prompting). |
This paper introduces Score Entropy Discrete Diffusion (SEDD), a novel approach for building discrete diffusion models parameterized by the ratios of the data distribution, aiming to address the limitations of existing diffusion models in handling discrete data like natural language. |
The paper is important because it presents a novel method for discrete diffusion models that outperforms previous models in language modeling tasks, challenges the dominance of autoregressive models, and offers advantages like faster, controllable, and higher-quality generation without relying on distribution annealing techniques. |
The authors develop a novel loss function called score entropy, analogous to score matching used in continuous diffusion models. They use this loss to train a seq-to-seq transformer model on various language modeling tasks like text8, One Billion Words, and GPT-2 zero-shot tasks. They evaluate their model's performance on perplexity and generation quality, comparing it against existing diffusion models and autoregressive models like GPT-2. |
SEDD significantly outperforms previous discrete diffusion models on language modeling benchmarks and achieves competitive perplexity scores compared to autoregressive models, even surpassing GPT-2 on some tasks. Furthermore, SEDD generates higher-quality text without distribution annealing techniques and allows for flexible conditional generation, including infilling, matching the performance of models that rely on such techniques. |
The paper acknowledges limitations such as the gap with modern large language models and the need for exploring better distribution annealing techniques for SEDD. Future work could focus on closing the performance gap with larger LMs, adapting empirical designs from continuous diffusion models, and systematically exploring noise schedules and loss weightings for further improvement. |
diffusion_model, llm, analysis, language_modeling, text_generation |
2404.02883 |
On the Scalability of Diffusion-based Text-to-Image Generation |
Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto |
Scaling up model and data size has been quite successful for the evolution of
LLMs. However, the scaling law for the diffusion based text-to-image (T2I)
models is not fully explored. It is also unclear how to efficiently scale the
model for better performance at reduced cost. The different training settings
and expensive training cost make a fair model comparison extremely difficult.
In this work, we empirically study the scaling properties of diffusion based
T2I models by performing extensive and rigorous ablations on scaling both
denoising backbones and training set, including training scaled UNet and
Transformer variants ranging from 0.4B to 4B parameters on datasets up to 600M
images. For model scaling, we find the location and amount of cross attention
distinguish the performance of existing UNet designs. And increasing the
transformer blocks is more parameter-efficient for improving text-image
alignment than increasing channel numbers. We then identify an efficient UNet
variant, which is 45% smaller and 28% faster than SDXL's UNet. On the data
scaling side, we show the quality and diversity of the training set matters
more than simply dataset size. Increasing caption density and diversity
improves text-image alignment performance and the learning efficiency. Finally,
we provide scaling functions to predict the text-image alignment performance as
functions of the scale of model size, compute and dataset size. |
This paper investigates the scaling properties of diffusion-based text-to-image models, focusing on the denoising backbone and dataset size to understand how to design and train these models effectively. |
This work is important because it provides insights into the design and training of large-scale text-to-image models, which are computationally expensive to develop. The findings offer practical guidance for improving performance and efficiency in this domain. |
The authors conducted controlled experiments by training various UNet and Transformer architectures with different sizes and configurations. They also curated and used large-scale datasets, analyzing the impact of dataset size, quality, and caption enhancement on model performance. Key metrics like TIFA, ImageReward, FID, CLIP score, and HPSv2 were used to evaluate the models. |
The paper demonstrates that SDXL's UNet design is superior to its counterparts, and strategically increasing its transformer depth is more parameter-efficient for better text-image alignment than solely increasing channel numbers. Additionally, they identified an efficient UNet variant with 45% fewer parameters and 28% faster inference than SDXL, achieving comparable performance. The study also highlights that dataset quality matters more than size, and augmenting datasets with synthetic captions significantly improves training efficiency and performance. |
The paper acknowledges limitations in training Transformers from scratch due to a lack of inductive bias compared to UNets, suggesting further exploration of architectural improvements for Transformers in future work. Additionally, while the study provides valuable insights into scaling laws for text-to-image models, it acknowledges the need for further investigation with even larger models and datasets. |
diffusion_model, analysis, text-to-image, unet, transformer, scaling_law, dataset, caption, efficiency |
2401.12945 |
Lumiere: A Space-Time Diffusion Model for Video Generation |
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri |
We introduce Lumiere -- a text-to-video diffusion model designed for
synthesizing videos that portray realistic, diverse and coherent motion -- a
pivotal challenge in video synthesis. To this end, we introduce a Space-Time
U-Net architecture that generates the entire temporal duration of the video at
once, through a single pass in the model. This is in contrast to existing video
models which synthesize distant keyframes followed by temporal super-resolution
-- an approach that inherently makes global temporal consistency difficult to
achieve. By deploying both spatial and (importantly) temporal down- and
up-sampling and leveraging a pre-trained text-to-image diffusion model, our
model learns to directly generate a full-frame-rate, low-resolution video by
processing it in multiple space-time scales. We demonstrate state-of-the-art
text-to-video generation results, and show that our design easily facilitates a
wide range of content creation tasks and video editing applications, including
image-to-video, video inpainting, and stylized generation. |
This paper introduces Lumiere, a text-to-video diffusion model that synthesizes videos with realistic and coherent motion by generating the entire temporal duration at once using a novel Space-Time U-Net architecture. |
This paper addresses a critical challenge in video synthesis: generating videos with realistic and coherent motion over extended durations. It deviates from the prevalent cascaded approach and proposes a novel architecture that significantly improves the quality and coherence of generated motion in videos. |
The authors propose a Space-Time U-Net (STUNet) that processes and generates the entire video simultaneously by downsampling and upsampling the video in both space and time. This architecture leverages a pre-trained text-to-image diffusion model and employs Multidiffusion for spatial super-resolution to ensure temporal consistency. |
Lumiere demonstrates state-of-the-art results in text-to-video generation, producing high-quality videos with superior motion coherence and visual fidelity compared to existing methods. It also exhibits strong performance in various downstream tasks, including image-to-video generation, video inpainting, and stylized generation. |
The paper acknowledges limitations in generating multi-shot videos or those involving scene transitions. Future work could explore extending Lumiere to address these limitations and investigate its application to latent video diffusion models. |
diffusion_model, video, motion, text-to-video, video_generation, image-to-video, video_inpainting, stylized_generation |
2311.01462 |
Idempotent Generative Network |
Assaf Shocher, Amil Dravid, Yossi Gandelsman, Inbar Mosseri, Michael Rubinstein, Alexei A. Efros |
We propose a new approach for generative modeling based on training a neural
network to be idempotent. An idempotent operator is one that can be applied
sequentially without changing the result beyond the initial application, namely
$f(f(z))=f(z)$. The proposed model $f$ is trained to map a source distribution
(e.g, Gaussian noise) to a target distribution (e.g. realistic images) using
the following objectives: (1) Instances from the target distribution should map
to themselves, namely $f(x)=x$. We define the target manifold as the set of all
instances that $f$ maps to themselves. (2) Instances that form the source
distribution should map onto the defined target manifold. This is achieved by
optimizing the idempotence term, $f(f(z))=f(z)$ which encourages the range of
$f(z)$ to be on the target manifold. Under ideal assumptions such a process
provably converges to the target distribution. This strategy results in a model
capable of generating an output in one step, maintaining a consistent latent
space, while also allowing sequential applications for refinement.
Additionally, we find that by processing inputs from both target and source
distributions, the model adeptly projects corrupted or modified data back to
the target manifold. This work is a first step towards a ``global projector''
that enables projecting any input into a target data distribution. |
This paper introduces Idempotent Generative Networks (IGN), a novel approach to generative modeling that trains a neural network to be idempotent, meaning applying it repeatedly yields the same result as the initial application. |
This paper is significant because it presents a new perspective on generative modeling with unique advantages: one-step generation, optional sequential refinement, consistent latent space, and the potential for acting as a "global projector" to map various input distributions onto a target manifold. |
The authors propose a training methodology with three key objectives: 1) Reconstruction: Data samples should be mapped to themselves. 2) Idempotence: Applying the network twice should yield the same result as applying it once. 3) Tightness: The set of instances mapped to themselves should be minimized. They achieve this through a novel self-adversarial training scheme using a single network. |
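A compact sketch of the three objectives, approximating the self-adversarial gradient routing with a frozen copy of the network; the loss weights, the L1 distance, and the toy MLP are illustrative, and the exact stop-gradient arrangement in the paper differs.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

f = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 8))

def ign_losses(f, x_real, z_noise, lambda_tight=0.1):
    f_frozen = copy.deepcopy(f).requires_grad_(False)
    rec = F.l1_loss(f(x_real), x_real)              # reconstruction: f(x) = x on data
    fz = f(z_noise)
    idem = F.l1_loss(f_frozen(fz), fz)              # idempotence: push f(z) onto the manifold
    fz_frozen = f_frozen(z_noise)
    tight = -F.l1_loss(f(fz_frozen), fz_frozen)     # tightness: adversarially shrink the manifold
    return rec + idem + lambda_tight * tight

loss = ign_losses(f, torch.randn(4, 8), torch.randn(4, 8))
loss.backward()
```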
The paper provides theoretical guarantees of IGN's convergence to the target distribution under ideal conditions. Experiments on MNIST and CelebA datasets demonstrate IGN's ability to generate realistic images from noise, perform latent space manipulations, and project out-of-distribution images (noisy, grayscale, sketches) onto the learned image manifold. |
The authors acknowledge limitations such as mode collapse and blurriness in generated images, suggesting potential solutions like GAN mode collapse prevention techniques and perceptual or two-step loss functions. Future work aims to scale up IGN by training on larger datasets to explore its full potential. |
diffusion_model, gan, generative_model, idempotence, image_generation, latent_space, projection, out-of-distribution |
2405.01496 |
LocInv: Localization-aware Inversion for Text-Guided Image Editing |
Chuanming Tang, Kai Wang, Fei Yang, Joost van de Weijer |
Large-scale Text-to-Image (T2I) diffusion models demonstrate significant
generation capabilities based on textual prompts. Based on the T2I diffusion
models, text-guided image editing research aims to empower users to manipulate
generated images by altering the text prompts. However, existing image editing
techniques are prone to editing over unintentional regions that are beyond the
intended target area, primarily due to inaccuracies in cross-attention maps. To
address this problem, we propose Localization-aware Inversion (LocInv), which
exploits segmentation maps or bounding boxes as extra localization priors to
refine the cross-attention maps in the denoising phases of the diffusion
process. Through the dynamic updating of tokens corresponding to noun words in
the textual input, we are compelling the cross-attention maps to closely align
with the correct noun and adjective words in the text prompt. Based on this
technique, we achieve fine-grained image editing over particular objects while
preventing undesired changes to other regions. Our method LocInv, based on the
publicly available Stable Diffusion, is extensively evaluated on a subset of
the COCO dataset, and consistently obtains superior results both quantitatively
and qualitatively. The code will be released at
https://github.com/wangkai930418/DPL |
This paper introduces Localization-aware Inversion (LocInv), a novel method for text-guided image editing that leverages localization priors like segmentation maps or bounding boxes to enhance the accuracy of cross-attention maps in diffusion models, thereby improving the precision of object manipulation and attribute editing. |
This paper addresses the crucial issue of cross-attention leakage in text-guided image editing with diffusion models. Existing methods often struggle to precisely edit intended objects, leading to unintended alterations in other image regions. LocInv tackles this problem by incorporating readily available localization information, leading to more accurate and controllable image editing. |
LocInv utilizes pre-trained Stable Diffusion models and incorporates localization priors (segmentation maps or bounding boxes) obtained from datasets or foundation models. By dynamically updating tokens associated with noun words during the denoising process, it refines cross-attention maps, enforcing better alignment with target objects. Additionally, for attribute editing, LocInv introduces an adjective binding loss to align adjective representations with corresponding nouns, improving the model's ability to edit object attributes. |
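As an illustration of how a localization prior can steer a cross-attention map, the toy loss below rewards attention mass that falls inside a binary mask; in this style of dynamic token updating, the gradient of such a loss would flow back into the learnable noun token that produced the map. The specific similarity measure is a simplification, not the paper's loss.

```python
import torch

def localization_loss(attn_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # attn_map: (H, W) cross-attention map for one noun token; mask: (H, W) binary prior.
    attn = attn_map / (attn_map.sum() + 1e-8)   # normalize to a distribution over pixels
    inside = (attn * mask).sum()                # attention mass inside the prior region
    return 1.0 - inside                         # minimized when attention concentrates in the mask

attn = torch.softmax(torch.randn(32 * 32), dim=0).view(32, 32)
mask = torch.zeros(32, 32)
mask[8:24, 8:24] = 1.0
print(localization_loss(attn, mask))
```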
Through extensive evaluations on a subset of the COCO dataset, LocInv consistently outperforms existing text-guided image editing methods in both quantitative metrics (LPIPS, SSIM, PSNR, CLIP Score, DINO-Sim) and qualitative comparisons. The method shows superior performance in local object Word-Swap tasks, preserving background integrity while accurately replacing target objects. Notably, LocInv demonstrates the novel capability for Attribute-Edit, successfully modifying object colors and materials by binding adjective and noun representations, a feature unexplored by most existing methods. |
The authors acknowledge limitations related to the resolution of cross-attention maps, the editing capabilities of frozen Stable Diffusion models, and challenges in reconstructing high-frequency image details. Future work aims to explore pixel-level text-to-image models for finer control, integrate techniques like InstructPix2Pix for enhanced editing, and address limitations in reconstructing intricate image details. |
diffusion_model, image_editing, text-guided, cross-attention, localization, segmentation, bounding_box, stable_diffusion, attribute_editing, word-swap |
2401.12086 |
West-of-N: Synthetic Preference Generation for Improved Reward Modeling |
Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn |
The success of reinforcement learning from human feedback (RLHF) in language
model alignment is strongly dependent on the quality of the underlying reward
model. In this paper, we present a novel approach to improve reward model
quality by generating synthetic preference data, thereby augmenting the
training dataset with on-policy, high-quality preference pairs. Motivated by
the promising results of Best-of-N sampling strategies in language model
training, we extend their application to reward model training. This results in
a self-training strategy to generate preference pairs by selecting the best and
worst candidates in a pool of responses to a given query. Empirically, we find
that this approach improves the performance of any reward model, with an effect
comparable to the addition of a similar quantity of human preference data. This
work opens up new avenues of research for improving RLHF for language model
alignment, by offering synthetic preference generation as a solution to reward
modeling challenges. |
This paper introduces West-of-N, a novel method for improving reward models in Reinforcement Learning from Human Feedback (RLHF) by generating synthetic preference data using Best-of-N sampling from a language model. |
This work addresses the critical bottleneck of data scarcity in RLHF by proposing a scalable method for generating high-quality, on-policy preference data, potentially reducing the reliance on expensive and time-consuming human annotations. |
The authors propose a self-training strategy where a base preference model (trained on initial data) is used to select the most and least preferred responses (West-of-N) from a pool generated by the language model. These synthetic preferences are then used to train a more accurate reward model. |
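A minimal sketch of the pair-construction step, with `generate` and `base_reward` as placeholder callables for the policy's sampler and the initial preference model (names assumed, not from the paper's code):

```python
def west_of_n_pair(query, generate, base_reward, n=8):
    # Sample N candidate responses, score them, and keep the best/worst pair.
    candidates = [generate(query) for _ in range(n)]
    scored = sorted(candidates, key=lambda resp: base_reward(query, resp))
    return {"query": query, "chosen": scored[-1], "rejected": scored[0]}

# The resulting (chosen, rejected) pairs augment the preference dataset used to
# train the improved reward model.
```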
Empirical results show that West-of-N significantly improves reward model accuracy and downstream language model alignment, outperforming baseline methods like RLAIF and RLCD. Notably, the gains from West-of-N are comparable to doubling the amount of human preference data. |
Limitations include potential reward hacking by the base model when identifying West-of-N pairs with very large N. Future work could address this through reward model uncertainty estimation. Additionally, exploring other self-training techniques from the literature could further enhance West-of-N. |
diffusion_model, llm, rlhf, preference_modeling, synthetic_data, self_training, best-of-n, reward_modeling |
2403.02327 |
Model Lakes |
Koyena Pal, David Bau, Renée J. Miller |
Given a set of deep learning models, it can be hard to find models
appropriate to a task, understand the models, and characterize how models are
different one from another. Currently, practitioners rely on manually-written
documentation to understand and choose models. However, not all models have
complete and reliable documentation. As the number of machine learning models
increases, this issue of finding, differentiating, and understanding models is
becoming more crucial. Inspired from research on data lakes, we introduce and
define the concept of model lakes. We discuss fundamental research challenges
in the management of large models. And we discuss what principled data
management techniques can be brought to bear on the study of large model
management. |
This paper introduces the concept of "model lakes" as a way to manage and understand the growing number of deep learning models, drawing parallels to data lakes in the data management field. |
This paper is important because it addresses the difficulty in finding, understanding, and comparing deep learning models due to the reliance on often incomplete or unreliable manual documentation. It proposes model lakes, inspired by data lakes, as a potential solution to these challenges. |
This is a vision paper that draws analogies from the data management literature, particularly data lakes, and proposes a roadmap for future research in model management; it performs no experiments. |
As a vision paper, it reports no experimental results. Instead, it proposes a model lake framework; outlines key challenges such as content-based model search, related-model search, documentation verification, data citation, provenance, and version control; and discusses potential approaches inspired by solutions developed for data lakes in data management. |
The authors identify limitations in current model management practices, including reliance on incomplete metadata and manual documentation. They propose future work on content-based model search, automated documentation verification, data citation for models, model provenance tracking, and model version management, emphasizing the need for standardized benchmarks and evaluation metrics. |
model_lake, model_management, model_search, model_provenance, model_versioning, analysis, literature_review |
2312.00785 |
Sequential Modeling Enables Scalable Learning for Large Vision Models |
Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros |
We introduce a novel sequential modeling approach which enables learning a
Large Vision Model (LVM) without making use of any linguistic data. To do this,
we define a common format, "visual sentences", in which we can represent raw
images and videos as well as annotated data sources such as semantic
segmentations and depth reconstructions without needing any meta-knowledge
beyond the pixels. Once this wide variety of visual data (comprising 420
billion tokens) is represented as sequences, the model can be trained to
minimize a cross-entropy loss for next token prediction. By training across
various scales of model architecture and data diversity, we provide empirical
evidence that our models scale effectively. Many different vision tasks can be
solved by designing suitable visual prompts at test time. |
This paper introduces a Large Vision Model (LVM) trained solely on a massive dataset of visual data, formatted as "visual sentences," without relying on any linguistic information. This approach involves tokenizing images into discrete tokens using VQGAN and training a causal transformer model to predict the next token, enabling various vision tasks to be performed through visual prompting. |
This paper is significant as it explores the potential of building large vision models analogous to large language models, demonstrating that visual understanding can be achieved without relying on language data. It pushes the boundaries of self-supervised learning in vision and paves the way for more general and scalable visual models capable of handling diverse tasks through in-context learning. |
The authors curated a massive, diverse dataset of visual data called UVDv1, encompassing single images, image sequences, annotated images, annotated image sequences, and 3D synthetic objects, totaling 1.64 billion images. They introduced the concept of "visual sentences" to unify various data formats, treating each sentence as a sequence of visual tokens generated by a VQGAN tokenizer. A causal transformer model was trained to predict the next token in the sequence, enabling in-context learning for downstream tasks through visual prompting. |
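A hedged sketch of the next-token objective on a visual sentence, assuming a hypothetical `vqgan.encode_to_ids` tokenizer and a causal `transformer` that maps token ids to per-position vocabulary logits; the interfaces are placeholders rather than the paper's released code.
```python
import torch
import torch.nn.functional as F

def visual_sentence(vqgan, images):
    """Flatten each image's discrete VQGAN token ids and concatenate them
    into one long 1-D 'visual sentence' of tokens."""
    return torch.cat([vqgan.encode_to_ids(img).flatten() for img in images])

def next_token_loss(transformer, tokens):
    """Standard causal cross-entropy: predict token t+1 from tokens <= t."""
    inputs, targets = tokens[:-1].unsqueeze(0), tokens[1:].unsqueeze(0)
    logits = transformer(inputs)  # expected shape: (1, len-1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```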
The paper demonstrates that the LVM exhibits strong scaling behavior, with larger models and more data leading to better performance on various vision tasks such as semantic segmentation, depth estimation, and keypoint detection, even outperforming some task-specific models on unseen datasets. The model also showcases an ability to generalize to novel tasks, handle out-of-distribution data, and perform basic visual reasoning, suggesting potential for more advanced visual understanding. |
The authors acknowledge limitations such as computational constraints, under-constrained visual prompting compared to language, tokenizer limitations, and the relatively small size of the LVM compared to LLMs. Future work includes scaling up the model and exploring its capabilities in visual reasoning, emergence, and generalization. |
diffusion_model, llm, analysis, 3d, motion, video, interpretability |
2308.15321 |
Elucidating the Exposure Bias in Diffusion Models |
Mang Ning, Mingxiao Li, Jianlin Su, Albert Ali Salah, Itir Onal Ertugrul |
Diffusion models have demonstrated impressive generative capabilities, but
their \textit{exposure bias} problem, described as the input mismatch between
training and sampling, lacks in-depth exploration. In this paper, we
systematically investigate the exposure bias problem in diffusion models by
first analytically modelling the sampling distribution, based on which we then
attribute the prediction error at each sampling step as the root cause of the
exposure bias issue. Furthermore, we discuss potential solutions to this issue
and propose an intuitive metric for it. Along with the elucidation of exposure
bias, we propose a simple, yet effective, training-free method called Epsilon
Scaling to alleviate the exposure bias. We show that Epsilon Scaling explicitly
moves the sampling trajectory closer to the vector field learned in the
training phase by scaling down the network output, mitigating the input
mismatch between training and sampling. Experiments on various diffusion
frameworks (ADM, DDIM, EDM, LDM, DiT, PFGM++) verify the effectiveness of our
method. Remarkably, our ADM-ES, as a state-of-the-art stochastic sampler,
obtains 2.17 FID on CIFAR-10 under 100-step unconditional generation. The code
is available at \url{https://github.com/forever208/ADM-ES} and
\url{https://github.com/forever208/EDM-ES}. |
This paper investigates the exposure bias problem in diffusion models, where the input mismatch between training and sampling leads to error accumulation and sampling drift. The paper analyzes the sampling distribution with prediction error, proposes a metric for quantifying exposure bias, and introduces Epsilon Scaling, a training-free method for alleviating this issue by scaling down the network output during sampling. |
The paper is important because it provides an in-depth analysis of the exposure bias problem in diffusion models, which is a key factor affecting sample quality, especially in fast sampling scenarios. The proposed Epsilon Scaling method offers a simple yet effective solution to improve sample quality without retraining, making it widely applicable across different diffusion model architectures and samplers. |
The authors first analytically model the sampling distribution by considering the prediction error. Then, they propose a metric (variance error) to quantify the exposure bias at each timestep. To address the exposure bias issue, they propose Epsilon Scaling, a training-free method that scales down the network output (epsilon) during sampling based on a linear schedule derived from the accumulated error. The authors evaluate their method using FID scores on various datasets (CIFAR-10, LSUN, FFHQ, ImageNet) and diffusion frameworks (ADM, DDIM, DDPM, EDM, LDM, DiT, PFGM++). |
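A hedged sketch of Epsilon Scaling inside a deterministic DDIM-style update: the only change relative to standard sampling is dividing the predicted noise by a schedule value lambda_t > 1. The `eps_model`, the `alpha_bar` buffer, and the schedule value are illustrative assumptions.
```python
import torch

@torch.no_grad()
def ddim_step_epsilon_scaling(eps_model, x_t, t, t_prev, alpha_bar, lambda_t):
    """One DDIM step where the network output is scaled down (Epsilon Scaling)."""
    eps = eps_model(x_t, t) / lambda_t                      # lambda_t > 1 shrinks the prediction
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    x0_pred = (x_t - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)
    return torch.sqrt(a_prev) * x0_pred + torch.sqrt(1 - a_prev) * eps
```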
Epsilon Scaling consistently improves FID scores across various diffusion frameworks, datasets, and conditional settings. For instance, ADM-ES obtains 2.17 FID on CIFAR-10 under 100-step unconditional generation, outperforming previous state-of-the-art stochastic samplers. Epsilon Scaling is shown to effectively reduce exposure bias by moving the sampling trajectory closer to the vector field learned during training. The method exhibits insensitivity to the scaling parameter, requiring minimal effort to search for an optimal value. |
The authors acknowledge that Epsilon Scaling corrects only the magnitude error of the network prediction, not the direction error, implying there is still room for improvement. Future work could focus on exploring methods to further reduce the exposure bias by addressing the direction error. Another avenue for future work is investigating the effectiveness of Epsilon Scaling on other diffusion-based applications beyond image generation, such as audio and video generation. |
diffusion_model, exposure_bias, sampling, fid, analysis, training-free, image_generation, adm, ddim, ddpm, edm, ldm, dit, pfgm++ |
2402.17177 |
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models |
Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, Lichao Sun |
Sora is a text-to-video generative AI model, released by OpenAI in February
2024. The model is trained to generate videos of realistic or imaginative
scenes from text instructions and show potential in simulating the physical
world. Based on public technical reports and reverse engineering, this paper
presents a comprehensive review of the model's background, related
technologies, applications, remaining challenges, and future directions of
text-to-video AI models. We first trace Sora's development and investigate the
underlying technologies used to build this "world simulator". Then, we describe
in detail the applications and potential impact of Sora in multiple industries
ranging from film-making and education to marketing. We discuss the main
challenges and limitations that need to be addressed to widely deploy Sora,
such as ensuring safe and unbiased video generation. Lastly, we discuss the
future development of Sora and video generation models in general, and how
advancements in the field could enable new ways of human-AI interaction,
boosting productivity and creativity of video generation. |
This paper provides a comprehensive review of Sora, OpenAI's text-to-video generation model, exploring its background, related technologies, potential applications, limitations, and future directions. |
Sora represents a significant breakthrough in AI, demonstrating the ability to generate high-quality, minute-long videos from text prompts, thus marking a milestone in AI-powered video generation and opening up possibilities in various fields. |
The paper combines analysis of published technical reports and reverse engineering based on existing literature to dissect Sora's architecture, training methodologies, and capabilities. |
The authors provide insights into Sora's architecture, including data pre-processing, the use of diffusion transformers, language instruction following, and prompt engineering. They highlight Sora's ability to handle variable video durations and resolutions, simulate complex scenes, and produce high-quality videos, while also pointing out current limitations in physical realism and human-computer interaction. |
The paper identifies limitations like challenges in accurately depicting complex physical interactions, maintaining temporal accuracy, and limitations in user control for detailed modifications. It suggests future research directions such as exploring more robust training datasets, improving realism in physical simulations, and enhancing user interaction capabilities for finer control over video generation. |
diffusion_model, llm, analysis, video, sora |
2404.15447 |
GLoD: Composing Global Contexts and Local Details in Image Generation |
Moyuru Yamada |
Diffusion models have demonstrated their capability to synthesize
high-quality and diverse images from textual prompts. However, simultaneous
control over both global contexts (e.g., object layouts and interactions) and
local details (e.g., colors and emotions) still remains a significant
challenge. The models often fail to understand complex descriptions involving
multiple objects and reflect specified visual attributes to wrong targets or
ignore them. This paper presents Global-Local Diffusion (\textit{GLoD}), a
novel framework which allows simultaneous control over the global contexts and
the local details in text-to-image generation without requiring training or
fine-tuning. It assigns multiple global and local prompts to corresponding
layers and composes their noises to guide a denoising process using pre-trained
diffusion models. Our framework enables complex global-local compositions,
conditioning objects in the global prompt with the local prompts while
preserving other unspecified identities. Our quantitative and qualitative
evaluations demonstrate that GLoD effectively generates complex images that
adhere to both user-provided object interactions and object details. |
This paper introduces Global-Local Diffusion (GLoD), a novel framework for controllable text-to-image generation using diffusion models, which allows simultaneous control over global contexts (e.g., object layouts and interactions) and local details (e.g., colors and emotions) by composing multiple global and local prompts. |
This paper addresses the limitation of existing diffusion-based text-to-image generation methods that struggle to simultaneously control both global and local aspects of the generated image. GLoD offers a training-free approach for more complex and controllable image synthesis, which is crucial for real-world applications. |
GLoD leverages pre-trained diffusion models and utilizes a novel layer composition approach. It takes global and local prompts as input, generates separate noises for each prompt, and then composes them effectively using global and local guidance mechanisms. This allows the model to incorporate both global context and local details into the generated image. |
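An illustrative (not author-exact) sketch of composing a global noise prediction with region-masked local predictions during denoising; `eps_model`, the prompt embeddings, the binary latent-space masks, and the guidance weight `w` are all placeholders.
```python
import torch

@torch.no_grad()
def compose_global_local_noise(eps_model, x_t, t, global_emb, local_embs, masks, w=1.0):
    """Steer each masked region toward its local prompt on top of the global prompt."""
    eps_global = eps_model(x_t, t, global_emb)
    eps = eps_global.clone()
    for emb, mask in zip(local_embs, masks):       # mask: (1, 1, H, W) binary region in latent space
        eps_local = eps_model(x_t, t, emb)
        eps = eps + w * mask * (eps_local - eps_global)
    return eps
```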
GLoD demonstrates superior performance in generating complex images that adhere to both global contexts and local details specified by the user. Quantitative evaluation shows improved alignment scores for both global and local attributes compared to existing methods, demonstrating better controllability. GLoD also effectively reduces undesirable attribute interference between objects in a scene. |
One limitation identified is the potential for partial object appearance changes when the latent representation of the object differs significantly between the global and local prompts. Future work could explore techniques to mitigate this issue. Additionally, expanding the framework to handle more complex relationships between objects and exploring its application to other domains like video or 3D object generation are promising directions. |
diffusion_model, text-to-image generation, controllable image synthesis, global context, local detail, layer composition, training-free |
2404.12382 |
Lazy Diffusion Transformer for Interactive Image Editing |
Yotam Nitzan, Zongze Wu, Richard Zhang, Eli Shechtman, Daniel Cohen-Or, Taesung Park, Michaël Gharbi |
We introduce a novel diffusion transformer, LazyDiffusion, that generates
partial image updates efficiently. Our approach targets interactive image
editing applications in which, starting from a blank canvas or an image, a user
specifies a sequence of localized image modifications using binary masks and
text prompts. Our generator operates in two phases. First, a context encoder
processes the current canvas and user mask to produce a compact global context
tailored to the region to generate. Second, conditioned on this context, a
diffusion-based transformer decoder synthesizes the masked pixels in a "lazy"
fashion, i.e., it only generates the masked region. This contrasts with
previous works that either regenerate the full canvas, wasting time and
computation, or confine processing to a tight rectangular crop around the mask,
ignoring the global image context altogether. Our decoder's runtime scales with
the mask size, which is typically small, while our encoder introduces
negligible overhead. We demonstrate that our approach is competitive with
state-of-the-art inpainting methods in terms of quality and fidelity while
providing a 10x speedup for typical user interactions, where the editing mask
represents 10% of the image. |
This paper introduces LazyDiffusion, a novel diffusion transformer model designed for efficient partial image generation, particularly targeting interactive image editing applications like inpainting. |
This work is important as it addresses the inefficiency of traditional inpainting methods that regenerate the entire image, even when editing small portions. LazyDiffusion offers a significant speedup for localized edits while maintaining global consistency, making diffusion models more practical for interactive workflows. |
The authors propose a two-stage approach: 1) A context encoder processes the entire image and mask to extract a compact global context specific to the masked region. 2) A diffusion-based transformer decoder iteratively generates only the masked pixels, conditioned on this context and the user's text prompt. This approach ensures global coherence while significantly reducing computational cost by focusing solely on the area of interest. |
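A schematic sketch of the two-phase loop, with hypothetical `context_encoder` and `decoder` objects standing in for the paper's components; the point is only that the context is computed once while the per-step decoder cost scales with the number of masked tokens.
```python
import torch

@torch.no_grad()
def lazy_inpaint(context_encoder, decoder, canvas, mask, prompt_emb, steps=50):
    ctx = context_encoder(canvas, mask)                  # global context, computed once per edit
    n_masked = int(mask.sum().item())                    # decoder works only on masked tokens
    z = torch.randn(n_masked, decoder.token_dim)
    for t in reversed(range(steps)):
        z = decoder.denoise_step(z, t, ctx, prompt_emb)  # cost ~ number of masked tokens
    return decoder.paste(z, canvas, mask)                # blend the generated region back in
```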
LazyDiffusion achieves a speedup of up to 10x compared to full-image inpainting methods for masks covering 10% of the image. It demonstrates competitive quality with state-of-the-art inpainting models, especially in scenarios requiring high semantic context, indicating the effectiveness of its compressed context representation. User studies confirm a strong preference for LazyDiffusion over crop-based methods and comparable preference to full-image methods. |
The authors acknowledge limitations regarding the context encoder's quadratic scaling with input size, potentially limiting scalability to ultra-high-resolution images. They also identify occasional color inconsistencies between generated and visible regions. Future work could explore more efficient context encoding mechanisms and more principled solutions for seamless blending. |
diffusion_model, transformer, inpainting, image_editing, interactive, context_encoding, latent_space, efficiency, poisson_blending |
2401.04056 |
A Minimaximalist Approach to Reinforcement Learning from Human Feedback |
Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, Alekh Agarwal |
We present Self-Play Preference Optimization (SPO), an algorithm for
reinforcement learning from human feedback. Our approach is minimalist in that
it does not require training a reward model nor unstable adversarial training
and is therefore rather simple to implement. Our approach is maximalist in that
it provably handles non-Markovian, intransitive, and stochastic preferences
while being robust to the compounding errors that plague offline approaches to
sequential prediction. To achieve the preceding qualities, we build upon the
concept of a Minimax Winner (MW), a notion of preference aggregation from the
social choice theory literature that frames learning from preferences as a
zero-sum game between two policies. By leveraging the symmetry of this game, we
prove that rather than using the traditional technique of dueling two policies
to compute the MW, we can simply have a single agent play against itself while
maintaining strong convergence guarantees. Practically, this corresponds to
sampling multiple trajectories from a policy, asking a rater or preference
model to compare them, and then using the proportion of wins as the reward for
a particular trajectory. We demonstrate that on a suite of continuous control
tasks, we are able to learn significantly more efficiently than reward-model
based approaches while maintaining robustness to the intransitive and
stochastic preferences that frequently occur in practice when aggregating human
judgments. |
This paper presents Self-Play Preference Optimization (SPO), a reinforcement learning from human feedback algorithm that dispenses with both reward model training and adversarial training by having a single policy play against itself. |
The work matters because it offers a simple-to-implement alternative to reward-model-based RLHF that provably handles non-Markovian, intransitive, and stochastic preferences while remaining robust to the compounding errors that affect offline approaches to sequential prediction. |
The authors build on the Minimax Winner from social choice theory, which frames preference aggregation as a zero-sum game between two policies. Exploiting the symmetry of this game, they prove that a single agent playing against itself converges to the Minimax Winner with strong guarantees: in practice, the policy samples multiple trajectories, a rater or preference model compares them, and each trajectory's share of pairwise wins is used as its reward (a sketch follows below). |
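A minimal sketch of the self-play reward computation, with placeholder `sample_trajectory` and `prefers` callables; any standard policy-gradient method can then consume the resulting (trajectory, reward) pairs.
```python
def spo_rewards(sample_trajectory, prefers, policy, k=8):
    """Sample k trajectories, compare them pairwise with a rater or preference
    model, and use each trajectory's share of wins as its scalar reward."""
    trajectories = [sample_trajectory(policy) for _ in range(k)]
    rewards = []
    for i, traj_i in enumerate(trajectories):
        wins = sum(prefers(traj_i, traj_j) for j, traj_j in enumerate(trajectories) if j != i)
        rewards.append(wins / (k - 1))
    return trajectories, rewards
```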
On a suite of continuous control tasks, SPO learns significantly more efficiently than reward-model-based approaches while remaining robust to the intransitive and stochastic preferences that commonly arise when aggregating human judgments. |
The evaluation is confined to continuous control tasks, and the method relies on querying a rater or preference model to compare sampled trajectories, which adds comparison cost during training; applying SPO to large-scale language model alignment is a natural next step. |
rlhf, preference_optimization, self_play, minimax_winner, social_choice_theory, reinforcement_learning, continuous_control |
2309.09887 |
On Model Explanations with Transferable Neural Pathways |
Xinmiao Lin, Wentao Bao, Qi Yu, Yu Kong |
Neural pathways as model explanations consist of a sparse set of neurons that
provide the same level of prediction performance as the whole model. Existing
methods primarily focus on accuracy and sparsity but the generated pathways may
offer limited interpretability thus fall short in explaining the model
behavior. In this paper, we suggest two interpretability criteria of neural
pathways: (i) same-class neural pathways should primarily consist of
class-relevant neurons; (ii) each instance's neural pathway sparsity should be
optimally determined. To this end, we propose a Generative Class-relevant
Neural Pathway (GEN-CNP) model that learns to predict the neural pathways from
the target model's feature maps. We propose to learn class-relevant information
from features of deep and shallow layers such that same-class neural pathways
exhibit high similarity. We further impose a faithfulness criterion for GEN-CNP
to generate pathways with instance-specific sparsity. We propose to transfer
the class-relevant neural pathways to explain samples of the same class and
show experimentally and qualitatively their faithfulness and interpretability. |
This paper introduces GEN-CNP, a novel method for generating class-relevant neural pathway explanations for image recognition models, aiming to improve the interpretability of model explanations while maintaining faithfulness to the original model. |
The paper addresses the limitations of existing neural pathway explanation methods that often lack interpretability and rely on global sparsity. It proposes class-wise and instance-specific interpretability concepts, enhancing the understanding of model behavior by revealing class-relevant features and allowing the transferability of explanations to other samples within the same class. |
The authors propose GEN-CNP, a model that learns to predict neural pathways from the target model's feature maps. GEN-CNP uses Recursive Feature Embedders (RFEs) to extract feature patterns and a Pathway Distillation Network (PDN) to learn class-relevant information from them. It uses Recursive Pathway Decoders (RPDs) with Distance-Aware Quantization (DAQ) to decode importance scores into sparse and faithful neural pathways. GEN-CNP is trained with knowledge distillation under sparsity constraints to stay faithful to the target model while generating sparse explanations. |
The proposed GEN-CNP method generates neural pathways with higher faithfulness to the original model, as demonstrated by improved performance on metrics like mIC and mDC. The generated pathways exhibit higher class-relevance, confirmed by higher acIOU scores and the transferability experiments, showing consistent and faithful explanations for samples within the same class. Qualitative visualizations using Grad-CAM and neural pathway gradients highlight that GEN-CNP identifies more semantically meaningful features compared to existing methods. |
The authors acknowledge limitations in terms of computational cost and the current implementation's focus on image recognition models. Future work could explore more computationally efficient architectures for GEN-CNP and extend its applicability to other domains beyond image recognition, such as natural language processing or time series analysis. |
analysis, interpretability, neural_pathway |
2308.07648 |
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval |
Chaorui Deng, Qi Chen, Pengda Qin, Da Chen, Qi Wu |
In text-video retrieval, recent works have benefited from the powerful
learning capabilities of pre-trained text-image foundation models (e.g., CLIP)
by adapting them to the video domain. A critical problem for them is how to
effectively capture the rich semantics inside the video using the image encoder
of CLIP. To tackle this, state-of-the-art methods adopt complex cross-modal
modeling techniques to fuse the text information into video frame
representations, which, however, incurs severe efficiency issues in large-scale
retrieval systems as the video representations must be recomputed online for
every text query. In this paper, we discard this problematic cross-modal fusion
process and aim to learn semantically-enhanced representations purely from the
video, so that the video representations can be computed offline and reused for
different texts. Concretely, we first introduce a spatial-temporal "Prompt
Cube" into the CLIP image encoder and iteratively switch it within the encoder
layers to efficiently incorporate the global video semantics into frame
representations. We then propose to apply an auxiliary video captioning
objective to train the frame representations, which facilitates the learning of
detailed video semantics by providing fine-grained guidance in the semantic
space. With a naive temporal fusion strategy (i.e., mean-pooling) on the
enhanced frame representations, we obtain state-of-the-art performances on
three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC. |
This paper proposes Prompt Switch, an efficient method for adapting the CLIP model for text-video retrieval by introducing a Prompt Cube mechanism to enhance the learning of global and detailed video semantics, achieving state-of-the-art performance while maintaining high efficiency. |
This paper addresses the efficiency bottleneck in existing CLIP-based text-video retrieval methods that rely on computationally expensive cross-modal fusion. It proposes a novel approach to enhance video representation learning within the CLIP framework, enabling efficient and effective retrieval by decoupling video and text modalities during inference. |
The authors introduce a Prompt Cube, a 3D tensor integrated into the CLIP image encoder. This cube undergoes a Prompt Switch operation, transposing its spatial and temporal dimensions before each self-attention layer to capture global video semantics. Additionally, an auxiliary video captioning objective is employed during training to enhance the learning of detailed video semantics. Finally, a simple mean pooling strategy is used on the enhanced frame representations to obtain the video representation. |
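A rough sketch of the switch operation, under the simplifying assumption that the prompt cube is square, i.e. shaped (T frames, T prompt tokens, D), so transposing its first two axes is shape-preserving; the layer internals are generic placeholders rather than CLIP's actual modules.
```python
import torch

def encode_video_with_prompt_cube(frame_tokens, prompt_cube, layers):
    """frame_tokens: (T, N_patches, D); prompt_cube: (T, T, D) square cube."""
    for layer in layers:
        num_prompts = prompt_cube.size(1)
        x = torch.cat([prompt_cube, frame_tokens], dim=1)        # prepend prompts to each frame
        x = layer(x)                                             # per-frame self-attention block
        prompt_cube, frame_tokens = x[:, :num_prompts], x[:, num_prompts:]
        prompt_cube = prompt_cube.transpose(0, 1).contiguous()   # "switch" the frame/prompt axes
    return frame_tokens.mean(dim=(0, 1))                         # naive mean pooling over frames/patches
```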
The proposed Prompt Switch method achieves state-of-the-art performance on three benchmark datasets (MSR-VTT, MSVD, LSMDC) for text-video retrieval, outperforming previous methods, especially under the text-agnostic temporal fusion setting. It demonstrates a significant improvement in efficiency compared to methods relying on cross-modal temporal fusion, making it more suitable for large-scale retrieval systems. |
The authors acknowledge that their captioning module is relatively simple and might benefit from more advanced architectures. For future work, they suggest exploring other pre-training tasks or incorporating external knowledge to further enhance the model's performance. |
clip, text-video retrieval, video representation learning, prompt learning, efficiency, auxiliary task learning, captioning |
2404.05607 |
A Training-Free Plug-and-Play Watermark Framework for Stable Diffusion |
Guokai Zhang, Lanjun Wang, Yuting Su, An-An Liu |
Nowadays, the family of Stable Diffusion (SD) models has gained prominence
for its high quality outputs and scalability. This has also raised security
concerns on social media, as malicious users can create and disseminate harmful
content. Existing approaches involve training components or entire SDs to embed
a watermark in generated images for traceability and responsibility
attribution. However, in the era of AI-generated content (AIGC), the rapid
iteration of SDs renders retraining with watermark models costly. To address
this, we propose a training-free plug-and-play watermark framework for SDs.
Without modifying any components of SDs, we embed diverse watermarks in the
latent space, adapting to the denoising process. Our experimental findings
reveal that our method effectively harmonizes image quality and watermark
invisibility. Furthermore, it performs robustly under various attacks. We also
have validated that our method is generalized to multiple versions of SDs, even
without retraining the watermark model. |
This paper introduces a training-free, plug-and-play watermarking framework for Stable Diffusion models, enabling the embedding of diverse watermarks in the latent space without requiring any retraining of the SD model itself. |
The paper addresses the growing concern of misuse of AI-generated content, particularly with the rapid evolution of SD models. The proposed framework provides a cost-efficient and adaptable solution for watermarking, ensuring traceability and responsibility attribution for generated images. |
The authors develop a watermark encoder-decoder architecture trained solely on the frozen VAE encoder-decoder component of SD. During inference, the compressed watermark is embedded into the latent code after denoising, minimizing impact on image quality. The framework's generalization ability is analyzed, and extensive experiments are conducted to evaluate its performance on various SD versions and under different attacks. |
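A conceptual sketch of the inference-time embedding path, using placeholder `wm_encoder` / `wm_decoder` modules and a generic `sd` handle exposing the denoising loop and the frozen VAE decoder; this is not the authors' released API.
```python
import torch

@torch.no_grad()
def generate_with_watermark(sd, wm_encoder, prompt, bits):
    z0 = sd.denoise(prompt)              # clean latent after the unmodified denoising loop
    z_w = z0 + wm_encoder(bits, z0)      # add a small watermark residual in latent space
    return sd.vae_decode(z_w)            # decode; the watermark should remain invisible

@torch.no_grad()
def extract_watermark(wm_decoder, image):
    return (wm_decoder(image) > 0.5).int()   # recover the embedded bit string
```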
The proposed framework demonstrates excellent watermark invisibility, achieving high PSNR and SSIM scores while minimally affecting image quality (even showing slight FID improvement). The watermark extraction quality is high, with NC exceeding 96%. The framework exhibits strong generalization across different SD versions (v1-1, v1-4, v1-5) without retraining and shows robustness against common image manipulations like blurring, cropping, and noise addition. |
The authors acknowledge limitations in handling high-angle rotations due to the watermark's spatial dependence. Future work could explore rotation-invariant watermarking techniques. Additionally, while the framework minimizes noticeable artifacts, some localized pixel variations might occur in specific samples, requiring further investigation. |
diffusion_model, stable diffusion, watermarking, training-free, plug-and-play, aigc, image_generation, robustness, latent_space |
2309.04372 |
MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers |
Sijia Li, Chen Chen, Haonan Lu |
Diffusion-model-based text-guided image generation has recently made
astounding progress, producing fascinating results in open-domain image
manipulation tasks. Few models, however, currently have complete zero-shot
capabilities for both global and local image editing due to the complexity and
diversity of image manipulation tasks. In this work, we propose a method with a
mixture-of-expert (MOE) controllers to align the text-guided capacity of
diffusion models with different kinds of human instructions, enabling our model
to handle various open-domain image manipulation tasks with natural language
instructions. First, we use large language models (ChatGPT) and conditional
image synthesis models (ControlNet) to generate a large-scale global image
transfer dataset in addition to the instruction-based local image editing
dataset. Then, using an MOE technique and task-specific adaptation training on
a large-scale dataset, our conditional diffusion model can edit images globally
and locally. Extensive experiments demonstrate that our approach performs
surprisingly well on various image manipulation tasks when dealing with
open-domain images and arbitrary human instructions. Please refer to our
project page: [https://oppo-mente-lab.github.io/moe_controller/] |
This paper introduces MoEController, a novel method for arbitrary image manipulation guided by text instructions, tackling the challenge of performing both global and local image editing in a unified framework. |
This paper is important as it addresses the limitations of existing image manipulation methods that struggle to effectively handle both global and local edits based on open-domain text instructions. It proposes a novel approach using a mixture-of-expert (MOE) framework to enhance the model's adaptability to diverse image manipulation tasks. |
The authors first create a large-scale dataset for global image manipulation using ChatGPT to generate target captions and ControlNet to generate image pairs. They then design an MOE model with a fusion module, multiple expert models, and a gate system to discriminate between different instruction semantics and adapt to specific tasks. The model is trained with a reconstruction loss to ensure image entity consistency. |
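A toy sketch of instruction-gated expert mixing, purely to illustrate the gating idea; the module names and shapes are invented and much simpler than the paper's mixture-of-expert controllers.
```python
import torch
import torch.nn as nn

class InstructionGatedExperts(nn.Module):
    """Weight task-specific experts by a gate computed from the instruction embedding."""
    def __init__(self, dim, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, features, instruction_emb):
        weights = self.gate(instruction_emb).softmax(dim=-1)                     # (B, E)
        expert_outs = torch.stack([e(features) for e in self.experts], dim=-1)   # (B, D, E)
        return (expert_outs * weights.unsqueeze(1)).sum(dim=-1)                  # (B, D)
```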
MoEController demonstrates superior performance in both global and local image manipulation tasks compared to existing methods. It effectively handles complex style transfers, local edits, and object manipulations. Quantitative evaluations using CLIP metrics and user studies confirm its effectiveness and adaptability to open-domain instructions. |
The authors suggest extending MoEController to handle a wider range of human instructions and more complex image manipulation tasks in the future. Further exploration of expert model design and optimization of the gating mechanism could further improve performance. |
diffusion_model, image_manipulation, llm, moe, controlnet, chatgpt, global_editing, local_editing |
2404.06139 |
DiffHarmony: Latent Diffusion Model Meets Image Harmonization |
Pengfei Zhou, Fangxiang Feng, Xiaojie Wang |
Image harmonization, which involves adjusting the foreground of a composite
image to attain a unified visual consistency with the background, can be
conceptualized as an image-to-image translation task. Diffusion models have
recently promoted the rapid development of image-to-image translation tasks.
However, training diffusion models from scratch is computationally intensive.
Fine-tuning pre-trained latent diffusion models entails dealing with the
reconstruction error induced by the image compression autoencoder, making it
unsuitable for image generation tasks that involve pixel-level evaluation
metrics. To deal with these issues, in this paper, we first adapt a pre-trained
latent diffusion model to the image harmonization task to generate the
harmonious but potentially blurry initial images. Then we implement two
strategies: utilizing higher-resolution images during inference and
incorporating an additional refinement stage, to further enhance the clarity of
the initially harmonized images. Extensive experiments on iHarmony4 datasets
demonstrate the superiority of our proposed method. The code and model will be
made publicly available at https://github.com/nicecv/DiffHarmony . |
This paper introduces DiffHarmony, a novel image harmonization method that leverages a pre-trained latent diffusion model (Stable Diffusion) to generate harmonious images, enhanced by higher-resolution inference and a refinement stage to mitigate image distortion caused by the inherent compression in latent diffusion models. |
This paper is significant because it addresses the limitations of applying pre-trained latent diffusion models to image harmonization, particularly the reconstruction errors due to image compression. It offers a novel approach to achieve state-of-the-art results on image harmonization tasks by effectively adapting and enhancing the capabilities of pre-trained latent diffusion models. |
The authors adapt a pre-trained Stable Diffusion model for image harmonization by incorporating composite images and foreground masks as input conditions. To mitigate image distortion, they employ two strategies: using higher-resolution images during inference and adding a refinement stage using a UNet model. The method is evaluated on the iHarmony4 dataset using PSNR, MSE, and fMSE metrics and compared with other state-of-the-art methods. |
DiffHarmony achieves state-of-the-art results on the iHarmony4 dataset, demonstrating the effectiveness of the proposed approach. Notably, the method excels in harmonizing images with larger foreground regions. Higher-resolution inference significantly improves performance, and the refinement stage further enhances the quality of generated images. Additionally, the authors conducted an ablation study to analyze the contribution of each component and performed an advanced analysis comparing their method with a state-of-the-art model trained on higher-resolution images. |
The authors acknowledge that their method's performance on images with small foreground regions requires further investigation. Future work could explore using even higher image resolutions or employing better pre-trained diffusion models to address the limitations of information compression. Additionally, exploring alternative refinement techniques or more advanced network architectures for the refinement stage could lead to further improvements. |
diffusion_model, image_harmonization, stable_diffusion, image_generation, refinement, vae, image_distortion, high_resolution |
2312.03701 |
Return of Unconditional Generation: A Self-supervised Representation Generation Method |
Tianhong Li, Dina Katabi, Kaiming He |
Unconditional generation -- the problem of modeling data distribution without
relying on human-annotated labels -- is a long-standing and fundamental
challenge in generative models, creating a potential of learning from
large-scale unlabeled data. In the literature, the generation quality of an
unconditional method has been much worse than that of its conditional
counterpart. This gap can be attributed to the lack of semantic information
provided by labels. In this work, we show that one can close this gap by
generating semantic representations in the representation space produced by a
self-supervised encoder. These representations can be used to condition the
image generator. This framework, called Representation-Conditioned Generation
(RCG), provides an effective solution to the unconditional generation problem
without using labels. Through comprehensive experiments, we observe that RCG
significantly improves unconditional generation quality: e.g., it achieves a
new state-of-the-art FID of 2.15 on ImageNet 256x256, largely reducing the
previous best of 5.91 by a relative 64%. Our unconditional results are situated
in the same tier as the leading class-conditional ones. We hope these
encouraging observations will attract the community's attention to the
fundamental problem of unconditional generation. Code is available at
https://github.com/LTH14/rcg. |
This paper introduces Representation-Conditioned Generation (RCG), a novel framework for unconditional image generation that leverages self-supervised representations to guide the generation process, effectively closing the quality gap between unconditional and conditional generation. |
This paper is important because it addresses the long-standing challenge of poor-quality unconditional image generation compared to conditional methods. It proposes a method to leverage large-scale unlabeled datasets for training high-quality generative models by effectively utilizing self-supervised representations. |
The authors propose a three-stage approach: 1) a pre-trained self-supervised encoder maps images to a representation space; 2) a lightweight diffusion model learns to generate representations within this space; 3) a conditional image generator (e.g., ADM, DiT, or MAGE) generates images conditioned on these representations. |
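A high-level sketch of the three stages, with all components as placeholders (a frozen self-supervised encoder, a small representation diffusion model, and a conditional image generator); the interfaces are assumptions, not taken from the released code.
```python
import torch

def rcg_train_step(encoder, rep_diffusion, image_generator, images):
    reps = encoder(images).detach()                       # stage 1: frozen SSL representations
    loss_rep = rep_diffusion.loss(reps)                   # stage 2: diffusion in representation space
    loss_img = image_generator.loss(images, cond=reps)    # stage 3: pixels conditioned on reps
    return loss_rep + loss_img

@torch.no_grad()
def rcg_sample(rep_diffusion, image_generator, num_images=4):
    reps = rep_diffusion.sample(num_images)               # generate representations without labels
    return image_generator.sample(cond=reps)              # decode them into images
```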
RCG significantly improves unconditional generation quality across various image generators and datasets. It achieves state-of-the-art FID scores on ImageNet 256x256, surpassing previous unconditional methods and rivaling leading class-conditional methods. RCG also enables guidance in unconditional generation, further boosting performance. The method allows semantic interpolation by manipulating representations and can be easily extended to class-conditional generation. |
The paper mentions that while RCG excels in generating diverse and high-quality images, it still faces challenges in generating text, regular shapes, and realistic humans, similar to other ImageNet generative models. Future work could explore pre-training on larger unlabeled datasets and adapting to various downstream generative tasks with minimal overhead by training only the representation generator on small labeled datasets. |
diffusion_model, gan, unconditional generation, self-supervised representation, image generation, representation learning |
2403.15378 |
Long-CLIP: Unlocking the Long-Text Capability of CLIP |
Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang |
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for
zero-shot classification, text-image retrieval, and text-image generation by
aligning image and text modalities. Despite its widespread adoption, a
significant limitation of CLIP lies in the inadequate length of text input. The
length of the text token is restricted to 77, and an empirical study shows the
actual effective length is even less than 20. This prevents CLIP from handling
detailed descriptions, limiting its applications for image retrieval and
text-to-image generation with extensive prerequisites. To this end, we propose
Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input,
retains or even surpasses its zero-shot generalizability, and aligns the CLIP
latent space, making it readily replace CLIP without any further adaptation in
downstream frameworks. Nevertheless, achieving this goal is far from
straightforward, as simplistic fine-tuning can result in a significant
degradation of CLIP's performance. Moreover, substituting the text encoder with
a language model supporting longer contexts necessitates pretraining with vast
amounts of data, incurring significant expenses. Accordingly, Long-CLIP
introduces an efficient fine-tuning solution on CLIP with two novel strategies
designed to maintain the original capabilities, including (1) a
knowledge-preserved stretching of positional embedding and (2) a primary
component matching of CLIP features. By leveraging just one million extra
long text-image pairs, Long-CLIP has shown the superiority to CLIP for about
20% in long caption text-image retrieval and 6% in traditional text-image
retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers
enhanced capabilities for generating images from detailed text descriptions by
replacing CLIP in a plug-and-play manner. |
The paper introduces Long-CLIP, an enhanced version of CLIP designed to address the limitation of short text input in the original model, enabling it to process longer and more detailed textual descriptions while retaining its zero-shot generalization capabilities. |
This work is important as it enables CLIP to handle detailed descriptions, thereby broadening its applicability in image retrieval, text-to-image generation, and other tasks requiring comprehensive textual understanding. This advancement holds the potential to significantly enhance the performance and versatility of CLIP-based applications. |
The authors propose two novel strategies: (1) Knowledge-Preserved Stretching: interpolating the positional embedding of less-trained positions while preserving the well-trained ones to support longer text input without disrupting the representation of short text positions; (2) Primary Component Matching: aligning both fine-grained image features with long captions and coarse-grained features (extracted using PCA) with short summary captions during fine-tuning to enable the model to capture detailed attributes and understand their importance. Long-CLIP is fine-tuned on the ShareGPT4V dataset, which contains image-long caption pairs. |
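A hedged sketch of the knowledge-preserved stretching step: keep the first few well-trained positions unchanged and linearly interpolate the remaining CLIP text positional embeddings to a longer length. The `keep=20` and `ratio=4` values below (giving 20 + 4 x 57 = 248 positions) are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb, keep=20, ratio=4):
    """pos_emb: (77, D) CLIP text positional embedding -> (keep + ratio*(77-keep), D)."""
    kept = pos_emb[:keep]                                   # preserve well-trained positions as-is
    rest = pos_emb[keep:].t().unsqueeze(0)                  # (1, D, 77-keep) for 1-D interpolation
    stretched = F.interpolate(rest, scale_factor=ratio, mode="linear", align_corners=True)
    return torch.cat([kept, stretched.squeeze(0).t()], dim=0)
```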
Long-CLIP demonstrates superior performance compared to the original CLIP in various tasks, including: (1) Long-text image retrieval: It significantly improves the recall rate by approximately 20% on datasets like ShareGPT4V and Urban. (2) Short-text image retrieval: It also shows improvement on benchmarks like COCO and Flickr30k. (3) Zero-shot classification: It retains comparable performance to CLIP on ImageNet and CIFAR. (4) Text-to-image generation: It exhibits a plug-and-play effect, enabling existing models like Stable Diffusion to generate images from detailed descriptions without additional training. |
The paper acknowledges that Long-CLIP, despite its improvements, still has a finite input length limit, although significantly extended. Future work could explore relative positional embeddings like RoPE to potentially overcome this limitation. Additionally, the authors suggest exploring the scaling-up potential of Long-CLIP by training with a larger dataset of long text-image pairs, as the current work only utilizes a relatively small portion of the ShareGPT4V dataset. |
diffusion_model, clip, analysis, image_retrieval, text-to-image_generation, interpretability |
2402.19427 |
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models |
Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre |
Recurrent neural networks (RNNs) have fast inference and scale efficiently on
long sequences, but they are difficult to train and hard to scale. We propose
Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that
mixes gated linear recurrences with local attention. Hawk exceeds the reported
performance of Mamba on downstream tasks, while Griffin matches the performance
of Llama-2 despite being trained on over 6 times fewer tokens. We also show
that Griffin can extrapolate on sequences significantly longer than those seen
during training. Our models match the hardware efficiency of Transformers
during training, and during inference they have lower latency and significantly
higher throughput. We scale Griffin up to 14B parameters, and explain how to
shard our models for efficient distributed training. |
This paper introduces Hawk and Griffin, two novel recurrent neural network architectures for language modeling that address the scalability limitations of traditional RNNs and offer advantages over Transformers on tasks involving long sequences. |
This paper is important as it presents a potential solution to the long-standing challenge of efficiently scaling RNNs for language modeling. The proposed models, Hawk and Griffin, demonstrate competitive performance with Transformers while exhibiting superior efficiency in handling long sequences, which is crucial for various NLP tasks. |
The authors developed Hawk, a pure RNN model based on the novel Real-Gated Linear Recurrent Unit (RG-LRU), and Griffin, a hybrid model combining RG-LRU with local attention. They conducted scaling experiments, training these models on the MassiveText dataset with up to 300B tokens, comparing their performance to Transformer baselines and state-of-the-art models like Mamba and Llama-2. They analyzed training efficiency on TPUs, inference speed, and capabilities in handling long contexts and performing tasks like copying and retrieval. |
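A simplified, sequential sketch of a gated linear recurrence in the spirit of the RG-LRU (the actual layer uses a careful parameterization and a parallel scan; the constants and gating details here are approximations, not the paper's exact formulation).
```python
import torch
import torch.nn as nn

class GatedLinearRecurrence(nn.Module):
    def __init__(self, dim, c=8.0):
        super().__init__()
        self.c = c
        self.decay_logit = nn.Parameter(torch.zeros(dim))   # per-channel learnable decay
        self.recurrence_gate = nn.Linear(dim, dim)
        self.input_gate = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (batch, seq_len, dim)
        h = x.new_zeros(x.size(0), x.size(2))
        outputs = []
        for t in range(x.size(1)):
            r = torch.sigmoid(self.recurrence_gate(x[:, t]))
            i = torch.sigmoid(self.input_gate(x[:, t]))
            a = torch.sigmoid(self.decay_logit) ** (self.c * r)   # gated decay in (0, 1)
            h = a * h + torch.sqrt(1 - a**2) * (i * x[:, t])      # purely linear recurrence
            outputs.append(h)
        return torch.stack(outputs, dim=1)
```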
Hawk and Griffin demonstrated power-law scaling in training, matching the efficiency of Transformers. Hawk-3B outperformed Mamba-3B on downstream tasks despite being trained on half the data, and Griffin-7B and Griffin-14B achieved comparable results to Llama-2 with significantly less training data. They also exhibited faster inference, especially on longer sequences, due to their smaller memory footprint compared to Transformers. Notably, both models showed superior performance in extrapolating to longer sequences than those seen during training. |
The authors acknowledge that while Griffin shows promise in copying and retrieval tasks, more research is needed to match the performance of Transformers in this domain, particularly when evaluating pre-trained models without fine-tuning. Future work could also involve exploring different local attention window sizes for Griffin, potentially dynamically adjusting them based on sequence length and hardware constraints. |
rnn, transformer, language_model, long_sequence, efficiency, inference, scaling, local_attention, copying, retrieval |
2402.13929 |
SDXL-Lightning: Progressive Adversarial Diffusion Distillation |
Shanchuan Lin, Anran Wang, Xiao Yang |
We propose a diffusion distillation method that achieves new state-of-the-art
in one-step/few-step 1024px text-to-image generation based on SDXL. Our method
combines progressive and adversarial distillation to achieve a balance between
quality and mode coverage. In this paper, we discuss the theoretical analysis,
discriminator design, model formulation, and training techniques. We
open-source our distilled SDXL-Lightning models both as LoRA and full UNet
weights. |
This paper introduces SDXL-Lightning, a novel diffusion distillation method that achieves state-of-the-art performance in one-step/few-step 1024px text-to-image generation based on SDXL. |
This paper is important because it addresses the limitations of existing diffusion models in generating high-quality images with few inference steps, offering a significant speed and computational advantage over previous methods. |
The authors propose a progressive adversarial diffusion distillation method. The approach combines progressive distillation with an adversarial loss function and uses a pre-trained diffusion UNet encoder as the discriminator backbone, enabling efficient distillation in latent space. The method progressively distills the model from 128 steps to 1 step, using both conditional and unconditional adversarial objectives to balance image quality and mode coverage. |
The resulting SDXL-Lightning models achieve state-of-the-art performance in one-step/few-step 1024px text-to-image generation, exceeding the quality of previous methods like SDXL-Turbo and LCM. The models demonstrate superior high-resolution detail preservation while maintaining comparable text alignment and diversity. Notably, they even surpass the original SDXL model in quality for 4-step and 8-step generation. |
The paper acknowledges limitations, including the need for separate checkpoints for different inference steps and the potential for further improvement in the UNet architecture for one-step generation. Future work could explore distilling models with multiple aspect ratios and researching optimal architectures for one-step generation. |
diffusion_model, gan, distillation, text-to-image, adversarial_training, image_generation, sdxl |
2311.04897 |
Future Lens: Anticipating Subsequent Tokens from a Single Hidden State |
Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C. Wallace, David Bau |
We conjecture that hidden state vectors corresponding to individual input
tokens encode information sufficient to accurately predict several tokens
ahead. More concretely, in this paper we ask: Given a hidden (internal)
representation of a single token at position $t$ in an input, can we reliably
anticipate the tokens that will appear at positions $\geq t + 2$? To test this,
we measure linear approximation and causal intervention methods in GPT-J-6B to
evaluate the degree to which individual hidden states in the network contain
signal rich enough to predict future hidden states and, ultimately, token
outputs. We find that, at some layers, we can approximate a model's output with
more than 48% accuracy with respect to its prediction of subsequent tokens
through a single hidden state. Finally we present a "Future Lens" visualization
that uses these methods to create a new view of transformer states. |
This paper investigates whether hidden state vectors in large language models (LLMs) encode information sufficient to predict multiple tokens ahead, going beyond the typical one-token prediction. |
This research is significant because it probes the depth of information encoded within individual hidden states of LLMs, potentially revealing a deeper understanding of how these models process and retain information over longer spans of text. |
The authors test their hypothesis by employing three methods on GPT-J-6B: (1) training linear models to approximate future hidden states and decode them, (2) conducting causal intervention by transplanting hidden states to different contexts, and (3) training a "soft prompt" to optimize the extraction of subsequent token information from a hidden state. |
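A sketch of the linear-approximation probe: fit a linear map from a hidden state at position t to the distribution the full model assigns to the token at position t+2. The precollected `hidden_states` and `future_logits` tensors are assumptions for illustration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_future_probe(hidden_states, future_logits, epochs=10, lr=1e-3):
    """hidden_states: (N, d_model); future_logits: (N, vocab) from the full model."""
    probe = nn.Linear(hidden_states.size(1), future_logits.size(1))
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.kl_div(probe(hidden_states).log_softmax(dim=-1),
                        future_logits.softmax(dim=-1), reduction="batchmean")
        loss.backward()
        optimizer.step()
    return probe
```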
The study finds that individual hidden states, especially in middle layers, contain significant information about future tokens, going beyond immediate next-token predictions. Notably, the "learned prompt causal intervention" method achieves the highest accuracy in predicting subsequent tokens, even surpassing a bigram baseline. |
The authors acknowledge limitations regarding the training data size, the focus on a single LLM (GPT-J-6B), the lack of prior baselines for this specific task, and the limitation of predicting up to four tokens ahead. Future work could explore larger datasets, other LLMs, alternative baseline models (e.g., RNNs, Non-Autoregressive generation), and extend the prediction horizon beyond four tokens. |
llm, analysis, interpretability, transformer, hidden_state, causal_intervention |
2311.18158 |
HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation |
Yifan Zhang, Bryan Hooi |
Diffusion models have revolutionized text-to-image generation, but their
real-world applications are hampered by the extensive time needed for hundreds
of diffusion steps. Although progressive distillation has been proposed to
speed up diffusion sampling to 2-8 steps, it still falls short in one-step
generation, and necessitates training multiple student models, which is highly
parameter-extensive and time-consuming. To overcome these limitations, we
introduce High-frequency-Promoting Adaptation (HiPA), a parameter-efficient
approach to enable one-step text-to-image diffusion. Grounded in the insight
that high-frequency information is essential but highly lacking in one-step
diffusion, HiPA focuses on training one-step, low-rank adaptors to specifically
enhance the under-represented high-frequency abilities of advanced diffusion
models. The learned adaptors empower these diffusion models to generate
high-quality images in just a single step. Compared with progressive
distillation, HiPA achieves much better performance in one-step text-to-image
generation (37.3 $\rightarrow$ 23.8 in FID-5k on MS-COCO 2017) and 28.6x
training speed-up (108.8 $\rightarrow$ 3.8 A100 GPU days), requiring only 0.04%
training parameters (7,740 million $\rightarrow$ 3.3 million). We also
demonstrate HiPA's effectiveness in text-guided image editing, inpainting and
super-resolution tasks, where our adapted models consistently deliver
high-quality outputs in just one diffusion step. The source code will be
released. |
This paper introduces High-frequency-Promoting Adaptation (HiPA), a parameter-efficient method for accelerating text-to-image diffusion models to generate high-quality images in a single step by training low-rank adaptors that enhance the model's ability to generate high-frequency details. |
This paper is important because it addresses the limitations of existing text-to-image diffusion models, which require many diffusion steps and thus extensive processing time. HiPA provides a solution for real-time applications that rely on fast and high-quality image generation. |
The authors analyze the image generation process of existing diffusion models and find that one-step diffusion lacks high-frequency details crucial for realistic image synthesis. They then propose HiPA, which trains low-rank adaptors using a novel adaptation loss that combines a spatial perceptual loss and a high-frequency promoted loss. This approach encourages the model to generate images with enhanced high-frequency details in just one step. |
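The high-frequency promoted loss is only described at a high level here. As a minimal sketch (assuming a simple FFT high-pass mask and omitting the spatial perceptual term; not the paper's exact formulation), such a term could look like:
```python
import torch

def high_freq_promoted_loss(pred, target, cutoff_ratio=0.25):
    # pred, target: (B, C, H, W) one-step prediction and its multi-step target
    pred_f = torch.fft.fftshift(torch.fft.fft2(pred), dim=(-2, -1))
    tgt_f = torch.fft.fftshift(torch.fft.fft2(target), dim=(-2, -1))
    _, _, H, W = pred.shape
    yy, xx = torch.meshgrid(
        torch.arange(H, device=pred.device),
        torch.arange(W, device=pred.device),
        indexing="ij",
    )
    # Keep only frequencies outside a central low-frequency box.
    low = ((yy - H // 2).abs() < H * cutoff_ratio / 2) & ((xx - W // 2).abs() < W * cutoff_ratio / 2)
    return ((pred_f - tgt_f).abs() * (~low)).mean()
```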
HiPA significantly outperforms previous one-step and few-step methods in terms of both image quality and training efficiency. Experiments on MS-COCO datasets demonstrate that HiPA achieves comparable results to multi-step diffusion models while being significantly faster. The method is also successfully applied to text-guided image editing, inpainting, and super-resolution, demonstrating its versatility for various real-world image generation tasks. |
The authors acknowledge that while HiPA significantly improves one-step generation, there is still room for further enhancement in image quality compared to multi-step diffusion models. They suggest exploring the adaptation of more advanced diffusion models, such as SD-XL and DALL-E3, as a future direction. Another limitation is the occasional presence of artifacts in the generated images, which the authors attribute, in part, to limitations inherited from the original multi-step models. As a potential solution, they propose using HiPA for generating quick drafts and then refining them using the original multi-step model for higher quality. |
diffusion_model, text-to-image, one-step generation, high-frequency, parameter-efficient, low-rank adaptation, image editing, inpainting, super-resolution |
2311.17002 |
Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following |
Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou |
Existing text-to-image (T2I) diffusion models usually struggle in
interpreting complex prompts, especially those with quantity, object-attribute
binding, and multi-subject descriptions. In this work, we introduce a semantic
panel as the middleware in decoding texts to images, supporting the generator
to better follow instructions. The panel is obtained through arranging the
visual concepts parsed from the input text by the aid of large language models,
and then injected into the denoising network as a detailed control signal to
complement the text condition. To facilitate text-to-panel learning, we come up
with a carefully designed semantic formatting protocol, accompanied by a
fully-automatic data preparation pipeline. Thanks to such a design, our
approach, which we call Ranni, manages to enhance a pre-trained T2I generator
regarding its textual controllability. More importantly, the introduction of
the generative middleware brings a more convenient form of interaction (i.e.,
directly adjusting the elements in the panel or using language instructions)
and further allows users to finely customize their generation, based on which
we develop a practical system and showcase its potential in continuous
generation and chatting-based editing. Our project page is at
https://ranni-t2i.github.io/Ranni. |
This paper introduces Ranni, a text-to-image generation framework that enhances the controllability and accuracy of existing diffusion models by using a 'semantic panel' as a structured intermediary representation between text prompts and images. |
This paper is important because it addresses the limitations of current text-to-image models in interpreting complex prompts by introducing a novel semantic panel that facilitates better text-image alignment and offers more intuitive editing capabilities. |
The authors propose Ranni, a framework that leverages Large Language Models (LLMs) to translate text prompts into a structured 'semantic panel' containing visual concepts with attributes like bounding boxes, colors, and keypoints. This panel then guides a diffusion model to generate images that adhere more closely to the input text. They also introduce an automatic data preparation pipeline and conduct experiments on various prompts to evaluate Ranni's ability to follow instructions related to quantity, spatial relationships, attribute binding, and multiple objects. |
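To make the intermediate representation concrete, a semantic panel can be thought of as a list of attributed visual concepts; the sketch below uses hypothetical field names (the paper's formatting protocol may differ):
```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class VisualConcept:
    description: str                          # e.g. "a red mug"
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1), normalized
    color: Optional[str] = None               # bound attribute, if any
    keypoints: List[Tuple[float, float]] = field(default_factory=list)

# A prompt like "two red mugs on a wooden table" could be parsed by the LLM
# into a panel such as this, which then conditions the denoising network.
panel = [
    VisualConcept("a red mug", bbox=(0.10, 0.55, 0.35, 0.85), color="red"),
    VisualConcept("a red mug", bbox=(0.40, 0.55, 0.65, 0.85), color="red"),
    VisualConcept("a wooden table", bbox=(0.00, 0.60, 1.00, 1.00)),
]
```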
Ranni demonstrates superior performance in following complex prompts compared to existing methods, particularly in terms of quantity awareness and spatial relationship understanding. It also shows promise as a unified image creation system, enabling interactive editing through manual manipulation or LLM-driven instructions in a chat-based interface. |
The authors identify limitations, such as occasional inaccuracies in the initial semantic panel generation and the need for further exploration in controlling object appearance beyond bounding boxes. Future work could focus on improving the precision of the semantic panel, exploring alternative LLM architectures, and expanding the range of controllable attributes for enhanced editing capabilities. |
diffusion_model, llm, text-to-image, controllable_generation, semantic_panel, interactive_editing, chat-based_generation |
2405.08246 |
Compositional Text-to-Image Generation with Dense Blob Representations |
Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat |
Existing text-to-image models struggle to follow complex text prompts,
raising the need for extra grounding inputs for better controllability. In this
work, we propose to decompose a scene into visual primitives - denoted as dense
blob representations - that contain fine-grained details of the scene while
being modular, human-interpretable, and easy-to-construct. Based on blob
representations, we develop a blob-grounded text-to-image diffusion model,
termed BlobGEN, for compositional generation. Particularly, we introduce a new
masked cross-attention module to disentangle the fusion between blob
representations and visual features. To leverage the compositionality of large
language models (LLMs), we introduce a new in-context learning approach to
generate blob representations from text prompts. Our extensive experiments show
that BlobGEN achieves superior zero-shot generation quality and better
layout-guided controllability on MS-COCO. When augmented by LLMs, our method
exhibits superior numerical and spatial correctness on compositional image
generation benchmarks. Project page: https://blobgen-2d.github.io. |
This paper introduces BlobGEN, a text-to-image generation model that uses dense blob representations as grounding input to improve controllability and compositionality. |
This paper addresses the limitations of existing text-to-image models in following complex prompts and offers a modular, user-friendly approach to control image generation by decomposing scenes into semantically rich visual primitives. |
The authors propose dense blob representations, consisting of blob parameters (specifying location, size, orientation) and blob descriptions (text describing appearance), extracted using existing segmentation and captioning models. They develop a blob-grounded diffusion model with a novel masked cross-attention module to align blobs with corresponding visual features. Additionally, they introduce an in-context learning approach for LLMs to generate blob representations from text, enabling compositional generation. |
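As an illustration of the masking idea (shapes and the masking rule are assumptions, not the paper's exact module), a masked cross-attention step that lets each visual token attend only to the blobs covering it might look like:
```python
import torch

def masked_cross_attention(visual, blob_emb, blob_mask):
    # visual:    (B, N, D) visual feature tokens (queries)
    # blob_emb:  (B, K, D) per-blob text embeddings (keys/values)
    # blob_mask: (B, N, K) True where visual token n lies inside blob k
    scale = visual.shape[-1] ** -0.5
    attn = torch.einsum("bnd,bkd->bnk", visual, blob_emb) * scale
    attn = attn.masked_fill(~blob_mask, float("-inf"))
    attn = torch.nan_to_num(attn.softmax(dim=-1), nan=0.0)  # tokens covered by no blob get zero output
    return torch.einsum("bnk,bkd->bnd", attn, blob_emb)
```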
BlobGEN achieves superior zero-shot generation quality on MS-COCO, showing lower FID scores compared to baseline models. It exhibits strong layout-guided controllability, evidenced by higher region-level CLIP scores and successful object editing and repositioning capabilities. When augmented with LLMs, BlobGEN excels in compositional generation tasks, surpassing LayoutGPT in numerical and spatial accuracy on the NSR-1K benchmark. |
Limitations include the inability to perfectly reconstruct images solely from blobs, occasional failures in image editing, and robustness issues with LLM-generated blobs in compositional tasks. Future work could explore combining inversion methods for better reconstruction, advanced editing techniques to reduce editing failures, and improving the integration between LLMs and blob-grounded generation. |
diffusion_model, llm, compositional image generation, layout-guided generation, blob representation, masked cross-attention, zero-shot generation, image editing |
2404.07724 |
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models |
Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, Jaakko Lehtinen |
Guidance is a crucial technique for extracting the best performance out of
image-generating diffusion models. Traditionally, a constant guidance weight
has been applied throughout the sampling chain of an image. We show that
guidance is clearly harmful toward the beginning of the chain (high noise
levels), largely unnecessary toward the end (low noise levels), and only
beneficial in the middle. We thus restrict it to a specific range of noise
levels, improving both the inference speed and result quality. This limited
guidance interval improves the record FID in ImageNet-512 significantly, from
1.81 to 1.40. We show that it is quantitatively and qualitatively beneficial
across different sampler parameters, network architectures, and datasets,
including the large-scale setting of Stable Diffusion XL. We thus suggest
exposing the guidance interval as a hyperparameter in all diffusion models that
use guidance. |
This paper introduces a novel technique for enhancing image generation in diffusion models by strategically limiting the application of classifier-free guidance (CFG) to a specific interval of noise levels during the sampling process. |
This research is significant as it addresses the sub-optimal performance of traditional CFG, which applies a constant guidance weight throughout the sampling process, leading to limitations in image quality and inference speed. By confining CFG to a specific noise level range, this technique allows for higher guidance weights, resulting in substantial improvements in image fidelity and a reduction in computational cost. |
The authors begin by analyzing the impact of CFG at different noise levels using both a theoretical framework and empirical observations. They demonstrate that CFG is detrimental at high noise levels, largely unnecessary at low levels, and most beneficial in the middle stages of the sampling chain. Based on this insight, they propose a modified ODE for diffusion model sampling where guidance is applied only within a specific noise level interval. They evaluate their approach using quantitative metrics (FID, FDDINO) on ImageNet-512 and provide a qualitative analysis of the generated images using both ImageNet and Stable Diffusion XL. Ablation studies are performed to demonstrate the impact of varying guidance intervals and weights. |
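The core change is small enough to sketch directly: apply guidance only when the current noise level falls inside an interval. The snippet below is a generic illustration (the sigma thresholds and guidance weight are placeholders, not the paper's tuned values):
```python
def guided_denoise(model, x, sigma, cond, uncond, w=6.0, sigma_lo=0.3, sigma_hi=5.0):
    d_cond = model(x, sigma, cond)
    if sigma_lo < sigma < sigma_hi:          # guidance only in the middle of the chain
        d_uncond = model(x, sigma, uncond)
        return d_uncond + w * (d_cond - d_uncond)
    return d_cond                            # elsewhere: plain conditional prediction (w = 1)
```
Skipping the unconditional evaluation outside the interval also accounts for the inference speedup mentioned above.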
The proposed method achieves state-of-the-art FID scores on ImageNet-512, surpassing previous records by a significant margin. Notably, with their method, FID improves from 2.23 to 1.68 using EDM2-S and from 1.81 to 1.40 using EDM2-XXL. Qualitative results demonstrate that limiting the guidance interval preserves image diversity and reduces color saturation artifacts commonly observed with high guidance weights in standard CFG. The technique is shown to be effective across different sampler parameters, network architectures, and datasets, including Stable Diffusion XL. |
The authors acknowledge that while their method significantly improves performance, future work could explore automatically determining the optimal guidance interval directly from the ODE. Additionally, further research is needed to understand the role of non-ideal, trained denoisers in the context of this technique. |
diffusion_model, image_generation, classifier-free_guidance, sampling, fid, imagenet, stable diffusion xl |
2310.11868 |
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now |
Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu |
The recent advances in diffusion models (DMs) have revolutionized the
generation of realistic and complex images. However, these models also
introduce potential safety hazards, such as producing harmful content and
infringing data copyrights. Despite the development of safety-driven unlearning
techniques to counteract these challenges, doubts about their efficacy persist.
To tackle this issue, we introduce an evaluation framework that leverages
adversarial prompts to discern the trustworthiness of these safety-driven DMs
after they have undergone the process of unlearning harmful concepts.
Specifically, we investigated the adversarial robustness of DMs, assessed by
adversarial prompts, when eliminating unwanted concepts, styles, and objects.
We develop an effective and efficient adversarial prompt generation approach
for DMs, termed UnlearnDiffAtk. This method capitalizes on the intrinsic
classification abilities of DMs to simplify the creation of adversarial
prompts, thereby eliminating the need for auxiliary classification or diffusion
models. Through extensive benchmarking, we evaluate the robustness of five
widely-used safety-driven unlearned DMs (i.e., DMs after unlearning undesirable
concepts, styles, or objects) across a variety of tasks. Our results
demonstrate the effectiveness and efficiency merits of UnlearnDiffAtk over the
state-of-the-art adversarial prompt generation method and reveal the lack of
robustness of current safety-driven unlearning techniques when applied to DMs.
Codes are available at https://github.com/OPTML-Group/Diffusion-MU-Attack.
WARNING: This paper contains model outputs that may be offensive in nature. |
This paper focuses on the safety of diffusion models (DMs) for image generation. It introduces an adversarial attack method, called Diffusion-MU-Attack, to assess the robustness of 'unlearned' DMs, which are designed to mitigate the generation of harmful or undesired images. |
This paper is important because it tackles the critical issue of safety in DMs, highlighting potential vulnerabilities in existing safety-driven approaches. It provides a valuable evaluation framework and a novel attack method to help improve the robustness and trustworthiness of DMs, especially important given their rapid adoption and potential for misuse. |
The authors develop an adversarial prompt generation method, leveraging the concept of a 'diffusion classifier' inherent in well-trained DMs. This method optimizes text prompts to circumvent the safety mechanisms of unlearned DMs, compelling them to generate images containing the erased content. They evaluate their attack against several state-of-the-art unlearned DMs across three unlearning tasks: concept, style, and object unlearning. The effectiveness of the attack is measured by its success rate in generating images classified as containing the unlearned concepts, styles, or objects. |
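In spirit, one optimization step of such an attack can be sketched as follows; `denoiser` and `add_noise` are generic placeholders for the unlearned model and its noise schedule (not a specific library API), and the adversarial prompt embeddings are the only trainable quantity:
```python
import torch
import torch.nn.functional as F

def attack_step(denoiser, add_noise, num_timesteps, latents, prompt_emb, optimizer):
    # latents: encoded image that contains the supposedly erased concept
    # prompt_emb: learnable adversarial token embeddings (requires_grad=True)
    noise = torch.randn_like(latents)
    t = torch.randint(0, num_timesteps, (latents.shape[0],), device=latents.device)
    pred = denoiser(add_noise(latents, noise, t), t, prompt_emb)
    # A low denoising loss means the model implicitly "classifies" the prompt as
    # matching the erased concept, which is what the attack tries to achieve.
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```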
The results demonstrate the effectiveness of the proposed attack in bypassing the safety mechanisms of various unlearned DMs. Specifically, the attack successfully generates images classified as containing the erased concepts, styles, or objects with high success rates. Moreover, the attack is computationally efficient, as it does not require auxiliary diffusion or classification models. The results also reveal that current safety-driven unlearning techniques still lack robustness against adversarial prompt attacks. |
The authors acknowledge that their work primarily focuses on evaluating the robustness of unlearned DMs against adversarial prompts, leaving other attack vectors unexplored. They suggest future work could investigate the robustness against attacks on other aspects of DMs, such as the noise generation process or the latent image representation. Additionally, they emphasize the need for developing more robust unlearning methods for DMs to address the vulnerabilities exposed by their attack. |
diffusion_model, adversarial_attack, interpretability, safety, unlearning, machine_unlearning, robustness, text-to-image, image_generation |
2312.14135 |
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs |
Penghao Wu, Saining Xie |
When we look around and perform complex tasks, how we see and selectively
process what we see is crucial. However, the lack of this visual search
mechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on
important visual details, especially when handling high-resolution and visually
crowded images. To address this, we introduce V*, an LLM-guided visual search
mechanism that employs the world knowledge in LLMs for efficient visual
querying. When combined with an MLLM, this mechanism enhances collaborative
reasoning, contextual understanding, and precise targeting of specific visual
elements. This integration results in a new MLLM meta-architecture, named Show,
sEArch, and TelL (SEAL). We further create V*Bench, a benchmark specifically
designed to evaluate MLLMs in their ability to process high-resolution images
and focus on visual details. Our study highlights the necessity of
incorporating visual search capabilities into multimodal systems. The code is
available https://github.com/penghao-wu/vstar. |
This paper introduces SEAL, a novel framework that integrates an LLM-guided visual search mechanism into Multimodal Large Language Models (MLLMs) to enhance their visual grounding capabilities, especially for high-resolution images where details are crucial. |
This paper addresses the limitations of current MLLMs in handling high-resolution images due to their reliance on pre-trained vision encoders with limited resolution and their inability to actively search for missing visual information. This is important as it highlights the need for a more human-like approach to visual processing in MLLMs, enabling them to handle more complex real-world scenarios. |
The authors propose SEAL, a meta-architecture consisting of a VQA LLM and a visual search model. The VQA LLM identifies missing visual information, and the visual search model, guided by the LLM's world knowledge, efficiently locates and adds these details to a Visual Working Memory (VWM), enabling the VQA LLM to provide more informed answers. They also introduce V$^*$Bench, a benchmark to evaluate MLLMs on detailed visual grounding in high-resolution images. |
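The interaction between the VQA LLM, the search model, and the visual working memory can be summarized with the following loop sketch (all callables and the reply format are placeholders, not the released implementation):
```python
def seal_answer(vqa_llm, visual_search, image, question, max_rounds=3):
    memory = []                                   # visual working memory: (name, crop, box)
    for _ in range(max_rounds):
        reply = vqa_llm(image, question, memory)  # {"answer": ..., "missing": ...}
        if reply["missing"] is None:              # enough evidence to answer directly
            return reply["answer"]
        crop, box = visual_search(image, reply["missing"])
        memory.append((reply["missing"], crop, box))
    return vqa_llm(image, question, memory)["answer"]
```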
The SEAL framework significantly outperforms existing open-source and commercial MLLMs on the V$^*$Bench, demonstrating the effectiveness of incorporating a visual search mechanism. Their ablation studies further validate the importance of their LLM-guided search strategy over simple detection-based approaches. Additionally, their analysis on the COCO-Search18 dataset shows that their LLM-guided visual search achieves efficiency comparable to human eye fixations during visual search tasks. |
The authors acknowledge that their visual search model is currently designed for natural images and common objects, requiring further adaptation for handling documents, diagrams, videos, or open-world scenarios. They suggest exploring architectural improvements like incorporating convolution-based models for more efficient processing of high-resolution images. |
llm, multimodal, visual_search, vqa, high-resolution, benchmark, visual_grounding |
2309.17400 |
Directly Fine-Tuning Diffusion Models on Differentiable Rewards |
Kevin Clark, Paul Vicol, Kevin Swersky, David J Fleet |
We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method
for fine-tuning diffusion models to maximize differentiable reward functions,
such as scores from human preference models. We first show that it is possible
to backpropagate the reward function gradient through the full sampling
procedure, and that doing so achieves strong performance on a variety of
rewards, outperforming reinforcement learning-based approaches. We then propose
more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to
only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance
gradient estimates for the case when K=1. We show that our methods work well
for a variety of reward functions and can be used to substantially improve the
aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw
connections between our approach and prior work, providing a unifying
perspective on the design space of gradient-based fine-tuning algorithms. |
This paper introduces DRaFT, a family of methods for efficiently fine-tuning diffusion models to maximize differentiable reward functions, such as human preference scores, through backpropagation through the sampling process, leading to improved generation quality. |
This paper is important because it offers a more efficient and scalable alternative to reinforcement learning for aligning diffusion model outputs with human preferences or other complex objectives, which is crucial for deploying these models in real-world applications. |
The authors propose DRaFT, which backpropagates reward gradients through the sampling chain, using LoRA and gradient checkpointing for efficiency. They introduce two variants: DRaFT-K, truncating backpropagation to the last K steps, and DRaFT-LV, reducing gradient variance for K=1. They evaluate their methods on Stable Diffusion 1.4 using various reward functions, including aesthetic scores, human preferences (PickScore, HPSv2), and tasks like image compressibility and adversarial example generation. |
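The truncation in DRaFT-K can be sketched in a few lines: run most of the sampler without gradients and keep the computation graph only for the last K steps (function names here are placeholders):
```python
import torch

def draft_k_loss(sample_step, reward_fn, x_T, timesteps, K=1):
    x = x_T
    n = len(timesteps)
    with torch.no_grad():
        for t in timesteps[: n - K]:   # early steps: no gradient tracking
            x = sample_step(x, t)
    for t in timesteps[n - K:]:        # last K steps: keep the graph
        x = sample_step(x, t)
    return -reward_fn(x)               # maximizing reward = minimizing its negative
```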
DRaFT significantly outperforms RL methods in sample efficiency for maximizing aesthetic scores. DRaFT-LV achieves the best reward value on the HPSv2 benchmark, learning faster than other methods. The authors demonstrate the effectiveness of DRaFT on various tasks like generating compressible/incompressible images, manipulating object presence using object detectors, and creating adversarial examples. They also show that LoRA scaling allows for controlling the strength of fine-tuning and combining models trained with different rewards. |
The paper acknowledges the issue of reward hacking, where models exploit reward function limitations. Future work could explore addressing reward hacking and developing more robust reward functions. The authors also point to improving text alignment using powerful image captioning models as a potential research direction. |
diffusion_model, reward_learning, fine-tuning, human_preference, aesthetic, lora, gradient_checkpointing, image_generation, adversarial_example, interpretability |
2401.10219 |
Edit One for All: Interactive Batch Image Editing |
Thao Nguyen, Utkarsh Ojha, Yuheng Li, Haotian Liu, Yong Jae Lee |
In recent years, image editing has advanced remarkably. With increased human
control, it is now possible to edit an image in a plethora of ways; from
specifying in text what we want to change, to straight up dragging the contents
of the image in an interactive point-based manner. However, most of the focus
has remained on editing single images at a time. Whether and how we can
simultaneously edit large batches of images has remained understudied. With the
goal of minimizing human supervision in the editing process, this paper
presents a novel method for interactive batch image editing using StyleGAN as
the medium. Given an edit specified by users in an example image (e.g., make
the face frontal), our method can automatically transfer that edit to other
test images, so that regardless of their initial state (pose), they all arrive
at the same final state (e.g., all facing front). Extensive experiments
demonstrate that edits performed using our method have similar visual quality
to existing single-image-editing methods, while having more visual consistency
and saving significant time and human effort. |
This paper introduces a novel method for interactive batch image editing using StyleGAN, enabling the automatic transfer of user-specified edits from an example image to a batch of test images while maintaining consistency in the final edited state. |
This paper addresses the limitations of existing image editing techniques that primarily focus on single-image editing. It introduces the concept of interactive batch image editing, which significantly reduces human effort and time required for editing large image datasets while ensuring consistent results across images. |
The authors propose a two-stage approach. First, they model the user's edit in the latent space of StyleGAN by optimizing an editing direction that captures the desired change while being globally consistent across images. Second, they derive a closed-form solution to adjust the editing strength for each test image, ensuring that all edited images converge to the same final state as the user-edited example. |
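A toy version of the transfer idea (an illustration of the concept, not the paper's closed-form solution) is to estimate the edit direction from the example pair and choose a per-image strength so that every latent reaches the same coordinate along that direction:
```python
import numpy as np

def transfer_edit(w_example, w_example_edited, w_tests):
    d = w_example_edited - w_example               # user-specified edit in latent space
    d_hat = d / np.linalg.norm(d)
    target = w_example_edited @ d_hat              # shared "final state" along the direction
    edited = []
    for w in w_tests:
        alpha = target - w @ d_hat                 # per-image editing strength
        edited.append(w + alpha * d_hat)
    return edited
```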
The proposed method demonstrates superior performance in transferring various edits, such as point-based dragging and text-driven modifications, across different object categories like faces, animals, and human bodies. It achieves comparable visual quality to state-of-the-art single-image editing methods while being significantly faster and requiring minimal user annotation. |
The authors acknowledge limitations in capturing fine-grained details and handling semantic discrepancies between the example and test images. Future work includes extending the approach to diffusion-based models for wider edit types and addressing limitations related to out-of-distribution samples. |
diffusion_model, gan, image_editing, stylegan, batch_processing |
2405.05967 |
Distilling Diffusion Models into Conditional GANs |
Minguk Kang, Richard Zhang, Connelly Barnes, Sylvain Paris, Suha Kwak, Jaesik Park, Eli Shechtman, Jun-Yan Zhu, Taesung Park |
We propose a method to distill a complex multistep diffusion model into a
single-step conditional GAN student model, dramatically accelerating inference,
while preserving image quality. Our approach interprets diffusion distillation
as a paired image-to-image translation task, using noise-to-image pairs of the
diffusion model's ODE trajectory. For efficient regression loss computation, we
propose E-LatentLPIPS, a perceptual loss operating directly in diffusion
model's latent space, utilizing an ensemble of augmentations. Furthermore, we
adapt a diffusion model to construct a multi-scale discriminator with a text
alignment loss to build an effective conditional GAN-based formulation.
E-LatentLPIPS converges more efficiently than many existing distillation
methods, even accounting for dataset construction costs. We demonstrate that
our one-step generator outperforms cutting-edge one-step diffusion distillation
models - DMD, SDXL-Turbo, and SDXL-Lightning - on the zero-shot COCO benchmark. |
This paper introduces Diffusion2GAN, a method for distilling complex multi-step diffusion models into single-step conditional GANs, accelerating inference while preserving image quality by interpreting the process as a paired image-to-image translation task. |
This paper is important because it addresses the slow inference speed of diffusion models, a major limitation hindering their real-time application in areas like text-to-image synthesis, 3D modeling, and video generation. By enabling one-step generation without significant quality loss, it paves the way for more practical and interactive applications of these powerful models. |
The authors formulate diffusion distillation as a paired image-to-image translation problem, utilizing noise-to-image pairs from the diffusion model's ODE trajectory. They introduce E-LatentLPIPS, an efficient perceptual loss operating directly in the diffusion model's latent space, for effective regression. A multi-scale conditional discriminator with text alignment loss is also employed for enhanced performance. |
Diffusion2GAN outperforms state-of-the-art one-step diffusion distillation models (DMD, SDXL-Turbo, SDXL-Lightning) on zero-shot COCO benchmarks. E-LatentLPIPS demonstrates superior efficiency compared to traditional LPIPS, enabling larger batch sizes. The method's effectiveness is shown by distilling both Stable Diffusion 1.5 and the larger SDXL model, achieving impressive FID and CLIP scores. |
The paper acknowledges limitations in handling varying classifier-free guidance scales and the performance dependency on the teacher model. Future work could explore guided distillation techniques for CFG flexibility and leveraging real image-text pairs for surpassing teacher model limitations. Additionally, further investigation is needed to address the diversity drop observed when scaling up models. |
diffusion_model, gan, distillation, image_generation, text-to-image, perceptual_loss, latent_space, one-step_generation, inference_speed |
2310.17513 |
The Expressive Power of Low-Rank Adaptation |
Yuchen Zeng, Kangwook Lee |
Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that
leverages low-rank adaptation of weight matrices, has emerged as a prevalent
technique for fine-tuning pre-trained models such as large language models and
diffusion models. Despite its huge success in practice, the theoretical
underpinnings of LoRA have largely remained unexplored. This paper takes the
first step to bridge this gap by theoretically analyzing the expressive power
of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any
model $f$ to accurately represent any smaller target model $\overline{f}$ if
LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of
}\overline{f}}{\text{depth of }f}$. We also quantify the approximation error
when LoRA-rank is lower than the threshold. For Transformer networks, we show
any model can be adapted to a target model of the same size with
rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters. |
This paper provides the first theoretical analysis of the expressive power of Low-Rank Adaptation (LoRA) for adapting pre-trained Fully Connected Neural Networks (FNN) and Transformer Networks (TFN). It identifies the necessary LoRA-rank for exactly adapting a frozen model to match a target model and quantifies the approximation error when the LoRA-rank is lower than the required threshold. |
This paper is important because it provides the first known theoretical results on the expressive power of LoRA, a widely used and successful fine-tuning method. The findings contribute to understanding why LoRA is effective and offer insights for hyperparameter tuning and algorithm development. |
The authors used a theoretical approach, starting with linear model approximation as a simplified scenario and extending the results to FNN and TFN with ReLU activation and softmax. They identified the required LoRA-rank by proving the existence of low-rank adapters that enable the adapted model to precisely match or approximate the target model under certain assumptions. The theoretical findings are validated by experiments on both synthetic and real datasets. |
Key findings include: (1) LoRA can adapt any FNN to exactly represent any smaller target FNN if the LoRA-rank meets a certain threshold. (2) For TFNs, any model can be adapted to a target model of the same size with a rank equal to half the embedding size. (3) In both linear and FNN settings, the total number of parameters needed to achieve an exact approximation is constant regardless of the LoRA-rank assignment across layers. (4) LoRA can adapt randomly generated models to match the target model with fewer parameters than final layer tuning. |
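For reference, the two rank conditions stated above can be written in display form as:
\[
\text{FNN: exact adaptation to a smaller target } \overline{f} \text{ is possible whenever }
\operatorname{rank}_{\mathrm{LoRA}} \ge (\text{width of } f) \times \frac{\text{depth of } \overline{f}}{\text{depth of } f};
\qquad
\text{TFN: rank } \tfrac{\text{embedding size}}{2} \text{ suffices for a same-size target.}
\]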
Limitations include the potential suboptimality of the constructed LoRA adapters, the lack of approximation error quantification for TFNs when the rank is lower than required, and the simplification of TFN architecture. Future work includes quantifying approximation errors for TFNs with insufficient ranks, refining LoRA adapter update algorithms, and studying LoRA's expressive power under more general TFN architecture settings. |
lora, fine-tuning, fnn, tfn, analysis, expressive_power, approximation_error |
2311.17042 |
Adversarial Diffusion Distillation |
Axel Sauer, Dominik Lorenz, Andreas Blattmann, Robin Rombach |
We introduce Adversarial Diffusion Distillation (ADD), a novel training
approach that efficiently samples large-scale foundational image diffusion
models in just 1-4 steps while maintaining high image quality. We use score
distillation to leverage large-scale off-the-shelf image diffusion models as a
teacher signal in combination with an adversarial loss to ensure high image
fidelity even in the low-step regime of one or two sampling steps. Our analyses
show that our model clearly outperforms existing few-step methods (GANs, Latent
Consistency Models) in a single step and reaches the performance of
state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first
method to unlock single-step, real-time image synthesis with foundation models.
Code and weights available under
https://github.com/Stability-AI/generative-models and
https://huggingface.co/stabilityai/ . |
This paper introduces Adversarial Diffusion Distillation (ADD), a novel approach for training diffusion models that generates high-quality images in just 1-4 sampling steps by combining adversarial training with score distillation from a pre-trained diffusion model. |
This paper is important because it addresses the limitations of current diffusion models, particularly their slow inference speed due to the iterative sampling process, and offers a method for achieving real-time, high-quality image synthesis using foundation models. |
The authors train a student diffusion model using a hybrid loss function consisting of two components: an adversarial loss that forces the model to generate realistic images and a score distillation loss that leverages the knowledge of a pre-trained teacher diffusion model. The model is trained to generate images from noisy inputs at various timesteps, using the same diffusion coefficients as the student model. |
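Schematically, the student objective combines the two terms described above; the snippet below is a rough sketch (the weighting `lam`, the non-saturating GAN form, and using the teacher's denoised estimate as the distillation target are assumptions):
```python
import torch.nn.functional as F

def add_student_loss(student_img, teacher_pred, disc_logits_fake, lam=2.5):
    adv = F.softplus(-disc_logits_fake).mean()                 # adversarial (generator) term
    distill = F.mse_loss(student_img, teacher_pred.detach())   # score-distillation-style term
    return adv + lam * distill
```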
ADD outperforms existing few-step methods like Latent Consistency Models (LCMs) and GANs in single-step image synthesis. Notably, with four sampling steps, ADD-XL surpasses the performance of its teacher model, SDXL-Base, demonstrating its capability to generate high-fidelity images efficiently. |
The authors acknowledge the potential for exploring different distillation weighting functions and scheduling strategies for further performance improvement. Future work could also involve investigating the application of ADD to other domains such as video and 3D generation. |
diffusion_model, gan, distillation, image_generation, real-time, adversarial_training, score_distillation |
2312.04837 |
Localized Symbolic Knowledge Distillation for Visual Commonsense Models |
Jae Sung Park, Jack Hessel, Khyathi Raghavi Chandu, Paul Pu Liang, Ximing Lu, Peter West, Youngjae Yu, Qiuyuan Huang, Jianfeng Gao, Ali Farhadi, Yejin Choi |
Instruction following vision-language (VL) models offer a flexible interface
that supports a broad range of multimodal tasks in a zero-shot fashion.
However, interfaces that operate on full images do not directly enable the user
to "point to" and access specific regions within images. This capability is
important not only to support reference-grounded VL benchmarks, but also, for
practical applications that require precise within-image reasoning. We build
Localized Visual Commonsense models, which allow users to specify (multiple)
regions as input. We train our model by sampling localized commonsense
knowledge from a large language model (LLM): specifically, we prompt an LLM to
collect commonsense knowledge given a global literal image description and a
local literal region description automatically generated by a set of VL models.
With a separately trained critic model that selects high-quality examples, we
find that training on the localized commonsense corpus can successfully distill
existing VL models to support a reference-as-input interface. Empirical results
and human evaluations in a zero-shot setup demonstrate that our distillation
method results in more precise VL models of reasoning compared to a baseline of
passing a generated referring expression to an LLM. |
This paper introduces Localized Symbolic Knowledge Distillation (LSKD), a method for generating a localized visual commonsense corpus by prompting a large language model (LLM) with global and local image descriptions. This corpus is then used to train vision-language models that can accept region references as input, enabling more precise and context-aware reasoning within images. |
This paper is important because it addresses the limitations of existing vision-language models in performing localized reasoning within images. By enabling users to specify regions of interest, it paves the way for more intuitive and precise multimodal interactions. Furthermore, the paper demonstrates that machine-generated, localized visual commonsense corpora can be as effective as human-annotated datasets, opening new avenues for scalable and cost-effective model training. |
The authors propose a multi-stage approach: 1) Image-to-text verbalization of global image content, local region descriptions, and dynamic question-answer pairs. 2) Prompting an LLM (ChatGPT) to generate localized commonsense knowledge in a question-answer-rationale format. 3) Training a supervised critic model to filter out erroneous or low-quality generated instances. 4) Fine-tuning vision-language models (e.g., BLIP-2) on the filtered corpus for both discriminative and generative localized visual reasoning tasks. |
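The corpus-building stages can be summarized in a short pipeline sketch (all callables and the threshold are placeholders):
```python
def build_localized_corpus(images, verbalize, llm_generate_qar, critic, tau=0.5):
    corpus = []
    for img in images:
        context = verbalize(img)                  # global caption + per-region descriptions
        for qar in llm_generate_qar(context):     # question-answer-rationale triples from the LLM
            if critic(img, qar) >= tau:           # keep only instances the critic trusts
                corpus.append((img, qar))
    return corpus
```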
Key findings include: 1) Training on the LSKD corpus significantly improves the performance of vision-language models on localized visual reasoning benchmarks, outperforming baselines and even surpassing models trained on human-annotated data in some cases. 2) A supervised critic model effectively filters out erroneous instances, leading to improved downstream task performance. 3) Generative models fine-tuned with LSKD show promising results in localized question-answering, demonstrating the potential for more interactive and human-like multimodal communication. |
The authors acknowledge limitations such as the potential for verbalizer errors and the coverage of question types in the generated corpus. Future work could focus on developing more robust verbalization techniques, expanding the diversity of question types, and exploring more sophisticated critic models to further enhance the quality and coverage of the generated knowledge. |
llm, vision-language, knowledge_distillation, visual_commonsense, region_grounding, critic_model, multimodal |
2403.13043 |
When Do We Not Need Larger Vision Models? |
Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell |
Scaling up the size of vision models has been the de facto standard to obtain
more powerful visual representations. In this work, we discuss the point beyond
which larger vision models are not necessary. First, we demonstrate the power
of Scaling on Scales (S$^2$), whereby a pre-trained and frozen smaller vision
model (e.g., ViT-B or ViT-L), run over multiple image scales, can outperform
larger models (e.g., ViT-H or ViT-G) on classification, segmentation, depth
estimation, Multimodal LLM (MLLM) benchmarks, and robotic manipulation.
Notably, S$^2$ achieves state-of-the-art performance in detailed understanding
of MLLM on the V* benchmark, surpassing models such as GPT-4V. We examine the
conditions under which S$^2$ is a preferred scaling approach compared to
scaling on model size. While larger models have the advantage of better
generalization on hard examples, we show that features of larger vision models
can be well approximated by those of multi-scale smaller models. This suggests
most, if not all, of the representations learned by current large pre-trained
models can also be obtained from multi-scale smaller models. Our results show
that a multi-scale smaller model has comparable learning capacity to a larger
model, and pre-training smaller models with S$^2$ can match or even exceed the
advantage of larger models. We release a Python package that can apply S$^2$ on
any vision model with one line of code:
https://github.com/bfshi/scaling_on_scales. |
This paper explores the concept of "Scaling on Scales" (S^2) as a competitive alternative to increasing model size for enhancing visual representation in vision models, demonstrating that smaller models, when applied to multiple image scales, can outperform larger models in tasks like classification, segmentation, and depth estimation. |
This paper challenges the prevailing assumption that larger models are always necessary for better visual understanding, proposing a more efficient scaling method that achieves comparable or superior performance with fewer parameters and similar computational cost, which has significant implications for research directions and resource allocation. |
The authors introduce "S^2-Wrapper," a parameter-free mechanism extending pre-trained models to multi-scale feature extraction by splitting larger images into smaller sub-images and processing them independently before merging, and then conduct extensive experiments comparing S^2 with model size scaling across various tasks and datasets, including ImageNet, ADE20k, NYUv2, and robotic manipulation. |
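A simplified two-scale version of the wrapper idea (tensor shapes and the merge rule are assumptions; the released package may differ) looks like this:
```python
import torch
import torch.nn.functional as F

def s2_features(backbone, img, base=224):
    # backbone is frozen and returns (B, C, h, w) feature maps for a (B, 3, base, base) input
    f1 = backbone(F.interpolate(img, size=(base, base), mode="bilinear"))
    big = F.interpolate(img, size=(2 * base, 2 * base), mode="bilinear")
    crops = [big[..., i * base:(i + 1) * base, j * base:(j + 1) * base]
             for i in range(2) for j in range(2)]
    feats = [backbone(c) for c in crops]
    top = torch.cat(feats[:2], dim=-1)                         # reassemble the 2x2 grid
    bottom = torch.cat(feats[2:], dim=-1)
    f2 = torch.cat([top, bottom], dim=-2)
    f2 = F.interpolate(f2, size=f1.shape[-2:], mode="area")    # pool back to the base grid
    return torch.cat([f1, f2], dim=1)                          # channel-wise concat across scales
```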
The key finding is that smaller models with S^2 scaling often match or surpass larger models in performance across various tasks, particularly excelling in dense prediction tasks such as segmentation and depth estimation, and achieving state-of-the-art performance in multimodal LLM visual detail understanding by scaling image resolution to 1008^2. |
Limitations include the weaker generalization of smaller models pre-trained on a single scale compared to larger models on hard examples, and future work points towards exploring scale-selective processing for efficiency and enabling parallel processing of a single image for latency-critical scenarios. |
vision_transformer, multi-scale, model_scaling, mllm, segmentation, depth_estimation, analysis |
2308.07686 |
Boosting Multi-modal Model Performance with Adaptive Gradient Modulation |
Hong Li, Xingyu Li, Pengbo Hu, Yinuo Lei, Chunxiao Li, Yi Zhou |
While the field of multi-modal learning keeps growing fast, the deficiency of
the standard joint training paradigm has become clear through recent studies.
They attribute the sub-optimal performance of the jointly trained model to the
modality competition phenomenon. Existing works attempt to improve the jointly
trained model by modulating the training process. Despite their effectiveness,
those methods can only apply to late fusion models. More importantly, the
mechanism of the modality competition remains unexplored. In this paper, we
first propose an adaptive gradient modulation method that can boost the
performance of multi-modal models with various fusion strategies. Extensive
experiments show that our method surpasses all existing modulation methods.
Furthermore, to have a quantitative understanding of the modality competition
and the mechanism behind the effectiveness of our modulation method, we
introduce a novel metric to measure the competition strength. This metric is
built on the mono-modal concept, a function that is designed to represent the
competition-less state of a modality. Through systematic investigation, our
results confirm the intuition that the modulation encourages the model to rely
on the more informative modality. In addition, we find that the jointly trained
model typically has a preferred modality on which the competition is weaker
than other modalities. However, this preferred modality need not dominate
others. Our code will be available at
https://github.com/lihong2303/AGM_ICCV2023. |
This paper proposes Adaptive Gradient Modulation (AGM), a novel method for enhancing the performance of multi-modal learning models by adaptively controlling the gradient flow during training to mitigate modality competition. |
This work is important because it addresses the sub-optimal performance of standard joint training in multi-modal learning, particularly the issue of modality competition where a dominant modality hinders the learning of other modalities. It provides a novel solution (AGM) applicable to various fusion strategies and offers insights into the dynamics of modality competition. |
The authors develop AGM, which utilizes Shapley value-based attribution to isolate mono-modal responses and adaptively modulates the gradients of individual modalities during back-propagation. They introduce the concept of "mono-modal concept" to represent the ideal, competition-less state of a modality and use it to quantify the competition strength. Experiments are conducted on five multi-modal datasets (AV-MNIST, CREMA-D, UR-Funny, AVE, CMU-MOSEI) with varying fusion strategies, modalities, and network architectures to evaluate AGM's effectiveness and analyze modality competition. |
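The Shapley-based modulation rule itself is beyond a short snippet, but the general mechanism of per-modality gradient modulation can be illustrated generically (this toy is not the paper's algorithm; the coefficients stand in for whatever the modulation rule produces):
```python
def modulate_gradients(modality_params, coefficients):
    # modality_params: dict mapping modality name -> iterable of its encoder parameters
    # coefficients:    dict mapping modality name -> positive scale (<1 slows a dominant modality)
    for name, params in modality_params.items():
        k = coefficients[name]
        for p in params:
            if p.grad is not None:
                p.grad.mul_(k)

# Typical placement: call between loss.backward() and optimizer.step().
```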
The key findings demonstrate that AGM consistently outperforms existing modulation methods and significantly improves multi-modal models' accuracy across different datasets and architectures. The analysis reveals that AGM encourages models to leverage more informative modalities and mitigates the model's inherent bias towards specific modalities during training. The paper also establishes that modality competition is prevalent in multi-modal models, often with a "preferred modality" that the model tends to exploit. The strength of modality competition is found to be largely independent of the fusion strategy and modality type but appears to be influenced by the specific task and data characteristics. |
The paper acknowledges the need for further investigation into the relationship between modality competition strength, modality information content, and data characteristics. Future work could explore more sophisticated methods for defining and utilizing the "mono-modal concept" and investigate the role of higher-order interactions among modalities in shaping competition dynamics. |
diffusion_model, analysis, multi-modal learning, modality competition, gradient modulation, shapley value, fusion strategies |
2308.08089 |
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory |
Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan |
Controllable video generation has gained significant attention in recent
years. However, two main limitations persist: Firstly, most existing works
focus on either text, image, or trajectory-based control, leading to an
inability to achieve fine-grained control in videos. Secondly, trajectory
control research is still in its early stages, with most experiments being
conducted on simple datasets like Human3.6M. This constraint limits the models'
capability to process open-domain images and effectively handle complex curved
trajectories. In this paper, we propose DragNUWA, an open-domain
diffusion-based video generation model. To tackle the issue of insufficient
control granularity in existing works, we simultaneously introduce text, image,
and trajectory information to provide fine-grained control over video content
from semantic, spatial, and temporal perspectives. To resolve the problem of
limited open-domain trajectory control in current research, we propose
trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable
open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to
control trajectories in different granularities, and an Adaptive Training (AT)
strategy to generate consistent videos following trajectories. Our experiments
validate the effectiveness of DragNUWA, demonstrating its superior performance
in fine-grained control in video generation. The homepage link is
\url{https://www.microsoft.com/en-us/research/project/dragnuwa/} |
DragNUWA is an open-domain, diffusion-based video generation model that introduces fine-grained control over video content using text, image, and trajectory inputs, focusing on addressing limitations in trajectory control for open-domain scenarios. |
This paper is important as it tackles two key limitations in existing controllable video generation models: lack of fine-grained control and limited ability to handle complex trajectories in open-domain settings. DragNUWA's innovative approach, including Trajectory Sampler, Multiscale Fusion, and Adaptive Training, allows for more comprehensive and user-friendly control over video generation, opening new avenues for creative applications. |
DragNUWA leverages a diffusion-based model with a multi-stage training process. First, it uses a Trajectory Sampler to extract diverse trajectories from open-domain videos. Then, a Multiscale Fusion module integrates text, image, and trajectory data at different resolutions within the UNet architecture. Finally, Adaptive Training progressively adapts the model from dense optical flow conditions to user-defined sparse trajectories, ensuring stability and consistency in video generation. |
DragNUWA demonstrates superior performance in fine-grained video generation. It can effectively control complex object trajectories, including curved paths and varying motion amplitudes, as well as handle camera movements like zooming in and out. The model highlights the importance of combining text, image, and trajectory inputs for achieving comprehensive control over semantic, spatial, and temporal aspects of video content. |
The paper does not explicitly mention limitations but implies that incorporating video as a condition is beyond the scope of this research. Future work could explore the integration of video conditions for potential advancements in style transfer. Additionally, the paper primarily focuses on visual fidelity and controllability; investigating and improving the model's ability to generate temporally consistent and logically sound narratives could be a valuable direction for future research. |
diffusion_model, video, motion, controllable_generation, trajectory, open-domain |
2312.10835 |
Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models |
Nikita Starodubcev, Artem Fedorov, Artem Babenko, Dmitry Baranchuk |
Knowledge distillation methods have recently shown to be a promising
direction to speedup the synthesis of large-scale diffusion models by requiring
only a few inference steps. While several powerful distillation methods were
recently proposed, the overall quality of student samples is typically lower
compared to the teacher ones, which hinders their practical usage. In this
work, we investigate the relative quality of samples produced by the teacher
text-to-image diffusion model and its distilled student version. As our main
empirical finding, we discover that a noticeable portion of student samples
exhibit superior fidelity compared to the teacher ones, despite the
"approximate" nature of the student. Based on this finding, we propose an
adaptive collaboration between student and teacher diffusion models for
effective text-to-image synthesis. Specifically, the distilled model produces
the initial sample, and then an oracle decides whether it needs further
improvements with a slow teacher model. Extensive experiments demonstrate that
the designed pipeline surpasses state-of-the-art text-to-image alternatives for
various inference budgets in terms of human preference. Furthermore, the
proposed approach can be naturally used in popular applications such as
text-guided image editing and controllable generation. |
This paper explores a novel approach to text-to-image synthesis using an adaptive collaboration framework between a distilled student diffusion model and a teacher diffusion model. |
The paper addresses the limitations of existing distillation methods for diffusion models, which often compromise image quality while achieving faster inference. The proposed approach aims to combine the efficiency of distilled models with the high fidelity of teacher models, potentially leading to a new paradigm in text-to-image generation. |
The authors first analyze the performance of distilled text-to-image models and observe that a significant portion of the samples generated by students can be superior to the teacher. Based on this, they propose an adaptive pipeline where the student model generates an initial sample. An oracle, implemented using an image quality estimator (ImageReward), then decides whether to refine this sample further using the teacher model. This decision is made based on a learned threshold. The refinement process can be either a regeneration of the sample from scratch using the teacher or a refinement of the student's output. |
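The decision logic reduces to a few lines; the sketch below uses placeholder callables and leaves the refine-versus-regenerate choice to the teacher wrapper:
```python
def adaptive_generate(student, teacher, reward_model, prompt, tau):
    img = student(prompt)                  # fast distilled model proposes a sample
    if reward_model(prompt, img) >= tau:   # oracle: quality estimate above the learned threshold
        return img
    return teacher(prompt, init=img)       # otherwise improve it with the slow teacher
```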
The proposed adaptive collaboration framework outperforms existing text-to-image baselines in terms of both human preference and automated metrics (FID, CLIP score, ImageReward) under various inference budgets. The method achieves a 2.5x to 5x speedup compared to standard diffusion models while maintaining or even surpassing their quality. Furthermore, the approach is successfully applied to text-guided image editing and controllable generation tasks, demonstrating its versatility and potential for broader applications. |
The authors acknowledge the limitations of current automated image quality estimators as a potential bottleneck for their approach. Future work could focus on developing more accurate estimators that better correlate with human preferences. Additionally, investigating the applicability of other fast text-to-image generation methods besides distillation, such as GANs, within their adaptive framework is suggested. |
diffusion_model, gan, analysis, image_generation, knowledge_distillation, text-to-image |
2308.16582 |
Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images |
Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, Hang Xu |
Stable diffusion, a generative model used in text-to-image synthesis,
frequently encounters resolution-induced composition problems when generating
images of varying sizes. This issue primarily stems from the model being
trained on pairs of single-scale images and their corresponding text
descriptions. Moreover, direct training on images of unlimited sizes is
unfeasible, as it would require an immense number of text-image pairs and
entail substantial computational expenses. To overcome these challenges, we
propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to
efficiently generate well-composed images of any size, while minimizing the
need for high-memory GPU resources. Specifically, the initial stage, dubbed Any
Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a
restricted range of ratios to optimize the text-conditional diffusion model,
thereby improving its ability to adjust composition to accommodate diverse
image sizes. To support the creation of images at any desired size, we further
introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the
subsequent stage. This method allows for the rapid enlargement of the ASD
output to any high-resolution size, avoiding seaming artifacts or memory
overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks
demonstrate that ASD can produce well-structured images of arbitrary sizes,
cutting down the inference time by 2x compared to the traditional tiled
algorithm. |
This paper introduces Any-Size-Diffusion (ASD), a two-stage pipeline designed to generate well-composed images of arbitrary sizes from text prompts, addressing the resolution-induced composition problems in existing text-to-image synthesis models. |
This paper is important because it tackles the limitation of existing text-to-image models like Stable Diffusion, which often struggle to maintain good composition when generating images at different resolutions. The proposed ASD model allows for flexible image size generation while preserving compositional quality, significantly enhancing the capabilities of text-to-image synthesis. |
The ASD pipeline works in two stages: 1) Any Ratio Adaptability Diffusion (ARAD) is trained on multi-aspect ratio images to generate an image based on text prompt and size, minimizing composition issues. 2) Fast Seamless Tiled Diffusion (FSTD) enlarges the ARAD output to any desired size using a novel implicit overlap technique during tiled sampling, ensuring both speed and seamless image magnification. |
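One way to realize seam-free tiled sampling without storing explicitly overlapping tiles is to randomly shift the tile grid at every denoising step, so tile borders never stay in the same place; the sketch below is an interpretation of that idea under simplifying assumptions (H and W divisible by the tile size), not the exact FSTD procedure:
```python
import torch

def tiled_denoise_step(denoise_fn, latents, t, tile=64):
    _, _, H, W = latents.shape
    dy, dx = torch.randint(0, tile, (2,)).tolist()             # new grid offset each step
    shifted = torch.roll(latents, shifts=(dy, dx), dims=(-2, -1))
    out = torch.zeros_like(shifted)
    for y in range(0, H, tile):
        for x in range(0, W, tile):
            out[..., y:y + tile, x:x + tile] = denoise_fn(shifted[..., y:y + tile, x:x + tile], t)
    return torch.roll(out, shifts=(-dy, -dx), dims=(-2, -1))   # undo the shift
```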
ASD demonstrates superior performance in generating well-composed images of arbitrary sizes, confirmed through quantitative and qualitative evaluation. Experiments show ASD achieves a 33.49 reduction in FID score compared to the baseline Stable Diffusion model and generates images up to 9 times higher resolution on the same hardware. The implicit overlap in FSTD effectively addresses seaming artifacts common in tiled diffusion methods, achieving high-fidelity image magnification while maintaining a speed comparable to non-overlapping tiling. |
The paper acknowledges a potential limitation in the computational cost associated with increasing the number of tiles in FSTD for higher resolutions. Future work could explore optimization strategies to mitigate this, further enhancing the model's efficiency. Additionally, the authors suggest exploring the application of ASD in other domains such as video generation and 3D object synthesis. |
diffusion_model, text-to-image, image_synthesis, super-resolution, compositionality, tiled_diffusion |
2402.12354 |
LoRA+: Efficient Low Rank Adaptation of Large Models |
Soufiane Hayou, Nikhil Ghosh, Bin Yu |
In this paper, we show that Low Rank Adaptation (LoRA) as originally
introduced in Hu et al. (2021) leads to suboptimal finetuning of models with
large width (embedding dimension). This is due to the fact that adapter
matrices A and B in LoRA are updated with the same learning rate. Using scaling
arguments for large width networks, we demonstrate that using the same learning
rate for A and B does not allow efficient feature learning. We then show that
this suboptimality of LoRA can be corrected simply by setting different
learning rates for the LoRA adapter matrices A and B with a well-chosen ratio.
We call this proposed algorithm LoRA$+$. In our extensive experiments, LoRA$+$
improves performance (1-2 $\%$ improvements) and finetuning speed (up to $\sim$
2X SpeedUp), at the same computational cost as LoRA. |
This paper investigates the efficiency of Low Rank Adaptation (LoRA) for finetuning large language models and identifies suboptimal feature learning when using the same learning rate for adapter matrices A and B, especially in models with large embedding dimensions. |
The paper is important because it provides theoretical insights into the optimal setting of learning rates for LoRA, a widely used technique for efficient finetuning of large language models, and proposes a simple yet effective improvement called LoRA+. |
The authors utilize scaling arguments, analyzing the behavior of LoRA in the infinite-width limit. They study a simplified linear model and then extend their analysis to general neural architectures with LoRA layers, demonstrating the inefficiency of using equal learning rates for A and B and deriving optimal scaling rules for these learning rates. |
The key finding is that setting different learning rates for the LoRA adapter matrices A and B, specifically η_A = Θ(n^-1) and η_B = Θ(1), leads to efficient feature learning in the infinite-width limit. Empirically, they show that LoRA+ with a learning rate ratio of η_B/η_A ≈ 2^4 consistently improves finetuning speed and performance on various tasks and language models, including GPT-2, RoBERTa, and LLama-7b. |
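A minimal sketch of the LoRA+ recipe under these assumptions: adapter parameters are routed into separate optimizer groups by name, with the B-matrix learning rate set to a fixed multiple (≈16) of the A-matrix rate; the `lora_A`/`lora_B` name matching follows common PEFT conventions and may need adjusting for other codebases.
```python
import torch

def loraplus_optimizer(model, base_lr=2e-4, ratio=16.0, weight_decay=0.0):
    """AdamW with separate learning rates for LoRA A and B matrices (LoRA+ recipe)."""
    group_a, group_b, rest = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "lora_A" in name:
            group_a.append(p)
        elif "lora_B" in name:
            group_b.append(p)
        else:
            rest.append(p)
    return torch.optim.AdamW(
        [
            {"params": group_a, "lr": base_lr},           # eta_A
            {"params": group_b, "lr": base_lr * ratio},   # eta_B = ratio * eta_A
            {"params": rest, "lr": base_lr},
        ],
        weight_decay=weight_decay,
    )
```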
The paper acknowledges limitations in precisely determining the optimal learning rate ratio for different tasks and models, suggesting that the ratio is task and model dependent. Future work could involve a more refined analysis to estimate the optimal ratio based on task and model characteristics, potentially leading to further performance improvements. |
llm, analysis, finetuning, lora, optimization |
2311.10093 |
The Chosen One: Consistent Characters in Text-to-Image Diffusion Models |
Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, Dani Lischinski |
Recent advances in text-to-image generation models have unlocked vast
potential for visual creativity. However, these models struggle with generation
of consistent characters, a crucial aspect for numerous real-world applications
such as story visualization, game development asset design, advertising, and
more. Current methods typically rely on multiple pre-existing images of the
target character or involve labor-intensive manual processes. In this work, we
propose a fully automated solution for consistent character generation, with
the sole input being a text prompt. We introduce an iterative procedure that,
at each stage, identifies a coherent set of images sharing a similar identity
and extracts a more consistent identity from this set. Our quantitative
analysis demonstrates that our method strikes a better balance between prompt
alignment and identity consistency compared to the baseline methods, and these
findings are reinforced by a user study. To conclude, we showcase several
practical applications of our approach. Project page is available at
https://omriavrahami.com/the-chosen-one |
The paper proposes a fully automated method for generating consistent characters in different contexts using text-to-image diffusion models, taking only a text prompt as input. |
This paper addresses a crucial limitation in current text-to-image models: the inability to generate consistent characters across various scenes, which is important for storytelling, game development, and other creative applications. The proposed method offers a fully automated solution, unlike existing manual or limited approaches. |
The method iteratively refines the representation of a character. It generates a gallery of images from a text prompt, embeds them in a feature space (using DINOv2), clusters the embeddings, and chooses the most cohesive cluster. This cluster is used to personalize a text-to-image model (SDXL) via textual inversion and LoRA, yielding a refined character representation. The process is repeated until convergence, ensuring consistent character generation in diverse contexts. |
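A hedged sketch of the cluster-selection step, assuming embeddings come from a feature extractor such as DINOv2, clustering uses k-means, and cohesion is scored by mean pairwise cosine similarity; the paper's exact criterion may differ.
```python
import numpy as np
from sklearn.cluster import KMeans

def most_cohesive_cluster(embeddings, n_clusters=5):
    """Cluster image embeddings and return indices of the most cohesive cluster.

    Cohesion here is mean pairwise cosine similarity within a cluster; this is a
    sketch of the selection step, not the authors' exact implementation.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(emb)

    best_idx, best_score = None, -np.inf
    for k in range(n_clusters):
        members = np.where(labels == k)[0]
        if len(members) < 2:
            continue
        sims = emb[members] @ emb[members].T
        score = (sims.sum() - len(members)) / (len(members) * (len(members) - 1))
        if score > best_score:
            best_idx, best_score = members, score
    return best_idx  # images used to personalize the model in this iteration
```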
The method effectively balances prompt adherence and identity consistency compared to baselines like Textual Inversion, LoRA DreamBooth, ELITE, BLIP-diffusion, and IP-adapter. Quantitative analysis and a user study confirm its effectiveness in generating diverse depictions of consistent characters. |
The authors acknowledge limitations such as occasional inconsistencies in identity, challenges with consistent supporting characters, potential for spurious attributes, high computational cost, and tendency to generate simplistic scenes. They suggest future work on reducing these limitations and exploring broader applications like story generation and interactive character design. |
diffusion_model, consistent_character, personalization, text-to-image, clustering, analysis, user_study, sdxl, dinov2 |
2402.11411 |
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning |
Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao |
Instruction-following Vision Large Language Models (VLLMs) have achieved
significant progress recently on a variety of tasks. These approaches merge
strong pre-trained vision models and large language models (LLMs). Since these
components are trained separately, the learned representations need to be
aligned with joint training on additional image-language pairs. This procedure
is not perfect and can cause the model to hallucinate - provide answers that do
not accurately reflect the image, even when the core LLM is highly factual and
the vision backbone has sufficiently complete representations. In this work, we
frame the hallucination problem as an alignment issue and tackle it with
preference tuning. Specifically, we propose POVID to generate feedback data
with AI models. We use ground-truth instructions as the preferred response and
a two-stage approach to generate dispreferred data. First, we prompt GPT-4V to
inject plausible hallucinations into the correct answer. Second, we distort the
image to trigger the inherent hallucination behavior of the VLLM. This is an
automated approach, which does not rely on human data generation or require a
perfect expert, which makes it easily scalable. Finally, both of these
generation strategies are integrated into an RLHF pipeline via Direct
Preference Optimization. In experiments across broad benchmarks, we show that
we can not only reduce hallucinations, but improve model performance across
standard benchmarks, outperforming prior approaches. Our data and code are
available at https://github.com/YiyangZhou/POVID. |
This paper introduces POVID, a novel approach for aligning image and text modalities in Vision Large Language Models (VLLMs) to mitigate hallucination issues using AI-generated dispreferences for preference tuning. |
This paper addresses the significant problem of hallucinations in VLLMs, where the model generates text that doesn't accurately reflect the image content. This is crucial for deploying VLLMs in real-world applications where accuracy is paramount. |
The authors propose POVID, a two-stage approach. First, they utilize GPT-4V to create plausible hallucinations in ground-truth image captions and reasoning tasks, generating dispreferred responses. Second, they introduce noise into the input images during training to trigger inherent VLLM hallucination patterns, further improving modality alignment using a modified DPO loss. |
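For reference, a generic DPO loss on sequence log-probabilities looks like the sketch below; POVID builds on this objective (plugging AI-generated hallucinated answers in as the dispreferred responses and adding image distortion during training), so treat it only as the standard starting point.
```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pref, logp_dispref, ref_logp_pref, ref_logp_dispref, beta=0.1):
    """Standard Direct Preference Optimization loss.

    Inputs are summed log-probabilities of the preferred / dispreferred responses
    under the policy and the frozen reference model.
    """
    pref_ratio = logp_pref - ref_logp_pref
    dispref_ratio = logp_dispref - ref_logp_dispref
    return -F.logsigmoid(beta * (pref_ratio - dispref_ratio)).mean()
```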
POVID significantly outperforms previous VLLM preference tuning methods, achieving a 31.78% improvement on hallucination benchmarks and consistent gains on comprehensive VLLM benchmarks. It effectively reduces hallucinations and shows superior performance in image captioning and detailed description tasks. |
The paper doesn't explicitly mention limitations. Future work could explore different noise injection techniques, expand to other VLLM architectures, and investigate the generalization of POVID to other multimodal tasks beyond image captioning and reasoning. |
llm, vllm, hallucination, alignment, image_captioning, reasoning, preference_tuning, dpo |
2309.07986 |
Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models |
James Burgess, Kuan-Chieh Wang, Serena Yeung |
Text-to-image diffusion models understand spatial relationship between
objects, but do they represent the true 3D structure of the world from only 2D
supervision? We demonstrate that yes, 3D knowledge is encoded in 2D image
diffusion models like Stable Diffusion, and we show that this structure can be
exploited for 3D vision tasks. Our method, Viewpoint Neural Textual Inversion
(ViewNeTI), controls the 3D viewpoint of objects in generated images from
frozen diffusion models. We train a small neural mapper to take camera
viewpoint parameters and predict text encoder latents; the latents then
condition the diffusion generation process to produce images with the desired
camera viewpoint.
ViewNeTI naturally addresses Novel View Synthesis (NVS). By leveraging the
frozen diffusion model as a prior, we can solve NVS with very few input views;
we can even do single-view novel view synthesis. Our single-view NVS
predictions have good semantic details and photorealism compared to prior
methods. Our approach is well suited for modeling the uncertainty inherent in
sparse 3D vision problems because it can efficiently generate diverse samples.
Our view-control mechanism is general, and can even change the camera view in
images generated by user-defined prompts. |
This paper introduces Viewpoint Neural Textual Inversion (ViewNeTI), a method for controlling the viewpoint of objects in images generated by text-to-image diffusion models, enabling novel view synthesis from as little as a single input view. |
This paper is important because it demonstrates that 2D diffusion models, despite being trained on unposed images, encode 3D structural knowledge that can be leveraged for 3D vision tasks like novel view synthesis, even with very limited 3D supervision. |
The authors train a small neural network, the view-mapper, to predict text encoder latents based on camera viewpoint parameters. These latents, along with object-specific latents, condition a frozen diffusion model (Stable Diffusion) to generate images from desired viewpoints. They explore single-scene training for viewpoint interpolation and multi-scene pretraining for generalization to novel scenes and single-view synthesis. |
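A minimal sketch of the view-mapper idea, assuming a flattened camera pose as input and a single pseudo-token in the text-encoder embedding space as output; the paper's mapper differs in its exact conditioning and output parameterization.
```python
import torch
import torch.nn as nn

class ViewMapper(nn.Module):
    """Tiny MLP mapping camera parameters to a pseudo-token in the text-encoder
    embedding space (a sketch of the idea behind ViewNeTI's view-mapper)."""

    def __init__(self, cam_dim=12, hidden=256, token_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cam_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, camera_params):
        # camera_params: (batch, cam_dim), e.g. a flattened camera pose matrix
        return self.net(camera_params)  # (batch, token_dim) conditioning latent
```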
ViewNeTI achieves impressive results for novel view synthesis, especially in the challenging single-view setting. It generates photorealistic images with plausible semantics, outperforming baselines in terms of visual quality and certain metrics like LPIPS. The method also demonstrates potential for viewpoint control in text-to-image generation. |
The paper acknowledges limitations in object localization, which affects PSNR scores, and struggles with generating precise object details. Future work could address these limitations, explore faster inference for object token optimization, and investigate applying the framework to other 3D tasks like relighting and 2D-to-3D lifting. |
diffusion_model, novel_view_synthesis, textual_inversion, 3d, single-view, viewpoint_control, stable diffusion |
2404.19756 |
KAN: Kolmogorov-Arnold Networks |
Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, Max Tegmark |
Inspired by the Kolmogorov-Arnold representation theorem, we propose
Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer
Perceptrons (MLPs). While MLPs have fixed activation functions on nodes
("neurons"), KANs have learnable activation functions on edges ("weights").
KANs have no linear weights at all -- every weight parameter is replaced by a
univariate function parametrized as a spline. We show that this seemingly
simple change makes KANs outperform MLPs in terms of accuracy and
interpretability. For accuracy, much smaller KANs can achieve comparable or
better accuracy than much larger MLPs in data fitting and PDE solving.
Theoretically and empirically, KANs possess faster neural scaling laws than
MLPs. For interpretability, KANs can be intuitively visualized and can easily
interact with human users. Through two examples in mathematics and physics,
KANs are shown to be useful collaborators helping scientists (re)discover
mathematical and physical laws. In summary, KANs are promising alternatives for
MLPs, opening opportunities for further improving today's deep learning models
which rely heavily on MLPs. |
This paper introduces Kolmogorov-Arnold Networks (KANs), a novel neural network architecture inspired by the Kolmogorov-Arnold representation theorem, as a promising alternative to Multi-Layer Perceptrons (MLPs) for function approximation, featuring learnable activation functions on edges. |
The paper is important because it challenges the dominance of MLPs in deep learning by presenting KANs as a more accurate and interpretable alternative, especially in scientific domains. KANs exhibit faster neural scaling laws, better handle the curse of dimensionality for functions with compositional structure, and offer improved interpretability, potentially making them valuable for AI-driven scientific discovery. |
The authors generalize the Kolmogorov-Arnold representation theorem to arbitrary network depths and widths. They parameterize each weight in the network as a learnable 1D spline function, allowing for fine-grained control over function approximation. The paper includes extensive experiments on toy datasets, special functions, Feynman equations, partial differential equations, and real-world scientific datasets in knot theory and condensed matter physics to demonstrate KANs' advantages in accuracy and interpretability. The authors also propose simplification techniques like sparsity regularization and pruning to enhance interpretability. |
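A simplified sketch of a single KAN layer, y_j = Σ_i φ_ij(x_i), with a learnable univariate function on every edge; Gaussian radial basis functions stand in here for the paper's B-spline parameterization, so this is illustrative rather than the reference design.
```python
import torch
import torch.nn as nn

class SimpleKANLayer(nn.Module):
    """One KAN layer: y_j = sum_i phi_ij(x_i), one learnable function per edge."""

    def __init__(self, in_dim, out_dim, n_basis=8, x_min=-2.0, x_max=2.0):
        super().__init__()
        self.register_buffer("centers", torch.linspace(x_min, x_max, n_basis))
        self.width = (x_max - x_min) / (n_basis - 1)
        # one coefficient vector per edge (in_dim * out_dim edges)
        self.coef = nn.Parameter(torch.randn(in_dim, out_dim, n_basis) * 0.1)

    def forward(self, x):                          # x: (batch, in_dim)
        basis = torch.exp(-((x[..., None] - self.centers) / self.width) ** 2)
        # (batch, in, basis) x (in, out, basis) -> (batch, out), summing over edges
        return torch.einsum("bik,iok->bo", basis, self.coef)
```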
KANs consistently outperform MLPs in terms of accuracy and parameter efficiency across various tasks, including function fitting, PDE solving, and symbolic regression. Their test loss scales favorably with the number of parameters, approaching the theoretically predicted scaling exponent. KANs demonstrate an ability to learn complex functions, including special functions and phase transition boundaries. They can be simplified and visualized to reveal underlying compositional structures and enable symbolic regression with human interaction. In applications to scientific datasets, KANs rediscover known mathematical relations in knot theory and uncover mobility edges in condensed matter physics, highlighting their potential for AI-driven scientific discovery. |
The authors acknowledge that the mathematical understanding of deeper KANs is limited and propose a generalized Kolmogorov-Arnold theorem as future work. Algorithmically, they identify potential improvements in accuracy, efficiency, and training strategies, including adaptive grids and hybrid KAN-MLP architectures. They also suggest expanding KAN applications to other scientific domains and integrating them into existing architectures like transformers. A key limitation is the current slow training speed of KANs compared to MLPs. |
analysis, interpretability, neural_scaling_law, pde, scientific_discovery, symbolic_regression |
2401.08740 |
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers |
Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, Saining Xie |
We present Scalable Interpolant Transformers (SiT), a family of generative
models built on the backbone of Diffusion Transformers (DiT). The interpolant
framework, which allows for connecting two distributions in a more flexible way
than standard diffusion models, makes possible a modular study of various
design choices impacting generative models built on dynamical transport: using
discrete vs. continuous time learning, deciding the objective for the model to
learn, choosing the interpolant connecting the distributions, and deploying a
deterministic or stochastic sampler. By carefully introducing the above
ingredients, SiT surpasses DiT uniformly across model sizes on the conditional
ImageNet 256x256 benchmark using the exact same backbone, number of parameters,
and GFLOPs. By exploring various diffusion coefficients, which can be tuned
separately from learning, SiT achieves an FID-50K score of 2.06. |
This paper introduces Scalable Interpolant Transformers (SiT), a class of generative models based on Diffusion Transformers (DiT) that leverage stochastic interpolants to achieve improved performance in image generation. |
This work is important because it provides a detailed analysis of the design choices involved in building generative models based on dynamical transport, potentially leading to more efficient and higher-performing models. Specifically, it demonstrates a consistent performance gain over DiT by carefully selecting the interpolant connecting data and noise distributions, and by choosing to learn the velocity field of the interpolating process instead of the score. |
The authors start with the DDPM framework and systematically analyze the effects of different design choices on the ImageNet 256x256 benchmark. They experiment with discrete vs. continuous time learning, predicting velocity vs. score, using different interpolants like linear and generalized variance-preserving (GVP), and employing deterministic (Heun) vs. stochastic (Euler-Maruyama) samplers with tunable diffusion coefficients. |
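A hedged sketch of velocity matching with the linear interpolant x_t = (1 − t)·x0 + t·ε, whose regression target is the time derivative ε − x0; other interpolants such as GVP change α_t, σ_t, and the target accordingly, and this is not the SiT codebase.
```python
import torch

def velocity_loss(model, x0, cond):
    """Velocity matching with a linear interpolant x_t = (1 - t) * x0 + t * eps."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)   # continuous time in (0, 1)
    eps = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * eps
    target = eps - x0                                       # d x_t / dt
    pred = model(x_t, t.flatten(), cond)                    # predicted velocity field
    return ((pred - target) ** 2).mean()
```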
SiT consistently outperforms DiT in FID scores across all model sizes, demonstrating the effectiveness of using stochastic interpolants and learning the velocity field. The paper also finds that SDE-based sampling generally leads to better performance than ODE-based sampling, and that the optimal diffusion coefficient for SDE sampling depends on the choice of interpolant and model. Using classifier-free guidance further enhances SiT's performance, achieving a FID-50K score of 2.06, surpassing DiT in all comparable settings. |
The authors acknowledge that the performance of different samplers might vary under different computational budgets. They plan to explore the application of SiT to other downstream tasks, such as video generation and image editing, in future work. Additionally, they plan to investigate potential performance improvements by combining SiT with other advanced sampling techniques and architectural modifications. |
diffusion_model, interpolant, analysis, image_generation, transformer, sde, ode |
2310.05914 |
NEFTune: Noisy Embeddings Improve Instruction Finetuning |
Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein |
We show that language model finetuning can be improved, sometimes
dramatically, with a simple augmentation. NEFTune adds noise to the embedding
vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca
achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings.
NEFTune also improves over strong baselines on modern instruction datasets.
Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8%
improvement, and with OpenPlatypus an 8% improvement. Even powerful models
further refined with RLHF such as LLaMA-2-Chat benefit from additional training
with NEFTune. |
This paper introduces NEFTune, a simple yet effective technique for improving instruction fine-tuning of large language models (LLMs) by adding noise to embedding vectors during training. |
This paper is important because it presents a novel approach to enhance the performance of instruction-tuned LLMs, addressing the critical need for efficient use of limited instruction datasets in LLM training. |
The authors employ NEFTune, which involves adding scaled uniform noise to the embedding vectors during the forward pass of fine-tuning. They evaluate NEFTune's impact on various LLM architectures, including LLaMA-1, LLaMA-2, and OPT, using different instruction-tuning datasets like Alpaca, Evol-Instruct, ShareGPT, and OpenPlatypus. The evaluation leverages AlpacaEval and OpenLLM Leaderboard tasks to assess the conversational quality and factual accuracy of the models. |
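The augmentation itself is essentially a one-liner; a sketch under the paper's stated scaling, with noise drawn from Uniform(−1, 1) and scaled by α/√(L·d):
```python
import torch

def neftune_noise(embeddings, alpha=5.0):
    """Add NEFTune-style noise to token embeddings during training.

    embeddings: (batch, seq_len, dim). Noise is Uniform(-1, 1) scaled by
    alpha / sqrt(seq_len * dim), applied only during the finetuning forward pass.
    """
    b, seq_len, dim = embeddings.shape
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1, 1) * scale
    return embeddings + noise
```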
NEFTune significantly improves the performance of LLMs across different model sizes and datasets, leading to more fluent and informative responses. Notably, it exhibits an average improvement of 15% in AlpacaEval Win Rate. Additionally, the authors find that NEFTune helps mitigate overfitting to the instruction datasets, allowing the models to generalize better and generate more human-like responses. |
The authors acknowledge limitations such as reliance on AlpacaEval and limited computational resources for evaluating larger models. Future work includes exploring the impact of NEFTune on model safety and reliability, investigating its effectiveness with larger model variants (e.g., 70B parameters) across multiple datasets, and gaining a deeper understanding of the underlying mechanisms by which NEFTune improves performance. |
llm, analysis, instruction_finetuning, overfitting, regularization, embedding, conversational_ai |
2402.16842 |
Asymmetry in Low-Rank Adapters of Foundation Models |
Jiacheng Zhu, Kristjan Greenewald, Kimia Nadjahi, Haitz Sáez de Ocáriz Borde, Rickard Brüel Gabrielsson, Leshem Choshen, Marzyeh Ghassemi, Mikhail Yurochkin, Justin Solomon |
Parameter-efficient fine-tuning optimizes large, pre-trained foundation
models by updating a subset of parameters; in this class, Low-Rank Adaptation
(LoRA) is particularly effective. Inspired by an effort to investigate the
different roles of LoRA matrices during fine-tuning, this paper characterizes
and leverages unexpected asymmetry in the importance of low-rank adapter
matrices. Specifically, when updating the parameter matrices of a neural
network by adding a product $BA$, we observe that the $B$ and $A$ matrices have
distinct functions: $A$ extracts features from the input, while $B$ uses these
features to create the desired output. Based on this observation, we
demonstrate that fine-tuning $B$ is inherently more effective than fine-tuning
$A$, and that a random untrained $A$ should perform nearly as well as a
fine-tuned one. Using an information-theoretic lens, we also bound the
generalization of low-rank adapters, showing that the parameter savings of
exclusively training $B$ improves the bound. We support our conclusions with
experiments on RoBERTa, BART-Large, LLaMA-2, and ViTs. |
This paper investigates the asymmetry in the roles of adapter matrices in Low-Rank Adaptation (LoRA) for fine-tuning large language models, finding that the matrix projecting input features to a lower dimension (A) plays a less crucial role than the matrix mapping these features to the output (B). |
The paper is important because it provides theoretical and empirical evidence for simplifying and improving the efficiency of LoRA fine-tuning, suggesting that using a fixed, randomly initialized A matrix while solely tuning B can lead to comparable or better performance with reduced parameter usage and improved generalization. |
The authors analyze the asymmetry in LoRA through theoretical analysis of linear regression and nonlinear loss functions, along with empirical evaluations across diverse tasks, including natural language understanding (GLUE, MMLU), generation (XSum, CNN/DailyMail), and image classification (DomainBed) using RoBERTa, BART-Large, LLaMA-2, and Vision Transformer (ViT) models. |
The key results demonstrate that: (1) Tuning only the B matrix in LoRA generally outperforms tuning only A, confirming its greater importance. (2) Using a random orthogonal matrix for A while tuning B can achieve comparable or even superior performance to standard LoRA, especially when the rank of B is increased to match the parameter count, suggesting this approach improves parameter efficiency and generalization. (3) The asymmetry and benefits of tuning only B are observed across different models (RoBERTa, BART-Large, LLaMA-2, ViT) and tasks, including language understanding, generation, and image classification, indicating its broad applicability. |
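A hedged sketch of the "tune B only" variant: freeze A at a random (orthonormal-row) initialization and train only B, initialized at zero, on top of a frozen base linear layer; details of initialization and scaling follow common LoRA practice rather than the paper's exact setup.
```python
import torch
import torch.nn as nn

class LoRABOnly(nn.Module):
    """LoRA update W + (alpha/r) * B A with A frozen at a random orthogonal
    initialization and only B trained, as suggested by the asymmetry analysis."""

    def __init__(self, base_linear: nn.Linear, r=16, alpha=16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False
        d_in, d_out = base_linear.in_features, base_linear.out_features
        # frozen random projection A (orthonormal rows), trainable B initialized at zero
        a = torch.linalg.qr(torch.randn(d_in, r)).Q.T          # (r, d_in)
        self.register_buffer("A", a)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```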
The paper acknowledges limitations in the theoretical analysis, which primarily focuses on linear models and single-layer networks, and suggests extending the analysis to more complex and realistic network architectures as future work. Further exploration of the relationship between the random initialization of A and input data distribution is also proposed. |
llm, lora, peft, fine-tuning, analysis, generalization, parameter_efficiency, text_generation, text_classification, image_classification |
2404.00384 |
TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias |
Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, Kyungsu Kim |
We identify a critical bias in contemporary CLIP-based models, which we
denote as single tag bias. This bias manifests as a disproportionate focus on a
singular tag (word) while neglecting other pertinent tags, stemming from CLIP's
text embeddings that prioritize one specific tag in image-text relationships.
When deconstructing text into individual tags, only one tag tends to have high
relevancy with CLIP's image embedding, leading to biased tag relevancy. In this
paper, we introduce a novel two-step fine-tuning approach, Text-Tag
Self-Distillation (TTD), to address this challenge. TTD first extracts
image-relevant tags from text based on their similarity to the nearest pixels
then employs a self-distillation strategy to align combined masks with the
text-derived mask. This approach ensures the unbiased image-text alignment of
the CLIP-based models using only image-text pairs without necessitating
additional supervision. Our technique demonstrates model-agnostic improvements
in multi-tag classification and segmentation tasks, surpassing competing
methods that rely on external resources. The code is available at
https://github.com/shjo-april/TTD. |
This paper identifies and addresses the "single tag bias" in CLIP-based models, where the models disproportionately focus on a single tag in image-text relationships, and proposes a novel fine-tuning method called Text-Tag Self-Distillation (TTD) to mitigate this bias. |
This paper is important because it addresses a critical limitation in CLIP-based models that hinders their performance in multi-tag classification and segmentation tasks. By mitigating the single tag bias, the paper paves the way for improved image-text alignment and opens up possibilities for more accurate and robust open-vocabulary applications. |
The authors propose a two-step approach: 1) **Tag Selection by Pixel-Tag Scoring:** Instead of relying on global image embeddings prone to bias, they compute similarity scores between each tag and its most correlated pixel, enabling more accurate identification of image-relevant tags. 2) **Text-Tag Self-Distillation:** They generate an ideal image-text similarity map reflecting all relevant tags and use it to guide the model to learn from all relevant tags during fine-tuning, thus mitigating the single tag bias. |
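A minimal sketch of the pixel-tag scoring idea, assuming dense CLIP-style per-pixel image features are available; the exact scoring and mask construction in TTD differ from this stand-in.
```python
import torch
import torch.nn.functional as F

def pixel_tag_scores(pixel_feats, tag_embs):
    """Score tags by their best-matching pixel instead of the global image embedding.

    pixel_feats: (num_pixels, dim) dense image features.
    tag_embs:    (num_tags, dim) text embeddings of candidate tags.
    Returns one relevance score per tag (max cosine similarity over pixels).
    """
    p = F.normalize(pixel_feats, dim=-1)
    t = F.normalize(tag_embs, dim=-1)
    sim = p @ t.T                      # (num_pixels, num_tags) cosine similarities
    return sim.max(dim=0).values       # best pixel match per tag
```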
The proposed method demonstrates significant improvements in both multi-tag classification and segmentation tasks. It outperforms existing methods relying on external NLP models for tag selection and achieves superior results in capturing the relationship between images and multi-object text descriptions. The method also shows promising results in open-vocabulary semantic segmentation on various benchmarks, including Pascal VOC, COCO, and ADE20k. |
The authors acknowledge limitations in their current tagging method, which relies on single text inputs per image, potentially limiting the amount of positive/negative tag information utilized during training. As future work, they suggest exploring the integration of multiple text inputs per image to enrich the learning process. Additionally, they plan to investigate the underlying causes of single tag bias, such as model overfitting or training data characteristics, to further enhance the model's performance. |
clip, analysis, segmentation, open-vocabulary, image-text alignment, self-distillation, bias |
2403.05135 |
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment |
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, Gang Yu |
Diffusion models have demonstrated remarkable performance in the domain of
text-to-image generation. However, most widely used models still employ CLIP as
their text encoder, which constrains their ability to comprehend dense prompts,
encompassing multiple objects, detailed attributes, complex relationships,
long-text alignment, etc. In this paper, we introduce an Efficient Large
Language Model Adapter, termed ELLA, which equips text-to-image diffusion
models with powerful Large Language Models (LLM) to enhance text alignment
without training of either U-Net or LLM. To seamlessly bridge two pre-trained
models, we investigate a range of semantic alignment connector designs and
propose a novel module, the Timestep-Aware Semantic Connector (TSC), which
dynamically extracts timestep-dependent conditions from LLM. Our approach
adapts semantic features at different stages of the denoising process,
assisting diffusion models in interpreting lengthy and intricate prompts over
sampling timesteps. Additionally, ELLA can be readily incorporated with
community models and tools to improve their prompt-following capabilities. To
assess text-to-image models in dense prompt following, we introduce Dense
Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K
dense prompts. Extensive experiments demonstrate the superiority of ELLA in
dense prompt following compared to state-of-the-art methods, particularly in
multiple object compositions involving diverse attributes and relationships. |
This paper introduces ELLA, a lightweight adapter that equips text-to-image diffusion models with Large Language Models (LLMs) to enhance text alignment without retraining either model. It achieves this by using a Timestep-Aware Semantic Connector (TSC) that dynamically extracts timestep-dependent conditions from the LLM to guide the diffusion process. |
This paper addresses the limitation of existing text-to-image diffusion models that struggle with comprehending and following long, dense prompts containing multiple objects, attributes, and relationships. ELLA provides an efficient and effective solution by leveraging the power of LLMs while remaining compatible with existing diffusion models and tools. |
The authors propose a novel architecture, ELLA, that connects a frozen pre-trained LLM (e.g., T5-XL, LLaMA-2) with a frozen pre-trained diffusion model (e.g., Stable Diffusion). The key component, TSC, takes text features from the LLM and the current timestep embedding as input, dynamically extracting semantic information relevant to different stages of the denoising process. To train TSC, the authors constructed a large dataset of image-text pairs with dense captions generated by MLLMs. They also introduce a new benchmark, Dense Prompt Graph Benchmark (DPG-Bench), to evaluate models' ability to follow dense prompts. |
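A simplified stand-in for a timestep-aware connector: learnable queries cross-attend to frozen LLM token features, with a timestep embedding added to the queries. The real TSC uses a resampler-style design with richer timestep conditioning, so the module below is only illustrative; the dimensions and the 256-dimensional sinusoidal timestep embedding are assumptions.
```python
import torch
import torch.nn as nn

class TimestepAwareConnector(nn.Module):
    """Simplified timestep-aware semantic connector: learnable queries attend to
    frozen LLM token features, conditioned on the diffusion timestep."""

    def __init__(self, llm_dim=2048, cond_dim=768, n_queries=64, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, cond_dim) * 0.02)
        self.time_mlp = nn.Sequential(nn.Linear(256, cond_dim), nn.SiLU(),
                                      nn.Linear(cond_dim, cond_dim))
        self.proj_text = nn.Linear(llm_dim, cond_dim)
        self.attn = nn.MultiheadAttention(cond_dim, n_heads, batch_first=True)

    def forward(self, text_feats, t_emb):
        # text_feats: (batch, seq, llm_dim); t_emb: (batch, 256) timestep embedding
        b = text_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1) + self.time_mlp(t_emb)[:, None]
        kv = self.proj_text(text_feats)
        out, _ = self.attn(q, kv, kv)       # (batch, n_queries, cond_dim)
        return out                          # fed to the frozen U-Net as conditioning
```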
ELLA significantly improves the performance of existing diffusion models in following complex prompts. It outperforms state-of-the-art models on DPG-Bench and shows better text-image alignment than SDXL and PixArt-α in user studies while maintaining comparable aesthetic quality. ELLA's lightweight design allows for easy integration with community models and downstream tools like LoRA and ControlNet, enhancing their prompt-following capabilities. Ablation studies validate the effectiveness of LLM selection, TSC design, and the importance of incorporating timestep information. |
The paper acknowledges limitations in their training captions, which are synthesized by MLLM and might be unreliable for shape and spatial relationships. The authors plan to address this by exploring the integration of MLLM with diffusion models to utilize interleaved image-text input. Another limitation is the potential constraint on the aesthetic quality of generated images due to the frozen U-Net. Future work will focus on image editing capabilities and improving aesthetic quality. |
diffusion_model, llm, text-to-image, semantic_alignment, dense_prompt, timestep-aware, benchmark, analysis |
2310.13730 |
Localizing and Editing Knowledge in Text-to-Image Generative Models |
Samyadeep Basu, Nanxuan Zhao, Vlad Morariu, Soheil Feizi, Varun Manjunatha |
Text-to-Image Diffusion Models such as Stable-Diffusion and Imagen have
achieved unprecedented quality of photorealism with state-of-the-art FID scores
on MS-COCO and other generation benchmarks. Given a caption, image generation
requires fine-grained knowledge about attributes such as object structure,
style, and viewpoint amongst others. Where does this information reside in
text-to-image generative models? In our paper, we tackle this question and
understand how knowledge corresponding to distinct visual attributes is stored
in large-scale text-to-image diffusion models. We adapt Causal Mediation
Analysis for text-to-image models and trace knowledge about distinct visual
attributes to various (causal) components in the (i) UNet and (ii) text-encoder
of the diffusion model. In particular, we show that unlike generative
large-language models, knowledge about different attributes is not localized in
isolated components, but is instead distributed amongst a set of components in
the conditional UNet. These sets of components are often distinct for different
visual attributes. Remarkably, we find that the CLIP text-encoder in public
text-to-image models such as Stable-Diffusion contains only one causal state
across different visual attributes, and this is the first self-attention layer
corresponding to the last subject token of the attribute in the caption. This
is in stark contrast to the causal states in other language models which are
often the mid-MLP layers. Based on this observation of only one causal state in
the text-encoder, we introduce a fast, data-free model editing method
Diff-QuickFix which can effectively edit concepts in text-to-image models.
Diff-QuickFix can edit (ablate) concepts in under a second with a closed-form
update, providing a significant 1000x speedup and comparable editing
performance to existing fine-tuning based editing methods. |
This paper investigates how knowledge about different visual attributes is stored in large-scale text-to-image diffusion models, specifically focusing on Stable Diffusion. |
Understanding knowledge storage in text-to-image models is crucial for interpreting their decision-making and enabling targeted model editing without expensive retraining. |
The authors adapt Causal Mediation Analysis to trace knowledge corresponding to visual attributes like objects, style, color, and action within the UNet and text-encoder components of Stable Diffusion. They identify causal components by corrupting specific attribute information in captions and observing the impact of restoring activations from a clean model. |
The study reveals that knowledge in the UNet is distributed across various components with different efficacy for different attributes, unlike the localized storage observed in large language models. Remarkably, the CLIP text-encoder exhibits a single causal state: the first self-attention layer corresponding to the last subject token of the attribute in the caption. This finding led to the development of Diff-QuickFix, a fast, data-free model editing method that leverages this localized causal state for efficient concept editing. |
The paper primarily focuses on Stable Diffusion, leaving analysis on other models for future work. Additionally, exploring deeper into individual layer components, such as neurons, and investigating robustness to adversarial attacks are identified as potential research avenues. The authors also acknowledge the need to address the generalization of edits to neighboring concepts, as observed in the Eiffel Tower ablation example where edits did not fully propagate to related scenery. |
diffusion_model, analysis, interpretability, text-to-image, stable-diffusion, causal mediation analysis, model editing |
2308.07673 |
A Review of Adversarial Attacks in Computer Vision |
Yutong Zhang, Yao Li, Yin Li, Zhichang Guo |
Deep neural networks are widely used in various downstream tasks, especially in
safety-critical scenarios such as autonomous driving, yet they remain threatened
by adversarial samples. Such adversarial perturbations can be imperceptible to
human eyes while still causing DNN misclassification, often transfer between
deep learning and classical machine learning models, and can be realized in the
real world. Adversarial attacks can be divided into white-box attacks, in which
the attacker knows the model's parameters and gradients, and black-box attacks,
in which the attacker can only observe the model's inputs and outputs. In terms
of the attacker's goal, attacks are further divided into targeted attacks, where
the attacker wants the model to misclassify the original sample into a specified
class (a more demanding and practically relevant goal), and non-targeted
attacks, which only require the model to misclassify the sample. The black-box
setting is the scenario most often encountered in practice. |
This paper presents a comprehensive review of adversarial attacks in computer vision, focusing on their application in image classification, object detection, and semantic segmentation. |
This review is important because it highlights the vulnerability of deep learning models to adversarial attacks, especially in safety-critical applications like autonomous driving where robustness is paramount. It provides insights into various attack methods and their impact on different computer vision tasks, aiding researchers in developing more robust models and defense mechanisms. |
The authors conduct a literature review, categorizing attack methods based on various factors such as the attacker's knowledge (white-box vs. black-box), attack goals (targeted vs. non-targeted), query efficiency, and perturbation generation techniques. They analyze each category, discuss seminal works, and explain the principles behind them. Furthermore, they delve into the application of these attack methods in object detection and semantic segmentation, highlighting specific challenges and advancements in these domains. |
The paper reveals that deep neural networks, even those achieving high accuracy, are surprisingly susceptible to adversarial attacks. Key findings include the effectiveness of both white-box and black-box attacks, the existence of transferable adversarial examples that can fool multiple models, and the feasibility of universal adversarial perturbations effective across a wide range of inputs. Moreover, the paper emphasizes the increased vulnerability of object detection and semantic segmentation models due to their reliance on both classification and localization or pixel-level prediction. |
The paper acknowledges the ongoing arms race between attackers and defenders, indicating that existing defense mechanisms are often bypassed by new attack strategies. It suggests future work should focus on developing more robust models, possibly incorporating insights from the human visual system, and exploring certified defenses with provable robustness guarantees. Additionally, the paper encourages research on attacks and defenses in more complex real-world scenarios, moving beyond simplified assumptions. |
adversarial_attack, computer_vision, image_classification, object_detection, semantic_segmentation, white-box_attack, black-box_attack, transfer_attack, universal_adversarial_perturbation, analysis, literature_review |
2308.07665 |
Inversion-by-Inversion: Exemplar-based Sketch-to-Photo Synthesis via Stochastic Differential Equations without Training |
Ximing Xing, Chuang Wang, Haitao Zhou, Zhihao Hu, Chongxuan Li, Dong Xu, Qian Yu |
Exemplar-based sketch-to-photo synthesis allows users to generate
photo-realistic images based on sketches. Recently, diffusion-based methods
have achieved impressive performance on image generation tasks, enabling
highly-flexible control through text-driven generation or energy functions.
However, generating photo-realistic images with color and texture from sketch
images remains challenging for diffusion models. Sketches typically consist of
only a few strokes, with most regions left blank, making it difficult for
diffusion-based methods to produce photo-realistic images. In this work, we
propose a two-stage method named ``Inversion-by-Inversion" for exemplar-based
sketch-to-photo synthesis. This approach includes shape-enhancing inversion and
full-control inversion. During the shape-enhancing inversion process, an
uncolored photo is generated with the guidance of a shape-energy function. This
step is essential to ensure control over the shape of the generated photo. In
the full-control inversion process, we propose an appearance-energy function to
control the color and texture of the final generated photo. Importantly, our
Inversion-by-Inversion pipeline is training-free and can accept different types
of exemplars for color and texture control. We conducted extensive experiments
to evaluate our proposed method, and the results demonstrate its effectiveness.
The code and project can be found at
https://ximinng.github.io/inversion-by-inversion-project/. |
This paper introduces Inversion-by-Inversion, a novel two-stage method for exemplar-based sketch-to-photo synthesis using stochastic differential equations (SDE) without training, allowing users to generate photo-realistic images guided by both a sketch and an exemplar image. |
This paper is important as it addresses the challenge of generating photo-realistic images from sketches, which are inherently sparse, using pre-trained diffusion models. The proposed method effectively combines shape control from sketches with appearance control from exemplar images, advancing the field of sketch-to-photo synthesis. |
The authors propose a two-stage approach: 1) Shape-enhancing inversion: An uncolored photo is generated from the input sketch using a shape-energy function to guide the SDE inversion process, emphasizing shape preservation. 2) Full-control inversion: Using the uncolored photo and an exemplar image, the final photo is generated using both shape-energy and appearance-energy functions to guide the SDE inversion process, adding color and texture from the exemplar while retaining the sketch's shape. |
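A hedged sketch of the guidance mechanism shared by both stages: between reverse-SDE denoising updates, the sample is nudged down the gradient of one or more energy functions (shape energy in stage one, shape plus appearance energy in stage two). The SDE step and the energy functions below are placeholders, and the paper's exact update rule and energy definitions differ.
```python
import torch

def energy_guided_step(x, sde_step, energy_fns, weights):
    """One guided update: take a reverse-SDE denoising step, then descend the
    gradients of the guidance energies. Each energy function must return a
    scalar; the weights double as step sizes in this sketch."""
    x = sde_step(x)                               # ordinary reverse-SDE update
    x = x.detach().requires_grad_(True)
    total = sum(w * e(x) for e, w in zip(energy_fns, weights))
    grad = torch.autograd.grad(total, x)[0]
    return (x - grad).detach()                    # move against the energy gradient
```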
The paper shows that Inversion-by-Inversion outperforms existing SDE-based image translation methods in terms of FID score and shape fidelity, demonstrating its ability to generate more realistic and shape-consistent images. The method effectively uses various exemplars, including photos, stroke images, segmentation maps, and style images, showcasing its versatility. The ablation study confirms the importance of both the shape-enhancing step and the energy functions for achieving high-quality results. |
The authors acknowledge that future work could explore alternative shape-energy functions and appearance-energy functions to further enhance the performance. Additionally, investigating the generalization ability of the method to handle more complex scenes and diverse sketch styles is a promising direction. |
diffusion_model, sde, sketch-to-photo, exemplar-based, image_synthesis, shape_control, appearance_control, energy_function |
2404.02747 |
Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models |
Wentian Zhang, Haozhe Liu, Jinheng Xie, Francesco Faccio, Mike Zheng Shou, Jürgen Schmidhuber |
This study explores the role of cross-attention during inference in
text-conditional diffusion models. We find that cross-attention outputs
converge to a fixed point after few inference steps. Accordingly, the time
point of convergence naturally divides the entire inference process into two
stages: an initial semantics-planning stage, during which, the model relies on
cross-attention to plan text-oriented visual semantics, and a subsequent
fidelity-improving stage, during which the model tries to generate images from
previously planned semantics. Surprisingly, ignoring text conditions in the
fidelity-improving stage not only reduces computation complexity, but also
maintains model performance. This yields a simple and training-free method
called TGATE for efficient generation, which caches the cross-attention output
once it converges and keeps it fixed during the remaining inference steps. Our
empirical study on the MS-COCO validation set confirms its effectiveness. The
source code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE. |
This paper investigates the role of cross-attention in text-to-image diffusion models during inference and finds that cross-attention maps converge quickly, becoming redundant in later inference steps. |
This paper is important because it challenges the assumption that cross-attention is crucial for every inference step in text-to-image diffusion models, offering a potential path to significantly reduce computational cost without sacrificing image quality. |
The authors analyze the role of cross-attention by replacing text embeddings with null embeddings at various stages of the inference process. They then quantitatively evaluate the impact of this replacement on image generation quality using FID scores on the MS-COCO dataset. They also visualize the generated images at different inference steps to understand the dynamic of cross-attention. |
The key findings are that cross-attention outputs converge to a fixed point early in the inference process. The authors leverage this finding to develop TGATE, a training-free method that caches and reuses cross-attention outputs from early inference steps, leading to reduced computational cost (up to 50% reduction in latency) and even slight improvements in FID scores compared to baseline models. Notably, TGATE is effective across various model architectures, including both convolutional and transformer-based diffusion models. |
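A minimal sketch of the caching idea: run cross-attention normally for the first few steps, freeze its output at the gate step, and reuse the cached result for the remaining steps. `cross_attn` is a placeholder for the model's cross-attention call; this is not the released TGATE implementation.
```python
class CachedCrossAttention:
    """Wraps a cross-attention call so its output is cached at a gate step and
    reused for the remaining denoising steps (TGATE-style, sketch only)."""

    def __init__(self, cross_attn, gate_step=10):
        self.cross_attn = cross_attn     # callable: (hidden_states, text_emb) -> tensor
        self.gate_step = gate_step
        self.cache = None

    def __call__(self, hidden_states, text_emb, step):
        if step < self.gate_step:
            out = self.cross_attn(hidden_states, text_emb)
            if step == self.gate_step - 1:
                self.cache = out.detach()  # freeze the converged attention output
            return out
        return self.cache                  # skip text conditioning entirely
```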
The authors acknowledge that while TGATE brings quantitative improvements in FID scores, the visual differences in generated images might be subtle for users. As for future work, the authors suggest exploring the impact of scaling token length and image resolution on the efficiency gains provided by TGATE, hinting at its potential benefits for the emerging trend of larger input sizes in diffusion models. |
diffusion_model, cross-attention, inference, efficiency, analysis, text-to-image |
2403.18551 |
Attention Calibration for Disentangled Text-to-Image Personalization |
Yanbing Zhang, Mengping Yang, Qin Zhou, Zhe Wang |
Recent thrilling progress in large-scale text-to-image (T2I) models has
unlocked unprecedented synthesis quality of AI-generated content (AIGC)
including image generation, 3D and video composition. Further, personalized
techniques enable appealing customized production of a novel concept given only
several images as reference. However, an intriguing problem persists: Is it
possible to capture multiple, novel concepts from one single reference image?
In this paper, we identify that existing approaches fail to preserve visual
consistency with the reference image and eliminate cross-influence from
concepts. To alleviate this, we propose an attention calibration mechanism to
improve the concept-level understanding of the T2I model. Specifically, we
first introduce new learnable modifiers bound with classes to capture
attributes of multiple concepts. Then, the classes are separated and
strengthened following the activation of the cross-attention operation,
ensuring comprehensive and self-contained concepts. Additionally, we suppress
the attention activation of different classes to mitigate mutual influence
among concepts. Together, our proposed method, dubbed DisenDiff, can learn
disentangled multiple concepts from one single image and produce novel
customized images with learned concepts. We demonstrate that our method
outperforms the current state of the art in both qualitative and quantitative
evaluations. More importantly, our proposed techniques are compatible with LoRA
and inpainting pipelines, enabling more interactive experiences. |
This paper introduces DisenDiff, a personalized text-to-image generation model that can learn multiple concepts from a single image and generate novel images with those concepts in different contexts. |
The paper addresses a key limitation in existing personalized text-to-image models, which struggle to capture multiple distinct concepts from a single reference image. This is important because it allows for more flexible and creative image generation from a limited amount of input data. |
The authors propose an attention calibration mechanism for a text-to-image diffusion model. This involves introducing new learnable modifiers bound to classes to capture distinct concepts and then applying constraints within the cross-attention mechanism to ensure accurate and disentangled representation of each concept. |
DisenDiff outperforms state-of-the-art methods in both qualitative and quantitative evaluations, demonstrating superior image fidelity and concept disentanglement. The authors also showcase its flexibility in applications like personalized concept inpainting and integration with LoRA for enhanced texture details. |
The authors acknowledge limitations in disentangling fine-grained categories within the same class (e.g., dog breeds) and handling images with more than three concepts. Future work could explore algorithms tailored to these scenarios and address the limitations of existing text-to-image models when dealing with a higher number of concepts. |
diffusion_model, image_generation, personalization, attention_mechanism, disentanglement, text-to-image, inpainting, lora |
2403.14572 |
Implicit Style-Content Separation using B-LoRA |
Yarden Frenkel, Yael Vinker, Ariel Shamir, Daniel Cohen-Or |
Image stylization involves manipulating the visual appearance and texture
(style) of an image while preserving its underlying objects, structures, and
concepts (content). The separation of style and content is essential for
manipulating the image's style independently from its content, ensuring a
harmonious and visually pleasing result. Achieving this separation requires a
deep understanding of both the visual and semantic characteristics of images,
often necessitating the training of specialized models or employing heavy
optimization. In this paper, we introduce B-LoRA, a method that leverages LoRA
(Low-Rank Adaptation) to implicitly separate the style and content components
of a single image, facilitating various image stylization tasks. By analyzing
the architecture of SDXL combined with LoRA, we find that jointly learning the
LoRA weights of two specific blocks (referred to as B-LoRAs) achieves
style-content separation that cannot be achieved by training each B-LoRA
independently. Consolidating the training into only two blocks and separating
style and content allows for significantly improving style manipulation and
overcoming overfitting issues often associated with model fine-tuning. Once
trained, the two B-LoRAs can be used as independent components to allow various
image stylization tasks, including image style transfer, text-based image
stylization, consistent style generation, and style-content mixing. |
This paper presents B-LoRA, a novel method for implicit style-content separation in single images using Low-Rank Adaptation (LoRA) applied to specific transformer blocks in Stable Diffusion XL, enabling various image stylization tasks like style transfer, text-guided stylization, and consistent style generation. |
B-LoRA addresses the limitations of existing image stylization techniques, including overfitting issues associated with model fine-tuning and the need for separate models for style and content. By achieving style-content separation within a single image using a lightweight adapter, it offers flexibility, efficiency, and robust stylization capabilities. |
The authors analyzed SDXL's architecture to identify specific transformer blocks responsible for content and style. They then trained LoRA on these blocks (B-LoRAs) using a single input image and a general text prompt, resulting in an implicit style-content decomposition. The trained B-LoRAs can then be applied to various style manipulation tasks without additional training. |
B-LoRA effectively disentangles style and content, enabling high-quality image style transfer, text-guided style manipulation, and consistent style generation even for challenging inputs like stylized images and complex scenes. Extensive qualitative and quantitative evaluations, including a user study, demonstrate its superiority over alternative approaches. |
The authors acknowledge limitations such as color separation affecting identity preservation, potential style leakage from background elements in style images, and challenges with complex scenes. They suggest future work focusing on finer style-content sub-component separation and extending B-LoRA for multi-object and multi-style combinations. |
diffusion_model, lora, image_stylization, style_transfer, text_guided_image_editing, analysis, sdxl |
2404.03592 |
ReFT: Representation Finetuning for Language Models |
Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts |
Parameter-efficient fine-tuning (PEFT) methods seek to adapt large models via
updates to a small number of weights. However, much prior interpretability work
has shown that representations encode rich semantic information, suggesting
that editing representations might be a more powerful alternative. Here, we
pursue this hypothesis by developing a family of $\textbf{Representation
Finetuning (ReFT)}$ methods. ReFT methods operate on a frozen base model and
learn task-specific interventions on hidden representations. We define a strong
instance of the ReFT family, Low-rank Linear Subspace ReFT (LoReFT). LoReFT is
a drop-in replacement for existing PEFTs and learns interventions that are
10x-50x more parameter-efficient than prior state-of-the-art PEFTs. We showcase
LoReFT on eight commonsense reasoning tasks, four arithmetic reasoning tasks,
Alpaca-Eval v1.0, and GLUE. In all these evaluations, LoReFT delivers the best
balance of efficiency and performance, and almost always outperforms
state-of-the-art PEFTs. We release a generic ReFT training library publicly at
https://github.com/stanfordnlp/pyreft. |
This paper introduces ReFT, a novel parameter-efficient fine-tuning method that modifies model representations through learned interventions, outperforming weight-based methods like LoRA in efficiency and achieving state-of-the-art performance on various NLP tasks. |
This paper is important because it challenges the prevailing focus on weight-based PEFTs, proposing a more efficient and interpretable approach by leveraging the rich semantic information encoded in model representations. This approach opens up new possibilities for controlling and understanding large language models. |
The authors develop ReFT, a method that learns low-rank interventions on model representations, inspired by causal abstraction and distributed interchange interventions. They evaluate ReFT on four diverse NLP benchmarks, including commonsense reasoning, arithmetic reasoning, instruction-following, and natural language understanding, comparing its performance and efficiency against existing PEFT methods like LoRA, Adapters, and Prefix-tuning (a minimal code sketch follows this entry). |
ReFT significantly outperforms previous PEFT methods on commonsense reasoning, instruction-following, and natural language understanding benchmarks, achieving state-of-the-art results while using 10-50 times fewer parameters than LoRA. It also demonstrates strong performance on arithmetic reasoning tasks, surpassing Prefix-tuning. Furthermore, the paper explores the memorization capabilities of ReFT, showing that a single low-rank intervention can store a surprisingly large amount of information, and provides evidence for the superposition of token identities in model representations. |
The authors acknowledge limitations in terms of model diversity, primarily exploring LLaMA-family models. Future work could investigate ReFT's effectiveness on other model families like Mistral or GPT. Further exploration of ReFT's design space, including automating the hyperparameter search and developing more effective interventions for specific tasks like arithmetic reasoning, is also suggested. Additionally, the authors highlight the need for more robust evaluation practices in PEFT research, advocating for benchmarks that prevent test-set hill-climbing and allow for fair comparisons. |
llm, peft, representation_finetuning, analysis, interpretability |
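For illustration, a minimal PyTorch sketch of a LoReFT-style intervention on frozen hidden states, assuming the low-rank form h + R^T(Wh + b - Rh) with R constrained to have orthonormal rows; the dimensions and the use of torch's orthogonal parametrization are illustrative choices, not the authors' exact implementation.

```python
# Minimal sketch of a LoReFT-style intervention: edit only the r-dimensional
# subspace spanned by the rows of R, leaving the base model frozen.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal


class LoReFTIntervention(nn.Module):
    def __init__(self, hidden_dim: int, rank: int):
        super().__init__()
        # R: low-rank projection constrained to (approximately) orthonormal rows.
        self.R = orthogonal(nn.Linear(hidden_dim, rank, bias=False))
        # W, b: learned linear source of the edited subspace values.
        self.W = nn.Linear(hidden_dim, rank)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, hidden_dim) hidden states at the chosen intervention positions.
        proj = self.R(h)                      # R h
        target = self.W(h)                    # W h + b
        return h + (target - proj) @ self.R.weight   # h + R^T(Wh + b - Rh)


h = torch.randn(2, 5, 768)
reft = LoReFTIntervention(hidden_dim=768, rank=4)
print(reft(h).shape)  # torch.Size([2, 5, 768])
```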
2401.00110 |
Diffusion Model with Perceptual Loss |
Shanchuan Lin, Xiao Yang |
Diffusion models trained with mean squared error loss tend to generate
unrealistic samples. Current state-of-the-art models rely on classifier-free
guidance to improve sample quality, yet its surprising effectiveness is not
fully understood. In this paper, we show that the effectiveness of
classifier-free guidance partly originates from it being a form of implicit
perceptual guidance. As a result, we can directly incorporate perceptual loss
in diffusion training to improve sample quality. Since the score matching
objective used in diffusion training strongly resembles the denoising
autoencoder objective used in unsupervised training of perceptual networks, the
diffusion model itself is a perceptual network and can be used to generate
meaningful perceptual loss. We propose a novel self-perceptual objective that
results in diffusion models capable of generating more realistic samples. For
conditional generation, our method only improves sample quality without
entanglement with the conditional input and therefore does not sacrifice sample
diversity. Our method can also improve sample quality for unconditional
generation, which was not possible with classifier-free guidance before. |
This paper proposes a novel "self-perceptual" training objective for diffusion models that leverages the model itself as a perceptual network to improve the realism of generated images. |
This paper addresses the limitations of relying on classifier-free guidance for improving sample quality in diffusion models by introducing a method that enhances realism without sacrificing diversity, works for both conditional and unconditional generation, and is integrated directly into the training process. |
The authors propose a "self-perceptual" objective where a frozen copy of the diffusion model, trained with a standard MSE loss, acts as a perceptual network. During training, the online model generates an image, both images are passed through the perceptual network at a randomly sampled timestep, and the MSE loss between their hidden features is backpropagated to the online model. |
The self-perceptual objective demonstrably improves the realism of generated images, both qualitatively and quantitatively (FID, IS), compared to models trained solely with MSE loss, particularly in unconditional image generation. However, it doesn't yet surpass the performance of classifier-free guidance combined with MSE loss for text-to-image generation. |
The authors acknowledge that the self-perceptual objective currently doesn't outperform classifier-free guidance in text-to-image generation. Additionally, they identify grid-like artifacts in the generated images as an area for future investigation. Future work could focus on refining the perceptual loss mechanism, exploring alternative distance functions, and addressing the identified artifacts. |
diffusion_model, perceptual_loss, image_generation, unconditional_generation, classifier-free_guidance |
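For illustration, a minimal self-contained sketch of the self-perceptual training loss described above, using a toy denoiser so the snippet runs on its own; the architecture, x0-parameterization, noise schedule, and choice of feature layer are assumptions, not the paper's exact setup.

```python
# Minimal sketch: a frozen copy of the (toy) denoiser serves as a perceptual network;
# the loss compares its intermediate features for the online model's x0-prediction
# against those for the real image, at a freshly sampled noise level.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyDenoiser(nn.Module):
    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.encoder = nn.Conv2d(channels, hidden, 3, padding=1)
        self.mid = nn.Conv2d(hidden, hidden, 3, padding=1)      # layer used for "perceptual" features
        self.decoder = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x_t, t, return_features=False):
        # Timestep conditioning is omitted for brevity in this toy model.
        h = F.silu(self.encoder(x_t))
        feats = F.silu(self.mid(h))
        out = self.decoder(feats)                                # predicted clean image x0
        return (out, feats) if return_features else out


def self_perceptual_loss(online, frozen, x0, alphas_cumprod):
    b = x0.shape[0]
    # 1) Noise the real image and let the online model predict x0.
    t1 = torch.randint(0, len(alphas_cumprod), (b,))
    a1 = alphas_cumprod[t1].view(b, 1, 1, 1)
    x_t1 = a1.sqrt() * x0 + (1 - a1).sqrt() * torch.randn_like(x0)
    x0_pred = online(x_t1, t1)

    # 2) Noise the prediction and the real image with the *same* noise at a new
    #    timestep, then compare the frozen model's hidden features.
    t2 = torch.randint(0, len(alphas_cumprod), (b,))
    a2 = alphas_cumprod[t2].view(b, 1, 1, 1)
    shared_noise = torch.randn_like(x0)
    with torch.no_grad():
        _, feats_real = frozen(a2.sqrt() * x0 + (1 - a2).sqrt() * shared_noise, t2, return_features=True)
    _, feats_pred = frozen(a2.sqrt() * x0_pred + (1 - a2).sqrt() * shared_noise, t2, return_features=True)
    return F.mse_loss(feats_pred, feats_real)


online = ToyDenoiser()
frozen = copy.deepcopy(online).requires_grad_(False)
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)   # assumed toy schedule
loss = self_perceptual_loss(online, frozen, torch.randn(2, 3, 32, 32), alphas_cumprod)
loss.backward()
```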
2311.17035 |
Scalable Extraction of Training Data from (Production) Language Models |
Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee |
This paper studies extractable memorization: training data that an adversary
can efficiently extract by querying a machine learning model without prior
knowledge of the training dataset. We show an adversary can extract gigabytes
of training data from open-source language models like Pythia or GPT-Neo,
semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing
techniques from the literature suffice to attack unaligned models; in order to
attack the aligned ChatGPT, we develop a new divergence attack that causes the
model to diverge from its chatbot-style generations and emit training data at a
rate 150x higher than when behaving properly. Our methods show practical
attacks can recover far more data than previously thought, and reveal that
current alignment techniques do not eliminate memorization. |
This paper investigates "extractable memorization" in large language models, focusing on the ability of adversaries to extract training data from these models without prior knowledge of the training set. |
The paper highlights the significant privacy implications of training large language models, demonstrating that even aligned models like ChatGPT can leak substantial amounts of training data, including personally identifiable information (PII). This raises concerns about the security of training data and the effectiveness of current alignment techniques in preventing memorization. |
The authors develop a scalable methodology to detect memorization in large language models by matching model outputs against publicly available web-scale datasets using suffix arrays. For aligned models like ChatGPT, they introduce a novel "divergence" attack that prompts the model to deviate from its conversational style and emit training data at a much higher rate. They also employ a Good-Turing estimator to extrapolate total memorization based on the rate of unique memorized outputs (a minimal sketch of the Good-Turing step follows this entry). |
The authors find that all models, including open-source, semi-closed, and closed (API-based) models, exhibit extractable memorization. Larger and more capable models are more vulnerable to data extraction attacks. Notably, their divergence attack on ChatGPT reveals that it is significantly more susceptible to memorization than previously thought, leaking gigabytes of training data, including PII. They also find that certain words are more effective at eliciting memorized outputs during the divergence attack. The study demonstrates that current alignment techniques do not eliminate memorization and that discoverable memorization is a useful but not perfect proxy for extractable memorization. |
The authors acknowledge that their analysis may underestimate the true memorization rate due to limitations in the size and coverage of their auxiliary dataset. They also note that their attack on ChatGPT is specific to this model and may not generalize to other aligned chatbots. Future work could investigate the effectiveness of data deduplication techniques in mitigating memorization, explore the relationship between model capacity and memorization, and develop more generalizable attacks to assess the privacy of black-box RLHF-aligned models. |
llm, analysis, memorization, privacy, data_extraction, alignment, chatgpt, divergence_attack, suffix_array, good-turing_estimator, pii |
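For illustration, a minimal sketch of the Good-Turing-style extrapolation mentioned above: the probability that the next generation reveals a not-yet-seen memorized sequence is estimated from the count of memorized sequences observed exactly once; the input format is an assumption.

```python
# Minimal sketch of a Good-Turing estimate of how likely the *next* sample is to
# reveal a new (not yet observed) memorized sequence.
from collections import Counter


def prob_unseen_memorized(memorized_samples):
    """memorized_samples: list of memorized strings extracted so far (with repeats)."""
    counts = Counter(memorized_samples)
    n_total = len(memorized_samples)
    n_singletons = sum(1 for c in counts.values() if c == 1)
    # Good-Turing: mass assigned to unseen items ~ (# items seen exactly once) / (total draws)
    return n_singletons / max(n_total, 1)


extractions = ["foo@bar.com", "lorem ipsum", "foo@bar.com", "GPL header", "lorem ipsum", "ssn 123"]
print(prob_unseen_memorized(extractions))  # 2 singletons out of 6 draws -> ~0.33
```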
2404.04095 |
Dynamic Prompt Optimizing for Text-to-Image Generation |
Wenyi Mo, Tianyu Zhang, Yalong Bai, Bing Su, Ji-Rong Wen, Qing Yang |
Text-to-image generative models, specifically those based on diffusion models
like Imagen and Stable Diffusion, have made substantial advancements. Recently,
there has been a surge of interest in the delicate refinement of text prompts.
Users assign weights or alter the injection time steps of certain words in the
text prompts to improve the quality of generated images. However, the success
of fine-control prompts depends on the accuracy of the text prompts and the
careful selection of weights and time steps, which requires significant manual
intervention. To address this, we introduce the \textbf{P}rompt
\textbf{A}uto-\textbf{E}diting (PAE) method. Besides refining the original
prompts for image generation, we further employ an online reinforcement
learning strategy to explore the weights and injection time steps of each word,
leading to the dynamic fine-control prompts. The reward function during
training encourages the model to consider aesthetic score, semantic
consistency, and user preferences. Experimental results demonstrate that our
proposed method effectively improves the original prompts, generating visually
more appealing images while maintaining semantic alignment. Code is available
at https://github.com/Mowenyii/PAE. |
This paper introduces PAE, a novel two-stage framework employing reinforcement learning to automatically edit and refine text prompts for diffusion-based text-to-image synthesis, enhancing both image quality and alignment with user intent. |
This work addresses the challenge of manual prompt engineering in text-to-image generation. It enables fine-grained control over image generation by dynamically adjusting word importance and injection time steps in the diffusion process, leading to higher-quality images that better reflect user preferences. |
The authors propose a two-stage training process: 1) Fine-tuning a language model on a curated text-image dataset to refine initial prompts. 2) Using online reinforcement learning to optimize a policy model, which learns to add modifiers with specific effect ranges and weights to the refined prompts, guided by a reward function that considers aesthetic quality, semantic consistency, and user preference (a minimal code sketch of such a reward follows this entry). |
PAE generates higher-quality images compared to using short prompts or prompts generated by other methods, evidenced by improved aesthetic scores, CLIP scores, and PickScores. The method demonstrates robust performance on both in-domain and out-of-domain datasets, highlighting its versatility and generalization ability. The learned policy model exhibits a preference for adding modifiers related to art trends, styles, and textures, leading to more visually appealing results without significantly altering the prompt's original meaning. |
The authors acknowledge limitations regarding potential for attribute leakage and missing objects, suggesting the incorporation of control attention maps into the action space for finer control over the generation process as future work. Further improvements could involve integrating additional reward considerations like high resolution and proportional composition to enhance image quality and realism. The paper also suggests exploring techniques to ensure consistent role generation building upon the model's capability to maintain identity consistency. |
diffusion_model, text-to-image, prompt_engineering, reinforcement_learning, aesthetic_quality, semantic_consistency, user_preference |
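For illustration, a minimal sketch of a combined reward of the kind described above; the individual scorers are passed in as callables and the weights are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a combined reward mixing aesthetics, semantic consistency with the
# original (pre-edit) prompt, and a learned preference score.
def pae_reward(image, original_prompt, aesthetic_fn, clip_sim_fn, preference_fn,
               w_aes=1.0, w_sem=1.0, w_pref=1.0):
    aesthetic = aesthetic_fn(image)                     # e.g. an aesthetic predictor score
    semantic = clip_sim_fn(image, original_prompt)      # image-text similarity w.r.t. the original prompt
    preference = preference_fn(image, original_prompt)  # e.g. a PickScore-style preference model
    return w_aes * aesthetic + w_sem * semantic + w_pref * preference
```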
2308.09991 |
AltDiffusion: A Multilingual Text-to-Image Diffusion Model |
Fulong Ye, Guang Liu, Xinya Wu, Ledell Wu |
Large Text-to-Image(T2I) diffusion models have shown a remarkable capability
to produce photorealistic and diverse images based on text inputs. However,
existing works only support limited language input, e.g., English, Chinese, and
Japanese, leaving users beyond these languages underserved and blocking the
global expansion of T2I models. Therefore, this paper presents AltDiffusion, a
novel multilingual T2I diffusion model that supports eighteen different
languages. Specifically, we first train a multilingual text encoder based on
the knowledge distillation. Then we plug it into a pretrained English-only
diffusion model and train the model with a two-stage schema to enhance the
multilingual capability, including concept alignment and quality improvement
stage on a large-scale multilingual dataset. Furthermore, we introduce a new
benchmark, which includes Multilingual-General-18(MG-18) and
Multilingual-Cultural-18(MC-18) datasets, to evaluate the capabilities of T2I
diffusion models for generating high-quality images and capturing
culture-specific concepts in different languages. Experimental results on both
MG-18 and MC-18 demonstrate that AltDiffusion outperforms current
state-of-the-art T2I models, e.g., Stable Diffusion in multilingual
understanding, especially with respect to culture-specific concepts, while
still having comparable capability for generating high-quality images. All
source code and checkpoints can be found at
https://github.com/superhero-7/AltDiffuson. |
This paper introduces AltDiffusion, a novel multilingual text-to-image diffusion model capable of generating images from prompts in eighteen different languages. |
This paper is important because it addresses the language limitations of existing text-to-image models, making them accessible to a wider global audience and improving their ability to understand and generate images from prompts with culture-specific concepts. |
The authors first train a multilingual text encoder using knowledge distillation from a pre-trained English CLIP model. This encoder is then integrated into a pre-trained English diffusion model and fine-tuned using a two-stage training schema. The first stage aligns the text encoder and the diffusion model's embedding space, while the second stage focuses on improving the quality of generated images using a high-quality multilingual dataset and classifier-free guidance (a minimal code sketch follows this entry). |
AltDiffusion outperforms existing multilingual text-to-image models in terms of both image quality and multilingual understanding, especially on culture-specific concepts. It achieves comparable results to the English Stable Diffusion model on general prompts and exhibits better performance in understanding and generating images from prompts containing culture-specific concepts. |
The paper does not explicitly mention limitations, but future work could explore expanding the model to support more languages, improving the generation quality for certain languages, and further evaluating the model's capabilities in different downstream applications. |
diffusion_model, multilingual, text-to-image, culture-specific, knowledge_distillation |
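For illustration, a minimal sketch of the text-encoder distillation step: a trainable multilingual student is pushed to match a frozen English CLIP teacher on parallel captions; the encoder interfaces and the MSE objective are assumptions.

```python
# Minimal sketch of text-encoder knowledge distillation on parallel caption pairs.
import torch
import torch.nn.functional as F


def distillation_loss(student_encoder, teacher_encoder, non_english_texts, english_texts):
    with torch.no_grad():
        teacher_emb = teacher_encoder(english_texts)      # frozen English CLIP text features
    student_emb = student_encoder(non_english_texts)      # trainable multilingual text features
    return F.mse_loss(student_emb, teacher_emb)
```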
2403.16627 |
SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions |
Yuda Song, Zehao Sun, Xuanwu Yin |
Recent advancements in diffusion models have positioned them at the forefront
of image generation. Despite their superior performance, diffusion models are
not without drawbacks; they are characterized by complex architectures and
substantial computational demands, resulting in significant latency due to
their iterative sampling process. To mitigate these limitations, we introduce a
dual approach involving model miniaturization and a reduction in sampling
steps, aimed at significantly decreasing model latency. Our methodology
leverages knowledge distillation to streamline the U-Net and image decoder
architectures, and introduces an innovative one-step DM training technique that
utilizes feature matching and score distillation. We present two models,
SDXS-512 and SDXS-1024, achieving inference speeds of approximately 100 FPS
(30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a single GPU,
respectively. Moreover, our training approach offers promising applications in
image-conditioned control, facilitating efficient image-to-image translation. |
This paper introduces SDXS, a novel approach to distill large-scale diffusion models for text-to-image generation into efficient models capable of real-time inference on GPUs, achieving speeds of up to 100 FPS for 512x512 images and 30 FPS for 1024x1024 images. |
This work is important as it addresses the limitations of traditional diffusion models, which suffer from slow inference speeds due to their multi-step sampling process, hindering their deployment on edge devices or applications requiring real-time performance. |
The authors employ a dual approach: 1) Model miniaturization: Knowledge distillation is used to compress the U-Net and image decoder architectures. 2) One-step training: A novel training technique combines feature matching and score distillation to reduce the sampling process to a single step. |
The resulting models, SDXS-512 and SDXS-1024, demonstrate significant speed improvements (30x and 60x faster than their base counterparts) while maintaining comparable image quality. Furthermore, the proposed method can be adapted for image-conditioned generation tasks using ControlNet, enabling applications like image-to-image translation. |
The authors acknowledge limitations in image diversity when using ControlNet for image-to-image translation. Future work will focus on improving diversity and exploring applications like inpainting and super-resolution, particularly on edge devices. |
diffusion_model, knowledge_distillation, one-step_training, real-time_inference, text-to-image, image-to-image, controlnet, latency_optimization |
2311.17086 |
PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation |
Jian Ma, Chen Chen, Qingsong Xie, Haonan Lu |
Text-to-image diffusion models are well-known for their ability to generate
realistic images based on textual prompts. However, the existing works have
predominantly focused on English, lacking support for non-English text-to-image
models. The most commonly used translation methods cannot solve the generation
problem related to language culture, while training from scratch on a specific
language dataset is prohibitively expensive. In this paper, we are inspired to
propose a simple plug-and-play language transfer method based on knowledge
distillation. All we need to do is train a lightweight MLP-like
parameter-efficient adapter (PEA) with only 6M parameters under teacher
knowledge distillation along with a small parallel data corpus. We are
surprised to find that freezing the parameters of UNet can still achieve
remarkable performance on the language-specific prompt evaluation set,
demonstrating that PEA can stimulate the potential generation ability of the
original UNet. Additionally, it closely approaches the performance of the
English text-to-image model on a general prompt evaluation set. Furthermore,
our adapter can be used as a plugin to achieve significant results in
downstream tasks in cross-lingual text-to-image generation. Code will be
available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion |
This paper introduces PEA-Diffusion, a novel method using a plug-and-play adapter and knowledge distillation to adapt English-based text-to-image diffusion models for non-English languages and culture-specific image generation. |
This paper is important because it addresses the limitations of current text-to-image models that primarily focus on English, making them accessible to non-English speakers and enabling the generation of culturally relevant images. |
The authors propose PEA-Diffusion, which uses a lightweight MLP adapter and knowledge distillation from a pre-trained English diffusion model (Stable Diffusion) to guide the learning of a non-English counterpart. They freeze the parameters of the original model, train the adapter with a small parallel corpus, and employ a hybrid training strategy that leverages both parallel and culture-specific image-text pairs (a minimal code sketch follows this entry). |
PEA-Diffusion achieves significant improvements over baseline methods like translation, AltDiffusion, and GlueGen, particularly in generating culturally relevant images. It demonstrates superior performance on CLIPScore for culture-specific prompts, retains strong performance on general prompts, and exhibits low training costs and plug-and-play capabilities with other downstream tasks like LoRA, ControlNet, and Inpainting. |
The paper acknowledges limitations in the performance of language-specific CLIP encoders, potentially hindering the model's generalizability. Additionally, the approach is limited by the capabilities of the base English model. Future work aims to address these limitations and explore further improvements in both general and culture-specific image generation. |
diffusion_model, language_transfer, knowledge_distillation, multilingual, text-to-image, culture-specific, adapter, parameter-efficient |
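For illustration, a minimal sketch of the adapter-plus-distillation recipe: a small MLP adapter maps non-English text features into the frozen UNet's conditioning space and is trained so the student's noise prediction matches the English teacher's on parallel prompts; all module interfaces and dimensions are illustrative assumptions.

```python
# Minimal sketch: only the adapter receives gradients; both text encoders and the UNet
# (assumed to be a callable returning the predicted noise) stay frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PEAdapter(nn.Module):
    def __init__(self, in_dim=1024, out_dim=768, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim))

    def forward(self, text_features):
        return self.mlp(text_features)


def pea_distillation_loss(unet, adapter, multilingual_encoder, english_encoder, x_t, t, prompt_pair):
    non_en_prompt, en_prompt = prompt_pair
    with torch.no_grad():
        teacher_cond = english_encoder(en_prompt)               # frozen English text features
        teacher_eps = unet(x_t, t, teacher_cond)                # teacher noise prediction
        student_text = multilingual_encoder(non_en_prompt)      # frozen multilingual text features
    student_eps = unet(x_t, t, adapter(student_text))           # gradient flows only into the adapter
    return F.mse_loss(student_eps, teacher_eps)
```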
2312.03766 |
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment |
Brian Gordon, Yonatan Bitton, Yonatan Shafir, Roopal Garg, Xi Chen, Dani Lischinski, Daniel Cohen-Or, Idan Szpektor |
While existing image-text alignment models reach high quality binary
assessments, they fall short of pinpointing the exact source of misalignment.
In this paper, we present a method to provide detailed textual and visual
explanation of detected misalignments between text-image pairs. We leverage
large language models and visual grounding models to automatically construct a
training set that holds plausible misaligned captions for a given image and
corresponding textual explanations and visual indicators. We also publish a new
human curated test set comprising ground-truth textual and visual misalignment
annotations. Empirical results show that fine-tuning vision language models on
our training set enables them to articulate misalignments and visually indicate
them within images, outperforming strong baselines both on the binary alignment
classification and the explanation generation tasks. Our method code and human
curated test set are available at: https://mismatch-quest.github.io/ |
This paper presents a method to explain the misalignment between text and images in image-text alignment models by leveraging LLMs and visual grounding models to generate plausible misaligned captions and their corresponding textual and visual explanations. |
This paper is important because it addresses the limitation of existing image-text alignment models which only provide a binary assessment of alignment and fail to pinpoint the source of misalignment. The proposed method enables detailed understanding of misalignment causes and facilitates the development of better image-text alignment models. |
The authors propose a method called Mismatch-Quest which first collects aligned image-text pairs from various datasets, then utilizes LLMs to generate misaligned captions along with their textual and visual explanations. To ensure quality, they validate the generated captions and feedback using entailment models and utilize a visual grounding model to annotate the misalignments with bounding boxes. |
The authors create a comprehensive training set named TV-Feedback with 3 million instances. They also introduce a human-annotated test set named Mismatch-Quest Benchmark with 2,008 instances. Fine-tuning PaLI vision language models on TV-Feedback outperforms other baselines on both binary alignment classification and explanation generation tasks, achieving over 10% improvement in alignment accuracy and 20% in textual feedback entailment. |
The authors identify limitations like failing to handle scenarios with no visual feedback expected and struggling with instances requiring identification of multiple misalignments. Future work includes enriching the training set with such scenarios to improve the model's ability to address diverse misalignment types. |
image-text_alignment, llm, visual_grounding, misalignment_explanation, dataset, analysis, evaluation |
2311.13600 |
ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs |
Viraj Shah, Nataniel Ruiz, Forrester Cole, Erika Lu, Svetlana Lazebnik, Yuanzhen Li, Varun Jampani |
Methods for finetuning generative models for concept-driven personalization
generally achieve strong results for subject-driven or style-driven generation.
Recently, low-rank adaptations (LoRA) have been proposed as a
parameter-efficient way of achieving concept-driven personalization. While
recent work explores the combination of separate LoRAs to achieve joint
generation of learned styles and subjects, existing techniques do not reliably
address the problem; they often compromise either subject fidelity or style
fidelity. We propose ZipLoRA, a method to cheaply and effectively merge
independently trained style and subject LoRAs in order to achieve generation of
any user-provided subject in any user-provided style. Experiments on a wide
range of subject and style combinations show that ZipLoRA can generate
compelling results with meaningful improvements over baselines in subject and
style fidelity while preserving the ability to recontextualize. Project page:
https://ziplora.github.io |
This paper introduces ZipLoRA, a novel optimization-based method for merging independently trained style and content LoRAs (Low-Rank Adaptations) for text-to-image diffusion models. This allows for the generation of any user-provided subject in any user-provided style, enabling personalized and stylized image creation. |
This paper is important because it addresses a key limitation in existing text-to-image generation models: the ability to combine specific subjects with specific styles in a controllable and efficient manner. It achieves this by efficiently merging independently trained LoRAs, allowing for versatile and personalized image generation while preserving the subject's identity and desired style. |
The authors leverage two key insights: (1) LoRA weight update matrices are sparse, and (2) directly merging highly aligned LoRA weights performs poorly. They propose an optimization method that learns to merge style and content LoRAs by minimizing a loss function that encourages both style and subject fidelity while minimizing signal interference between the two LoRAs (a minimal code sketch follows this entry). |
ZipLoRA demonstrates superior performance compared to direct merging, joint training, and StyleDrop methods. It shows impressive results in generating stylized images while preserving subject fidelity and allows for control over the extent of stylization. The method also retains the ability to generate individual concepts (subject or style) accurately, demonstrating its versatility. User studies and quantitative metrics further highlight ZipLoRA's effectiveness in achieving personalized stylizations. |
The authors do not explicitly mention limitations. However, potential areas for future work could include exploring: (1) extension of ZipLoRA to handle multiple styles or subjects, (2) exploring alternative optimization strategies or regularization techniques for more robust merging, and (3) investigating the application of ZipLoRA to other diffusion-based generative tasks beyond image stylization. |
diffusion_model, lora, stylization, personalization, image_generation, text-to-image, sdxl, dreambooth, styledrop |
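For illustration, a minimal sketch of merging two LoRA deltas with learnable per-column merger vectors plus an overlap penalty, in the spirit of the method described above; the exact fidelity losses on the reference images are left abstract, and the penalty shown is an illustrative variant.

```python
# Minimal sketch: merged update = m_c * dW_content + m_s * dW_style (per column),
# with a penalty discouraging the two merger vectors from claiming the same columns.
import torch


def merged_delta(dW_content, dW_style, m_content, m_style):
    # dW_*: (out_dim, in_dim) LoRA weight deltas; m_*: (in_dim,) learnable merger vectors.
    return dW_content * m_content.unsqueeze(0) + dW_style * m_style.unsqueeze(0)


def interference_penalty(m_content, m_style):
    # Columns kept by both mergers are associated with signal interference.
    return torch.abs(m_content * m_style).sum()
```

In the full method, the merger vectors would be optimized jointly against the subject-fidelity and style-fidelity reconstruction losses plus this interference term, while both original LoRAs stay fixed.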
2402.01103 |
Compositional Generative Modeling: A Single Model is Not All You Need |
Yilun Du, Leslie Kaelbling |
Large monolithic generative models trained on massive amounts of data have
become an increasingly dominant approach in AI research. In this paper, we
argue that we should instead construct large generative systems by composing
smaller generative models together. We show how such a compositional generative
approach enables us to learn distributions in a more data-efficient manner,
enabling generalization to parts of the data distribution unseen at training
time. We further show how this enables us to program and construct new
generative models for tasks completely unseen at training. Finally, we show
that in many cases, we can discover separate compositional components from
data. |
This paper argues for a compositional approach to generative modeling, proposing the construction of large generative systems by composing smaller generative models instead of relying solely on large monolithic models. |
This paper is important because it addresses limitations of current large generative models, such as poor compositionality, data inefficiency, and difficulty in adaptation. The proposed compositional approach offers a more scalable, data-efficient, and generalizable alternative. |
The authors present a theoretical framework for compositional generative modeling and illustrate its benefits in various domains including image synthesis, trajectory modeling, and planning. They demonstrate how composing simpler models can represent complex distributions more effectively, generalize to unseen data regions, and enable the construction of new generative models for unseen tasks. They also discuss methods for discovering compositional components from data (a minimal code sketch follows this entry). |
The paper shows that compositional models are more data-efficient, generalize better to unseen data, and can be composed to solve new tasks. For example, composing models trained on different subsets of data allows for generating hybrid scenes with elements from each subset. Additionally, the paper demonstrates how compositional models can be used for planning, constraint satisfaction, and style adaptation in video generation. |
The paper acknowledges limitations in implementing compositional sampling with common generative model parameterizations and suggests using Energy-Based Models (EBMs) as a solution. Future work includes developing efficient methods for sampling from joint distributions, discovering compositional structures, and dynamically adapting the structure of generative models under distribution shift. |
generative_modeling, modularity, compositionality, ebm, diffusion_model, analysis, video, image, planning |
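For illustration, a minimal sketch of score-level composition for diffusion-style components, one concrete way to realize the compositional sampling discussed above; the denoiser interface and weighting are assumptions.

```python
# Minimal sketch: a product-of-components distribution can be (approximately) sampled by
# summing the components' predicted noises/scores at every denoising step.
import torch


def composed_noise_prediction(denoisers, weights, x_t, t):
    eps = torch.zeros_like(x_t)
    for model, w in zip(denoisers, weights):
        eps = eps + w * model(x_t, t)       # each component contributes its predicted noise
    return eps
```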
2308.10187 |
Spiking-Diffusion: Vector Quantized Discrete Diffusion Model with Spiking Neural Networks |
Mingxuan Liu, Jie Gan, Rui Wen, Tao Li, Yongli Chen, Hong Chen |
Spiking neural networks (SNNs) have tremendous potential for energy-efficient
neuromorphic chips due to their binary and event-driven architecture. SNNs have
been primarily used in classification tasks, with limited exploration of image
generation tasks. To fill this gap, we propose a Spiking-Diffusion model, which
is based on the vector quantized discrete diffusion model. First, we develop a
vector quantized variational autoencoder with SNNs (VQ-SVAE) to learn a
discrete latent space for images. In VQ-SVAE, image features are encoded using
both the spike firing rate and postsynaptic potential, and an adaptive spike
generator is designed to restore embedding features in the form of spike
trains. Next, we perform absorbing state diffusion in the discrete latent space
and construct a spiking diffusion image decoder (SDID) with SNNs to denoise the
image. Our work is the first to build the diffusion model entirely from SNN
layers. Experimental results on MNIST, FMNIST, KMNIST, Letters, and Cifar10
demonstrate that Spiking-Diffusion outperforms the existing SNN-based
generation model. We achieve FIDs of 37.50, 91.98, 59.23, 67.41, and 120.5 on
the above datasets respectively, with reductions of 58.60\%, 18.75\%, 64.51\%,
29.75\%, and 44.88\% in FIDs compared with the state-of-the-art work. Our code will
be available at \url{https://github.com/Arktis2022/Spiking-Diffusion}. |
This paper introduces Spiking-Diffusion, a novel generative model for image generation that utilizes spiking neural networks (SNNs) to achieve both energy efficiency and biological plausibility. |
This paper is significant because it is the first to successfully implement a diffusion model entirely using SNN layers, opening up new possibilities for energy-efficient and brain-inspired image generation. Previous SNN-based generative models faced limitations in quality and capacity, making this a notable advancement in the field. |
The authors develop Spiking-Diffusion in two stages: 1) VQ-SVAE: They create a Vector Quantized Spiking Variational Autoencoder to learn discrete latent representations of images. This involves encoding image features using spike firing rate (SFR) and postsynaptic potential (PSP), and designing an adaptive spike generator (ASG) to convert embeddings back into spike trains for the decoder. 2) SDID: They employ a Spiking Diffusion Image Decoder trained on the discrete latent space. They utilize an absorbing state diffusion process, gradually masking the discrete image representation, and the SDID learns to reverse this process, effectively denoising the image. |
Spiking-Diffusion outperforms the current state-of-the-art SNN-based generative model (FSVAE) on various image datasets, including MNIST, FMNIST, KMNIST, Letters, and Cifar10. It demonstrates lower reconstruction error (MSE, SSIM) and better-generated image quality (FID, KID). |
The paper acknowledges the need to explore the training of larger-scale SNN generative models in future work. This suggests scaling up the model and exploring more complex datasets to further validate and improve Spiking-Diffusion's capabilities. |
diffusion_model, snn, image_generation, vq-vae, neuromorphic, energy_efficient, biological_plausibility |
2401.15708 |
Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding |
Jianxiang Lu, Cong Xie, Hui Guo |
As large-scale text-to-image generation models have made remarkable progress
in the field of text-to-image generation, many fine-tuning methods have been
proposed. However, these models often struggle with novel objects, especially
with one-shot scenarios. Our proposed method aims to address the challenges of
generalizability and fidelity in an object-driven way, using only a single
input image and the object-specific regions of interest. To improve
generalizability and mitigate overfitting, in our paradigm, a prototypical
embedding is initialized based on the object's appearance and its class, before
fine-tuning the diffusion model. And during fine-tuning, we propose a
class-characterizing regularization to preserve prior knowledge of object
classes. To further improve fidelity, we introduce an object-specific loss, which
can also be used to implant multiple objects. Overall, our proposed object-driven
method implants new objects that integrate seamlessly with existing concepts
while maintaining high fidelity and generalization. Our method
outperforms several existing works. The code will be released. |
This paper presents a novel object-driven one-shot fine-tuning method for text-to-image diffusion models, enabling the generation of diverse images with specific objects from a single input image and region of interest. |
This paper is significant because it addresses the challenges of limited data and object fidelity in personalized text-to-image generation. It allows for efficient object implantation and diverse image synthesis with high fidelity using only one reference image, advancing the field of content creation. |
The authors leverage prototypical embedding for initialization, class-characterizing regularization to preserve class diversity, and an object-specific loss function to enhance fidelity. They fine-tune a pre-trained Stable Diffusion model using a single image and its object mask, and compare their method with existing techniques through qualitative and quantitative evaluations (a minimal code sketch follows this entry). |
The proposed method outperforms existing one-shot fine-tuning methods in terms of both object fidelity and generalization ability. It effectively mitigates overfitting and allows for the generation of diverse images with the target object while maintaining consistency with text prompts. The method also demonstrates success in multi-object implantation, enabling the creation of compositions with user-specified objects. |
The authors acknowledge limitations in handling objects with complex edges, which can lead to degraded image quality. They also point out that smaller objects may have reduced fidelity in the generated images. Future work will focus on improving mask acquisition methods and incorporating multi-scale perception mechanisms for objects to address these limitations. |
diffusion_model, one-shot, fine-tuning, text-to-image, prototypical_embedding, object-driven, fidelity, generalization |
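For illustration, a minimal sketch of a mask-weighted denoising loss, one plausible instantiation of the object-specific loss described above; the weighting scheme is an assumption rather than the paper's exact formulation.

```python
# Minimal sketch: the usual epsilon-prediction MSE is reweighted so the object's
# region-of-interest dominates, raising subject fidelity inside the mask.
import torch


def object_weighted_loss(eps_pred, eps_target, object_mask, w_object=2.0, w_background=1.0):
    # object_mask: (B, 1, H, W) binary region of interest for the implanted object.
    weights = w_background + (w_object - w_background) * object_mask
    return (weights * (eps_pred - eps_target) ** 2).mean()
```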
2311.10329 |
High-fidelity Person-centric Subject-to-Image Synthesis |
Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin |
Current subject-driven image generation methods encounter significant
challenges in person-centric image generation. The reason is that they learn
the semantic scene and person generation by fine-tuning a common pre-trained
diffusion, which involves an irreconcilable training imbalance. Precisely, to
generate realistic persons, they need to sufficiently tune the pre-trained
model, which inevitably causes the model to forget the rich semantic scene
prior and makes scene generation over-fit to the training data. Moreover, even
with sufficient fine-tuning, these methods still cannot generate high-fidelity
persons, since joint learning of scene and person generation also leads to a
quality compromise. In this paper, we propose Face-diffuser, an effective
collaborative generation pipeline to eliminate the above training imbalance and
quality compromise. Specifically, we first develop two specialized pre-trained
diffusion models, i.e., Text-driven Diffusion Model (TDM) and Subject-augmented
Diffusion Model (SDM), for scene and person generation, respectively. The
sampling process is divided into three sequential stages, i.e., semantic scene
construction, subject-scene fusion, and subject enhancement. The first and last
stages are performed by TDM and SDM, respectively, while the subject-scene fusion
stage is carried out through a novel and highly effective collaboration mechanism,
Saliency-adaptive Noise Fusion (SNF). Specifically, it is based on
our key observation that there exists a robust link between classifier-free
guidance responses and the saliency of generated images. In each time step, SNF
leverages the unique strengths of each model and allows for the spatial
blending of predicted noises from both models automatically in a saliency-aware
manner. Extensive experiments confirm the impressive effectiveness and
robustness of the Face-diffuser. |
This paper introduces Face-diffuser, a novel collaborative generation pipeline for subject-driven text-to-image generation that addresses limitations of existing methods in person-centric image synthesis by employing two specialized diffusion models for enhanced scene and person generation. |
This paper is important because it tackles the training imbalance and quality compromise issues prevalent in current subject-driven image generation models, especially for person-centric synthesis. Face-diffuser's innovative approach enhances the fidelity of generated persons within diverse semantic scenes, advancing the field of personalized image generation. |
The authors propose Face-diffuser, which utilizes two pre-trained diffusion models: TDM for scene generation and SDM for person generation. The generation process involves three stages: initial scene construction using TDM, subject-scene fusion through a novel Saliency-adaptive Noise Fusion (SNF) mechanism, and final subject enhancement by SDM. SNF leverages classifier-free guidance responses to dynamically allocate regions for each model's contribution during synthesis, enabling seamless collaboration (a minimal code sketch follows this entry). |
Face-diffuser demonstrates superior performance in both single- and multi-subject generation tasks, quantitatively outperforming state-of-the-art methods in terms of identity preservation and prompt consistency. Qualitative results showcase its ability to generate high-fidelity, coherent images of individuals within diverse contexts, surpassing baselines in preserving subject details and scene semantics. Ablation studies confirm the efficacy of each stage in the pipeline and the superiority of SNF over simpler fusion techniques. |
Limitations include the potential for privacy concerns due to the close resemblance of generated persons to reference images and challenges in editing attributes of generated individuals. Future work aims to address these limitations and explore attribute editing capabilities. |
diffusion_model, image_generation, subject-driven, person-centric, saliency, collaborative_generation |
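For illustration, a minimal sketch of saliency-adaptive blending of two denoisers' noise predictions, following the mechanism described above; the guidance scale, temperature, and channel-averaged saliency are illustrative assumptions.

```python
# Minimal sketch: each model's per-pixel saliency is taken from the magnitude of its
# classifier-free-guidance response, and the two guided noise predictions are blended
# spatially with a softmax over those saliencies.
import torch


def saliency_adaptive_fusion(eps_scene_cond, eps_scene_uncond, eps_person_cond, eps_person_uncond,
                             guidance_scale=7.5, temperature=1.0):
    eps_scene = eps_scene_uncond + guidance_scale * (eps_scene_cond - eps_scene_uncond)
    eps_person = eps_person_uncond + guidance_scale * (eps_person_cond - eps_person_uncond)
    # Per-pixel saliency from the CFG response magnitude (averaged over channels).
    sal_scene = (eps_scene_cond - eps_scene_uncond).abs().mean(dim=1, keepdim=True)
    sal_person = (eps_person_cond - eps_person_uncond).abs().mean(dim=1, keepdim=True)
    weights = torch.softmax(torch.stack([sal_scene, sal_person], dim=0) / temperature, dim=0)
    return weights[0] * eps_scene + weights[1] * eps_person
```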
2404.03673 |
RL for Consistency Models: Faster Reward Guided Text-to-Image Generation |
Owen Oertell, Jonathan D. Chang, Yiyi Zhang, Kianté Brantley, Wen Sun |
Reinforcement learning (RL) has improved guided image generation with
diffusion models by directly optimizing rewards that capture image quality,
aesthetics, and instruction following capabilities. However, the resulting
generative policies inherit the same iterative sampling process of diffusion
models that causes slow generation. To overcome this limitation, consistency
models proposed learning a new class of generative models that directly map
noise to data, resulting in a model that can generate an image in as few as one
sampling iteration. In this work, to optimize text-to-image generative models
for task specific rewards and enable fast training and inference, we propose a
framework for fine-tuning consistency models via RL. Our framework, called
Reinforcement Learning for Consistency Model (RLCM), frames the iterative
inference process of a consistency model as an RL procedure. RLCM improves upon
RL fine-tuned diffusion models on text-to-image generation capabilities and
trades computation during inference time for sample quality. Experimentally, we
show that RLCM can adapt text-to-image consistency models to objectives that
are challenging to express with prompting, such as image compressibility, and
those derived from human feedback, such as aesthetic quality. Compared to
RL-finetuned diffusion models, RLCM trains significantly faster, improves the
quality of the generation measured under the reward objectives, and speeds up
the inference procedure by generating high quality images with as few as two
inference steps. Our code is available at https://rlcm.owenoertell.com |
This paper introduces RLCM, a novel framework for enhancing text-to-image consistency models by leveraging reinforcement learning to optimize for specific reward functions, resulting in faster training and inference compared to diffusion models. |
The paper addresses limitations in text-to-image generation using diffusion models, such as difficulty in aligning with specific prompts and slow inference speed. It leverages consistency models, which offer faster generation, and proposes an RL-based approach to fine-tune them for better alignment with downstream tasks. |
The authors formulate the iterative inference of a consistency model as a Markov Decision Process (MDP) with a shorter horizon compared to diffusion models. They then optimize the consistency model's policy with a policy-gradient algorithm, maximizing rewards associated with desired image properties. Experiments compare RLCM to DDPO (an RL method for diffusion models) on tasks like image compressibility, aesthetics, and prompt alignment (a minimal code sketch follows this entry). |
RLCM demonstrates faster training and inference than DDPO while achieving comparable or better image quality across various tasks. Notably, RLCM shows a 17x speedup in training time on the aesthetic task. Ablation studies highlight the trade-off between inference time and image quality achievable by adjusting the number of inference steps in RLCM. |
The authors acknowledge limitations such as the use of sparse rewards in the current policy gradient method and suggest exploring dense reward strategies. Future work could also focus on developing loss functions that reinforce consistency, potentially further improving inference speed. |
diffusion_model, consistency_model, rl, text-to-image, inference, optimization, aesthetic, image_generation |
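For illustration, a minimal sketch that treats a short stochastic sampler as an RL policy with explicit Gaussian log-probabilities and a plain REINFORCE objective; the paper uses a clipped policy-gradient variant, so this is a simplification, and all interfaces are assumptions.

```python
# Minimal sketch: each step's stochastic transition gets an explicit Gaussian
# log-probability, and the trajectory's log-probability is weighted by the reward.
import torch


def rlcm_reinforce_loss(consistency_model, reward_fn, prompt, sigmas, noise_std=0.1):
    x = torch.randn(1, 3, 64, 64) * sigmas[0]                # start from pure noise at the largest sigma
    log_prob = 0.0
    for sigma in sigmas:
        mean = consistency_model(x, sigma, prompt)           # deterministic consistency output
        dist = torch.distributions.Normal(mean, noise_std)   # stochastic policy around it
        x = dist.sample()
        log_prob = log_prob + dist.log_prob(x).sum()
    reward = reward_fn(x, prompt)                            # scalar reward on the final image
    return -(reward * log_prob)                              # minimize negative expected reward
```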
2402.14792 |
Consolidating Attention Features for Multi-view Image Editing |
Or Patashnik, Rinon Gal, Daniel Cohen-Or, Jun-Yan Zhu, Fernando De la Torre |
Large-scale text-to-image models enable a wide range of image editing
techniques, using text prompts or even spatial controls. However, applying
these editing methods to multi-view images depicting a single scene leads to
3D-inconsistent results. In this work, we focus on spatial control-based
geometric manipulations and introduce a method to consolidate the editing
process across various views. We build on two insights: (1) maintaining
consistent features throughout the generative process helps attain consistency
in multi-view editing, and (2) the queries in self-attention layers
significantly influence the image structure. Hence, we propose to improve the
geometric consistency of the edited images by enforcing the consistency of the
queries. To do so, we introduce QNeRF, a neural radiance field trained on the
internal query features of the edited images. Once trained, QNeRF can render
3D-consistent queries, which are then softly injected back into the
self-attention layers during generation, greatly improving multi-view
consistency. We refine the process through a progressive, iterative method that
better consolidates queries across the diffusion timesteps. We compare our
method to a range of existing techniques and demonstrate that it can achieve
better multi-view consistency and higher fidelity to the input scene. These
advantages allow us to train NeRFs with fewer visual artifacts, that are better
aligned with the target geometry. |
This paper introduces a method for consistent multi-view image editing, focusing on geometric manipulations like articulations and shape changes using spatial controls and a novel query feature space neural radiance field called QNeRF. |
This work addresses the limitations of existing multi-view image editing techniques that struggle with consistent geometric modifications across multiple views, offering a solution for more realistic and high-fidelity edits. |
The authors leverage ControlNet and a pre-trained Stable Diffusion model to edit images based on spatial controls. They introduce QNeRF, trained on query features from self-attention layers, to progressively consolidate these features during denoising, ensuring consistency across views. |
The proposed method achieves greater visual quality and consistency in multi-view edits compared to baseline methods like InstructNeRF2NeRF and TokenFlow, as demonstrated through qualitative results, KID and FID scores, and user preference evaluations. It allows for training NeRFs with fewer artifacts and better alignment to the target geometry. |
Limitations include difficulties in generating highly detailed structures like hands, potential for hallucinating inconsistent details in complex objects, and reliance on a black-box optimizer for QNeRF training. Future work could explore robust statistics for QNeRF optimization, alternative 3D representations like Gaussian Splats, and addressing the limitations inherited from text-to-image models. |
diffusion_model, nerf, 3d, multi-view, image_editing, geometric_editing, consistency, self-attention |
2402.16828 |
Training Neural Networks from Scratch with Parallel Low-Rank Adapters |
Minyoung Huh, Brian Cheung, Jeremy Bernstein, Phillip Isola, Pulkit Agrawal |
The scalability of deep learning models is fundamentally limited by computing
resources, memory, and communication. Although methods like low-rank adaptation
(LoRA) have reduced the cost of model finetuning, its application in model
pre-training remains largely unexplored. This paper explores extending LoRA to
model pre-training, identifying the inherent constraints and limitations of
standard LoRA in this context. We introduce LoRA-the-Explorer (LTE), a novel
bi-level optimization algorithm designed to enable parallel training of
multiple low-rank heads across computing nodes, thereby reducing the need for
frequent synchronization. Our approach includes extensive experimentation on
vision transformers using various vision datasets, demonstrating that LTE is
competitive with standard pre-training. |
This paper introduces LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm for training neural networks from scratch using parallel low-rank adapters, addressing the limitations of standard low-rank adaptation in model pre-training. |
This paper is important because it tackles the challenge of pre-training large models with limited computing resources by leveraging low-rank adaptations, potentially enabling training on less powerful devices and reducing communication bottlenecks. |
The authors propose LTE, which trains multiple low-rank adapter heads in parallel on different data shards with infrequent synchronization, periodically merging their updates into the main model weights to reduce communication overhead (a minimal code sketch follows this entry). |
LTE demonstrates competitive performance compared to standard pre-training across various vision tasks and datasets, achieving comparable accuracy with potential for memory and communication efficiency. |
Limitations include slower convergence in the later stages of training and the need for further investigation into optimal hyperparameter selection, such as rank and number of heads. Future work involves exploring dynamic rank and head allocation, heterogeneous LoRA parameterization, and advanced merging strategies. |
lora, pre-training, parameter-efficient, distributed_training, vision_transformer, analysis |
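For illustration, a minimal sketch of the periodic merge step for parallel low-rank heads; the shapes, averaging, scaling factor, and reset scheme are illustrative assumptions.

```python
# Minimal sketch: each worker trains its own (B_i, A_i) on a data shard; every T steps
# the averaged low-rank updates are folded into the shared dense weight and the heads
# are reset so they learn fresh residuals.
import torch


def merge_parallel_lora(W, heads, scale=1.0):
    # W: (out_dim, in_dim) shared dense weight; heads: list of (B, A) with B:(out_dim, r), A:(r, in_dim).
    with torch.no_grad():
        delta = sum(B @ A for B, A in heads) / len(heads)
        W += scale * delta
        for B, A in heads:
            B.zero_()                                       # zeroing B makes each head start as a no-op
            torch.nn.init.kaiming_uniform_(A, a=5 ** 0.5)   # re-randomize A, LoRA-style
    return W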
2310.14729 |
MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion |
Roy Kapon, Guy Tevet, Daniel Cohen-Or, Amit H. Bermano |
We introduce Multi-view Ancestral Sampling (MAS), a method for 3D motion
generation, using 2D diffusion models that were trained on motions obtained
from in-the-wild videos. As such, MAS opens opportunities to exciting and
diverse fields of motion previously under-explored as 3D data is scarce and
hard to collect. MAS works by simultaneously denoising multiple 2D motion
sequences representing different views of the same 3D motion. It ensures
consistency across all views at each diffusion step by combining the individual
generations into a unified 3D sequence, and projecting it back to the original
views. We demonstrate MAS on 2D pose data acquired from videos depicting
professional basketball maneuvers, rhythmic gymnastic performances featuring a
ball apparatus, and horse races. In each of these domains, 3D motion capture is
arduous, and yet, MAS generates diverse and realistic 3D sequences. Unlike the
Score Distillation approach, which optimizes each sample by repeatedly applying
small fixes, our method uses a sampling process that was constructed for the
diffusion framework. As we demonstrate, MAS avoids common issues such as
out-of-domain sampling and mode-collapse. https://guytevet.github.io/mas-page/ |
This paper introduces Multi-view Ancestral Sampling (MAS), a novel method for generating 3D human and animal motions using 2D diffusion models trained on in-the-wild videos. |
This research is significant as it allows for 3D motion generation in domains where acquiring 3D data is expensive or impractical, such as basketball, horse racing, and rhythmic gymnastics. |
The authors first train a 2D motion diffusion model on poses extracted from videos. Then, they utilize MAS, which simultaneously generates multiple 2D views of a 3D motion via ancestral sampling, ensuring consistency across views by triangulating the generated 2D poses into a 3D motion at each denoising step. |
MAS successfully generates diverse and realistic 3D motions, outperforming existing pose lifting methods and a DreamFusion adaptation for unconditional motion generation. The method's reliance on ancestral sampling results in faster generation times and avoids common issues like out-of-distribution sampling and mode collapse. |
Limitations include occasional character self-intersection and scale inconsistencies. Future work could address predicting global position, enabling textual control, and extending the method to multi-person interactions, hand and face motions, and complex object manipulations. |
diffusion_model, 3d, motion, video, analysis, motion_generation |
2312.02663 |
FaceStudio: Put Your Face Everywhere in Seconds |
Yuxuan Yan, Chi Zhang, Rui Wang, Yichao Zhou, Gege Zhang, Pei Cheng, Gang Yu, Bin Fu |
This study investigates identity-preserving image synthesis, an intriguing
task in image generation that seeks to maintain a subject's identity while
adding a personalized, stylistic touch. Traditional methods, such as Textual
Inversion and DreamBooth, have made strides in custom image creation, but they
come with significant drawbacks. These include the need for extensive resources
and time for fine-tuning, as well as the requirement for multiple reference
images. To overcome these challenges, our research introduces a novel approach
to identity-preserving synthesis, with a particular focus on human images. Our
model leverages a direct feed-forward mechanism, circumventing the need for
intensive fine-tuning, thereby facilitating quick and efficient image
generation. Central to our innovation is a hybrid guidance framework, which
combines stylized images, facial images, and textual prompts to guide the image
generation process. This unique combination enables our model to produce a
variety of applications, such as artistic portraits and identity-blended
images. Our experimental results, including both qualitative and quantitative
evaluations, demonstrate the superiority of our method over existing baseline
models and previous works, particularly in its remarkable efficiency and
ability to preserve the subject's identity with high fidelity. |
This paper introduces a novel, tuning-free method for identity-preserving image synthesis, focusing on efficiently generating human images in various styles while maintaining individual identities using a hybrid guidance framework combining style images, facial images, and text prompts. |
This paper addresses limitations in existing identity-preserving image synthesis methods, which often require resource-intensive fine-tuning and multiple reference images. The proposed method offers a faster and more efficient alternative by using a direct feed-forward approach and hybrid guidance, enabling diverse applications like artistic portrait creation and identity blending. |
The authors develop a hybrid guidance framework that combines style images, facial images, and text prompts to guide a latent diffusion model. They extract identity features from facial images using Arcface and combine them with text embeddings from a prior model trained to map CLIP text embeddings to vision embeddings. A multi-identity cross-attention mechanism is introduced to handle multiple identities within a single image, ensuring each individual's features are correctly mapped. The model is trained on a human image reconstruction task, using masked images as style input and cropped faces as identity input. |
The proposed method demonstrates superior performance in preserving identities during image synthesis compared to baseline models like DreamBooth and Textual Inversion, achieving higher face similarity scores in both single- and multi-image settings. The ablation study confirms the significance of the identity input for maintaining identity fidelity. The model also exhibits strong performance in novel view synthesis, effectively generating images with large pose changes while preserving identity. Furthermore, the method demonstrates successful identity mixing and multi-human image generation with accurate identity mapping. |
The authors acknowledge that compared to methods like DreamBooth, their model is currently limited to human image generation. As future work, they plan to extend its capabilities to encompass a wider range of subjects, including animals and objects. Additionally, they recognize the ethical considerations and potential for misuse, such as copyright infringement and the creation of inappropriate content. The authors emphasize the importance of responsible use and the establishment of guidelines to mitigate these risks. |
diffusion_model, image_synthesis, identity_preserving, hybrid_guidance, text-to-image, multi-identity, tuning-free, face_recognition, novel_view_synthesis |
2404.05729 |
Finding Visual Task Vectors |
Alberto Hojel, Yutong Bai, Trevor Darrell, Amir Globerson, Amir Bar |
Visual Prompting is a technique for teaching models to perform a visual task
via in-context examples, without any additional training. In this work, we
analyze the activations of MAE-VQGAN, a recent Visual Prompting model, and find
task vectors, activations that encode task-specific information. Equipped with
this insight, we demonstrate that it is possible to identify the task vectors
and use them to guide the network towards performing different tasks without
providing any input-output examples. To find task vectors, we compute the
average intermediate activations per task and use the REINFORCE algorithm to
search for the subset of task vectors. The resulting task vectors guide the
model towards performing a task better than the original model without the need
for input-output examples. |
This paper investigates the existence and identification of "task vectors" in visual prompting models, specifically focusing on MAE-VQGAN. The authors propose a method to identify these task-specific activations and demonstrate that patching them into the model enables zero-shot task performance comparable to or exceeding the original one-shot in-context learning. |
This paper is significant as it sheds light on the inner workings of visual in-context learning, a relatively new and less understood area compared to its NLP counterpart. Identifying and leveraging task vectors could lead to more efficient and adaptable visual prompting models, reducing the reliance on extensive in-context examples. |
The authors first analyze MAE-VQGAN activations to identify potential task vectors by measuring their variance across different tasks and invariance within a task. Then, they employ a REINFORCE algorithm to search for the optimal subset of task vectors that minimize the task-specific loss when patched into the model. They evaluate their method on various image-to-image tasks using the Pascal-5i dataset. |
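A compact sketch of the search step described above: per-task mean activations serve as candidate task vectors, and REINFORCE optimizes a Bernoulli mask over which positions to patch. The `evaluate_patched_model` callback is a hypothetical stand-in for running MAE-VQGAN with the selected activations patched in and returning the task loss; it is not the authors' code.

```python
# Hedged sketch of REINFORCE-based task-vector selection (interfaces assumed).
import torch

def reinforce_task_vector_search(mean_acts, evaluate_patched_model,
                                 n_positions, steps=200, samples=8, lr=0.1):
    # Logits of per-position patching probabilities (one per candidate position).
    logits = torch.zeros(n_positions, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        dist = torch.distributions.Bernoulli(torch.sigmoid(logits))
        masks = dist.sample((samples,))                      # (samples, n_positions)
        # Negative task loss serves as the reward for each sampled subset.
        rewards = torch.tensor([
            -evaluate_patched_model(mean_acts, mask) for mask in masks
        ])
        baseline = rewards.mean()                            # simple variance reduction
        log_probs = dist.log_prob(masks).sum(dim=-1)
        loss = -((rewards - baseline) * log_probs).mean()    # REINFORCE objective
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(logits) > 0.5                       # final mask of positions to patch
```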
The paper shows that task vectors do exist in visual prompting models and can be effectively identified. Patching the identified task vectors allows MAE-VQGAN to perform tasks in a zero-shot manner, achieving comparable or even superior performance to the original one-shot prompting on tasks like foreground segmentation, low-light enhancement, in-painting, and colorization. The results also suggest that task vectors are distributed throughout the encoder and decoder of the network. |
The authors acknowledge limitations in exploring other potential vector types, such as those encoding image structure and positional information. They also point to the possibility of directly evaluating the model in the VQGAN token space for potentially more accurate results. Future work could involve investigating these aspects further, as well as exploring the generalization of task vectors across different datasets and models. |
diffusion_model, visual_prompting, in-context_learning, analysis, task_vectors, zero-shot, mae, vqgan, attention, reinforce |
2310.12274 |
An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning |
Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare |
Textural Inversion, a prompt learning method, learns a singular embedding for
a new "word" to represent image style and appearance, allowing it to be
integrated into natural language sentences to generate novel synthesised
images. However, identifying and integrating multiple object-level concepts
within one scene poses significant challenges even when embeddings for
individual concepts are attainable. This is further confirmed by our empirical
tests. To address this challenge, we introduce a framework for Multi-Concept
Prompt Learning (MCPL), where multiple new "words" are simultaneously learned
from a single sentence-image pair. To enhance the accuracy of word-concept
correlation, we propose three regularisation techniques: Attention Masking
(AttnMask) to concentrate learning on relevant areas; Prompts Contrastive Loss
(PromptCL) to separate the embeddings of different concepts; and Bind adjective
(Bind adj.) to associate new "words" with known words. We evaluate via image
generation, editing, and attention visualisation with diverse images. Extensive
quantitative comparisons demonstrate that our method can learn more
semantically disentangled concepts with enhanced word-concept correlation.
Additionally, we introduce a novel dataset and evaluation protocol tailored for
this new task of learning object-level concepts. |
This paper introduces Multi-Concept Prompt Learning (MCPL), a method for learning multiple textural embeddings (new "words") in text-to-image diffusion models, which represent distinct object-level concepts within a single image. |
This paper addresses a significant limitation in existing textural inversion techniques, which struggle to learn and compose multiple concepts from a single image, hindering their application in complex multi-object editing and generation tasks. |
The authors propose MCPL, building upon Textural Inversion, to jointly learn multiple embeddings by optimizing the diffusion model loss on a single image with multiple learnable prompts. To enhance object-level concept learning, they introduce three regularization techniques: Attention Masking to focus learning on relevant image regions, Prompts Contrastive Loss to separate embeddings of different concepts, and binding learnable prompts with adjectives to leverage pre-trained knowledge. |
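In the spirit of the Prompts Contrastive Loss mentioned above, a minimal regularizer could simply push apart the embeddings of the different learnable "words" so that each captures a distinct concept. The exact loss used in the paper may differ; this is an illustrative sketch only (assumes at least two concepts).

```python
# Hedged sketch of a prompt-separation regularizer for multi-concept prompt learning.
import torch
import torch.nn.functional as F

def prompt_separation_loss(prompt_embeds):
    # prompt_embeds: (n_concepts, dim) -- one embedding per learnable "word".
    z = F.normalize(prompt_embeds, dim=-1)
    sim = z @ z.t()                                        # pairwise cosine similarities
    off_diag = sim - torch.eye(len(z), device=z.device)    # zero out self-similarity
    # Penalize similarity between *different* concepts so their embeddings separate.
    return off_diag.clamp(min=0).sum() / (len(z) * (len(z) - 1))
```

This term would be added to the usual diffusion reconstruction loss, alongside the attention-masking and adjective-binding regularizers described above.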
Experiments on natural and biomedical image datasets demonstrate that MCPL, particularly with all the proposed regularizations, effectively learns disentangled object-level embeddings, outperforming existing techniques in terms of concept separation and fidelity to both text prompts and image regions. The approach enables more accurate object-level synthesis, editing, and understanding of multi-object relationships. |
The paper acknowledges limitations in the estimation of "ground truth" embeddings using masks and suggests exploring alternative evaluation metrics beyond those used for single-concept learning. Future work includes exploring better prompt selection strategies and extending MCPL to handle a larger number of concepts within a scene. |
diffusion_model, textural_inversion, prompt_learning, multi-concept, object-level, attention_mechanism, contrastive_learning, image_generation, image_editing, disentanglement |
2311.12908 |
Diffusion Model Alignment Using Direct Preference Optimization |
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik |
Large language models (LLMs) are fine-tuned using human comparison data with
Reinforcement Learning from Human Feedback (RLHF) methods to make them better
aligned with users' preferences. In contrast to LLMs, human preference learning
has not been widely explored in text-to-image diffusion models; the best
existing approach is to fine-tune a pretrained model using carefully curated
high quality images and captions to improve visual appeal and text alignment.
We propose Diffusion-DPO, a method to align diffusion models to human
preferences by directly optimizing on human comparison data. Diffusion-DPO is
adapted from the recently developed Direct Preference Optimization (DPO), a
simpler alternative to RLHF which directly optimizes a policy that best
satisfies human preferences under a classification objective. We re-formulate
DPO to account for a diffusion model notion of likelihood, utilizing the
evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic
dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model
of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with
Diffusion-DPO. Our fine-tuned base model significantly outperforms both base
SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement
model in human evaluation, improving visual appeal and prompt alignment. We
also develop a variant that uses AI feedback and has comparable performance to
training on human preferences, opening the door for scaling of diffusion model
alignment methods. |
This paper introduces Diffusion-DPO, a new method for aligning text-to-image diffusion models with human preferences by directly optimizing the model on pairwise comparison data, adapting the Direct Preference Optimization (DPO) technique from language models. |
This paper is significant because it bridges the gap in aligning diffusion models to human preferences, similar to advancements made with Large Language Models (LLMs), leading to improved visual appeal and text alignment in generated images. |
The authors adapted DPO to diffusion models by defining a notion of data likelihood under the model and using the evidence lower bound (ELBO) to derive a differentiable objective. They demonstrate Diffusion-DPO by fine-tuning state-of-the-art text-to-image diffusion models like Stable Diffusion XL (SDXL) on the Pick-a-Pic dataset, and evaluating performance through human evaluation and automated metrics. |
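A sketch of what a Diffusion-DPO-style objective looks like in code: the DPO classification loss is applied to per-timestep denoising errors of the preferred ("winner") and dispreferred ("loser") images, measured against a frozen reference model. The `alphas_cumprod` tensor, the model call signatures, and the `beta` value are assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch of a Diffusion-DPO-style loss (interfaces and constants assumed).
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(model, ref_model, x_w, x_l, cond, alphas_cumprod, beta=2000.0):
    # x_w / x_l: latents of the human-preferred and dispreferred images.
    b = x_w.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x_w.device)
    noise = torch.randn_like(x_w)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    xt_w = a_t.sqrt() * x_w + (1 - a_t).sqrt() * noise     # shared noise and timestep
    xt_l = a_t.sqrt() * x_l + (1 - a_t).sqrt() * noise

    def err(net, xt):
        # Per-sample mean squared denoising error.
        return (net(xt, t, cond) - noise).pow(2).mean(dim=(1, 2, 3))

    with torch.no_grad():
        ref_w, ref_l = err(ref_model, xt_w), err(ref_model, xt_l)
    mdl_w, mdl_l = err(model, xt_w), err(model, xt_l)
    # The fine-tuned model should improve (reduce error on) the winner more than
    # the loser, relative to the frozen reference model.
    logits = -beta * ((mdl_w - ref_w) - (mdl_l - ref_l))
    return -F.logsigmoid(logits).mean()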
Diffusion-DPO significantly improves both visual appeal and prompt alignment in generated images, outperforming even the larger SDXL model with a refinement stage. The authors also demonstrate the effectiveness of learning from AI feedback using Diffusion-DPO, offering a potential for scaling this alignment method. |
Limitations include ethical considerations related to potential biases in web-collected data and user preferences. Future work involves dataset cleaning and scaling, online learning methods for DPO, and personalized tuning for individual or group preferences. |
diffusion_model, dpo, alignment, human_preference, image_generation, ai_feedback, stable_diffusion, sdxl |
2310.10971 |
Context-Aware Meta-Learning |
Christopher Fifty, Dennis Duan, Ronald G. Junkins, Ehsan Amid, Jure Leskovec, Christopher Re, Sebastian Thrun |
Large Language Models like ChatGPT demonstrate a remarkable capacity to learn
new concepts during inference without any fine-tuning. However, visual models
trained to detect new objects during inference have been unable to replicate
this ability, and instead either perform poorly or require meta-training and/or
fine-tuning on similar objects. In this work, we propose a meta-learning
algorithm that emulates Large Language Models by learning new visual concepts
during inference without fine-tuning. Our approach leverages a frozen
pre-trained feature extractor, and analogous to in-context learning, recasts
visual meta-learning as sequence modeling over datapoints with known labels and
a test datapoint with an unknown label. On 8 out of 11 meta-learning
benchmarks, our approach -- without meta-training or fine-tuning -- exceeds or
matches the state-of-the-art algorithm, P>M>F, which is meta-trained on these
benchmarks. Our code is available at https://github.com/cfifty/CAML. |
This paper introduces Context-Aware Meta-Learning (CAML), a novel meta-learning algorithm for few-shot image classification that draws inspiration from in-context learning in Large Language Models (LLMs) to learn new visual concepts during inference without fine-tuning. |
This paper is important because it addresses the limitations of existing visual meta-learning algorithms that are either slow due to fine-tuning requirements or exhibit poor generalization to unseen tasks. The proposed CAML method offers a promising solution for real-time and generalizable few-shot image classification, potentially unlocking new applications in computer vision similar to the advancements in natural language processing enabled by in-context learning in LLMs. |
The authors propose a novel meta-learning algorithm, CAML, that leverages a frozen pre-trained feature extractor, an Equal Length and Maximally Equiangular Set (ELMES) class encoder, and a non-causal sequence model. The method encodes images and labels, forming a sequence that is processed by the non-causal sequence model to predict the query image's label. CAML is pre-trained on diverse few-shot image classification tasks, avoiding the need for meta-training or fine-tuning during inference. The authors theoretically demonstrate that using an ELMES class encoder maximizes the model's ability to identify classes within the support set. They evaluate CAML on 11 few-shot image classification benchmarks, comparing its performance against existing meta-learning methods in a universal setting. |
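A rough sketch of the sequence-modeling view described above: frozen image features are paired with class encodings for the labeled support set, the query gets a learned "unknown label" token, and a non-causal transformer predicts the query's class. The ELMES encoder is replaced here by a plain embedding table, and all dimensions are illustrative assumptions.

```python
# Hedged sketch of a CAML-style non-causal sequence classifier (simplified).
import torch
import torch.nn as nn

class CAMLSketch(nn.Module):
    def __init__(self, feat_dim=768, n_classes=5, d_model=768, depth=4):
        super().__init__()
        self.label_embed = nn.Embedding(n_classes + 1, d_model)   # last index = "unknown"
        self.in_proj = nn.Linear(feat_dim + d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)        # non-causal: no attention mask
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, support_feats, support_labels, query_feat):
        # support_feats: (B, K, feat_dim), support_labels: (B, K), query_feat: (B, feat_dim)
        B, K, _ = support_feats.shape
        unknown = torch.full((B, 1), self.label_embed.num_embeddings - 1,
                             dtype=torch.long, device=query_feat.device)
        feats = torch.cat([support_feats, query_feat.unsqueeze(1)], dim=1)
        labels = torch.cat([support_labels, unknown], dim=1)
        tokens = self.in_proj(torch.cat([feats, self.label_embed(labels)], dim=-1))
        out = self.encoder(tokens)            # every token attends to every other token
        return self.head(out[:, -1])          # class logits for the query image
```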
CAML achieves state-of-the-art performance in universal meta-learning, outperforming other baselines on 14 out of 22 evaluation settings. Remarkably, it performs comparably to P>M>F, the current best meta-learning algorithm, on 8 out of 11 benchmarks, even though P>M>F is meta-trained on the specific benchmark datasets. This suggests that visual in-context learning during inference can be as effective as meta-training on in-domain data. The paper also provides analysis showing CAML's capability to dynamically update representations based on the query and support set context, enabling it to perform well on diverse tasks. |
The paper acknowledges limitations in handling highly out-of-distribution images and varying image resolutions. Future work could focus on improving robustness in these areas. Additionally, the current implementation requires knowing the maximum number of classes during pre-training. Exploring methods to overcome this limitation and enable more flexible class handling during inference would be beneficial. |
diffusion_model, llm, analysis, few-shot learning, image classification, meta-learning, in-context learning, universal meta-learning |
2312.13286 |
Generative Multimodal Models are In-Context Learners |
Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang |
The human ability to easily solve multimodal tasks in context (i.e., with
only a few demonstrations or simple instructions), is what current multimodal
systems have largely struggled to imitate. In this work, we demonstrate that
the task-agnostic in-context learning capabilities of large multimodal models
can be significantly enhanced by effective scaling-up. We introduce Emu2, a
generative multimodal model with 37 billion parameters, trained on large-scale
multimodal sequences with a unified autoregressive objective. Emu2 exhibits
strong multimodal in-context learning abilities, even emerging to solve tasks
that require on-the-fly reasoning, such as visual prompting and object-grounded
generation. The model sets a new record on multiple multimodal understanding
tasks in few-shot settings. When instruction-tuned to follow specific
instructions, Emu2 further achieves new state-of-the-art on challenging tasks
such as question answering benchmarks for large multimodal models and
open-ended subject-driven generation. These achievements demonstrate that Emu2
can serve as a base model and general-purpose interface for a wide range of
multimodal tasks. Code and models are publicly available to facilitate future
research. |
The paper introduces Emu2, a 37B-parameter generative multimodal model trained on large-scale multimodal sequences, demonstrating strong in-context learning capabilities in multimodal tasks. |
This work is important as it presents a significant step towards building adaptable and general-purpose multimodal systems capable of solving diverse tasks with minimal task-specific training. |
The authors trained Emu2 using a unified autoregressive objective to predict the next multimodal element (visual embedding or text token) in a sequence, leveraging a large-scale dataset of text, image-text pairs, and interleaved image-text-video data. They further enhance the model for instruction following and controllable visual generation through instruction tuning on dedicated datasets. |
Emu2 achieves state-of-the-art performance on various multimodal benchmarks, including visual question answering, image captioning, and text-to-image generation. It exhibits strong few-shot learning capabilities, improving with more in-context examples. The model also demonstrates emergent abilities like visual prompting and object-grounded generation. |
The authors acknowledge limitations regarding potential biases in training data and the possibility of generating harmful content. Future work includes enhancing robustness, reducing hallucinations, improving fairness, and addressing the performance gap with closed multimodal systems in complex reasoning tasks. |
diffusion_model, llm, analysis, 3d, motion, video, interpretability |
2308.07926 |
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing |
Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, Yujun Shen |
We present the content deformation field CoDeF as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video. With such a design, CoDeF naturally supports lifting image algorithms for video processing, in the sense that one can apply an image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal deformation field. We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training. More importantly, thanks to our lifting strategy that deploys the algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches, and even manage to track non-rigid objects like water and smog. Project page can be found at https://qiuyu96.github.io/CoDeF/. |
This paper introduces Content Deformation Fields (CoDeF), a novel video representation comprising a canonical content field for static content and a temporal deformation field tracking transformations. This representation facilitates applying image algorithms to videos for temporally consistent video processing. |
This paper is important as it bridges the gap between advanced image processing algorithms and video processing, offering a method for temporally consistent video editing and manipulation that surpasses previous techniques in quality and efficiency. |
The authors employ a 2D hash-based image field for the canonical content and a 3D hash-based field for temporal deformation, trained through a rendering pipeline. They introduce techniques like annealed hash encoding and flow-guided consistency loss to ensure semantic correctness and smoothness. The system is evaluated on tasks like video reconstruction, translation, keypoint tracking, object tracking, and super-resolution. |
CoDeF achieves superior video reconstruction quality with a 4.4 dB higher PSNR than Neural Image Atlas and significantly faster training (5 minutes vs. 10 hours). It effectively lifts image algorithms to video tasks, demonstrating superior temporal consistency in video-to-video translation, keypoint tracking on non-rigid objects, and object tracking compared to previous methods. |
The paper acknowledges limitations regarding per-scene optimization, challenges with extreme viewpoint changes, and handling large non-rigid deformations. Future work may explore feed-forward implicit field techniques, 3D prior knowledge integration, and using multiple canonical images to address these limitations. |
diffusion_model, video, motion, video_editing, representation_learning, temporal_consistency |
2308.07863 |
StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models |
Zhizhong Wang, Lei Zhao, Wei Xing |
Content and style (C-S) disentanglement is a fundamental problem and critical
challenge of style transfer. Existing approaches based on explicit definitions
(e.g., Gram matrix) or implicit learning (e.g., GANs) are neither interpretable
nor easy to control, resulting in entangled representations and less satisfying
results. In this paper, we propose a new C-S disentangled framework for style
transfer without using previous assumptions. The key insight is to explicitly
extract the content information and implicitly learn the complementary style
information, yielding interpretable and controllable C-S disentanglement and
style transfer. A simple yet effective CLIP-based style disentanglement loss
coordinated with a style reconstruction prior is introduced to disentangle C-S
in the CLIP image space. By further leveraging the powerful style removal and
generative ability of diffusion models, our framework achieves results superior to the state of the art, together with flexible C-S disentanglement and trade-off control.
Our work provides new insights into the C-S disentanglement in style transfer
and demonstrates the potential of diffusion models for learning
well-disentangled C-S characteristics. |
This paper presents StyleDiffusion, a novel content-style disentangled framework for artistic style transfer that leverages diffusion models for explicit content extraction and implicit style learning, enabling interpretable and controllable style transfer. |
This paper is significant as it addresses limitations of existing style transfer methods that rely on explicit style definitions (e.g., Gram matrix) or implicit learning (e.g., GANs) which often result in entangled representations. The proposed method achieves superior style transfer results with better content preservation, fine style details, and flexible disentanglement control. |
The authors introduce a diffusion-based style removal module to extract domain-aligned content information and a diffusion-based style transfer module to learn disentangled style from a single style image. A CLIP-based style disentanglement loss, combined with a style reconstruction prior, is used to guide the learning process in the CLIP image space. |
StyleDiffusion demonstrates impressive qualitative and quantitative results, outperforming SOTA methods in terms of content preservation (SSIM), style similarity (CLIP Score), and user preference. The framework offers flexible control over content-style disentanglement and trade-off at both training and testing stages by adjusting diffusion model parameters. It also exhibits potential for extensions such as photo-realistic style transfer, multi-modal style manipulation, and diversified style transfer. |
Limitations include the requirement for fine-tuning for each style, relatively slower inference due to diffusion models, and some failure cases like vanishing salient content or biased color distribution. Future work includes exploring arbitrary style transfer, accelerating diffusion sampling, and addressing the identified failure cases. Additionally, applying the framework to other image translation and manipulation tasks is another potential direction. |
diffusion_model, style_transfer, disentanglement, clip, analysis, image_manipulation, photorealistic, multi-modal |
2309.14564 |
Generative Escher Meshes |
Noam Aigerman, Thibault Groueix |
This paper proposes a fully-automatic, text-guided generative method for
producing periodic, repeating, tile-able 2D art, such as the one seen on
floors, mosaics, ceramics, and the work of M.C. Escher. In contrast to the
standard concept of a seamless texture, i.e., square images that are seamless
when tiled, our method generates non-square tilings which consist solely of
repeating copies of the same object. It achieves this by optimizing both
geometry and color of a 2D mesh, in order to generate a non-square tile in the
shape and appearance of the desired object, with close to no additional
background details. We enable geometric optimization of tilings by our key
technical contribution: an unconstrained, differentiable parameterization of
the space of all possible tileable shapes for a given symmetry group. Namely,
we prove that modifying the Laplacian used in a 2D mesh-mapping technique -
Orbifold Tutte Embedding - can achieve all possible tiling configurations for a
chosen planar symmetry group. We thus consider both the mesh's tile-shape and
its texture as optimizable parameters, rendering the textured mesh via a
differentiable renderer. We leverage a trained image diffusion model to define
a loss on the resulting image, thereby updating the mesh's parameters based on
its appearance matching the text prompt. We show our method is able to produce
plausible, appealing results, with non-trivial tiles, for a variety of
different periodic tiling patterns. |
This paper presents a novel method for generating tileable, non-square 2D art, similar to the works of M.C. Escher, by combining mesh deformation, texture optimization, and text-guided diffusion models. |
The ability to automatically generate appealing and complex tiling patterns has significant implications for various fields, including art, design, and architecture, while also offering a new approach to exploring the space of tileable shapes. |
The authors represent the tile as a textured 2D mesh and leverage Orbifold Tutte Embeddings (OTE) to ensure tileability while optimizing mesh vertices. They use a differentiable renderer to generate an image of the tile and apply Score Distillation Sampling (SDS) with a pre-trained diffusion model to guide the optimization towards matching a user-provided text prompt. |
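A compact sketch of one SDS update on the differentiably rendered tile, using the standard surrogate-loss trick so the score gradient flows back to the mesh and texture parameters. The `render_tile`, `encode`, and `unet` callables are hypothetical placeholders; the weighting choice is a common one, not necessarily the paper's.

```python
# Hedged sketch of a Score Distillation Sampling (SDS) step on a rendered tile.
import torch

def sds_step(render_tile, encode, unet, text_emb, alphas_cumprod, optimizer):
    image = render_tile()                          # differentiable render, (1, 3, H, W)
    latents = encode(image)                        # map into the diffusion latent space
    t = torch.randint(20, 980, (1,), device=latents.device)
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1 - a_t).sqrt() * noise
    with torch.no_grad():
        noise_pred = unet(noisy, t, text_emb)      # text-conditioned noise prediction
    w = 1.0 - a_t                                  # common SDS weighting choice
    grad = w * (noise_pred - noise)
    # Surrogate loss whose gradient w.r.t. mesh/texture parameters equals
    # grad * d(latents)/d(params), i.e. the SDS update direction.
    loss = (grad.detach() * latents).sum()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```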
The method successfully produces a wide variety of compelling tileable shapes with different symmetries, demonstrating its ability to generate complex and plausible imagery from text prompts while adhering to strict geometric constraints. |
Limitations include the restriction to wallpaper group tilings, difficulty in generating complex multi-object scenes, and reliance on SDS, which has limitations in speed, color saturation, and controllability. Future work could explore extensions to aperiodic tilings, multi-object tile generation, and integration with more advanced text-guided image generation techniques. |
diffusion_model, 2d, tiling, mesh, generative, text-guided, ote, sds |
2312.04410 |
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models |
Jiayi Guo, Xingqian Xu, Yifan Pu, Zanlin Ni, Chaofei Wang, Manushree Vasu, Shiji Song, Gao Huang, Humphrey Shi |
Recently, diffusion models have made remarkable progress in text-to-image
(T2I) generation, synthesizing images with high fidelity and diverse contents.
Despite this advancement, latent space smoothness within diffusion models
remains largely unexplored. Smooth latent spaces ensure that a perturbation on
an input latent corresponds to a steady change in the output image. This
property proves beneficial in downstream tasks, including image interpolation,
inversion, and editing. In this work, we expose the non-smoothness of diffusion
latent spaces by observing noticeable visual fluctuations resulting from minor
latent variations. To tackle this issue, we propose Smooth Diffusion, a new
category of diffusion models that can be simultaneously high-performing and
smooth. Specifically, we introduce Step-wise Variation Regularization to
enforce that the ratio between the variation of an arbitrary input latent and that of the output image remains constant at any diffusion training step. In
addition, we devise an interpolation standard deviation (ISTD) metric to
effectively assess the latent space smoothness of a diffusion model. Extensive
quantitative and qualitative experiments demonstrate that Smooth Diffusion
stands out as a more desirable solution not only in T2I generation but also
across various downstream tasks. Smooth Diffusion is implemented as a
plug-and-play Smooth-LoRA to work with various community models. Code is
available at https://github.com/SHI-Labs/Smooth-Diffusion. |
This paper introduces Smooth Diffusion, a novel diffusion model architecture that aims to improve the smoothness of the latent space in text-to-image generation tasks for enhanced performance in downstream tasks like image interpolation, inversion, and editing. |
This paper is important because it addresses the limitations of current diffusion models in terms of latent space smoothness, which hinder the quality of downstream tasks. By proposing Smooth Diffusion with a novel regularization technique, this work paves the way for higher-quality and more controllable image generation and manipulation. |
The authors propose Smooth Diffusion, which introduces Step-wise Variation Regularization to enforce a constant ratio between variations in input latent code and the output image at every training step. They train Smooth Diffusion on top of Stable Diffusion using the LAION Aesthetics 6.5+ dataset and a LoRA fine-tuning technique. To assess the smoothness, they propose a new metric, Interpolation Standard Deviation (ISTD), and compare Smooth Diffusion with Stable Diffusion and other state-of-the-art methods on various downstream tasks qualitatively and quantitatively using metrics such as FID, CLIP Score, MSE, LPIPS, SSIM, and PSNR. |
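One way to implement a step-wise variation regularizer of this kind is sketched below: perturb the noisy latent slightly, measure how much the model's predicted clean image changes, and penalize deviation of the output-to-input variation ratio from a constant. The paper's exact formulation may differ; names and scales are assumptions.

```python
# Hedged sketch of a step-wise variation regularizer for latent-space smoothness.
import torch

def variation_regularizer(predict_x0, x_t, t, text_emb, delta_scale=1e-2, target_ratio=1.0):
    # predict_x0(x_t, t, text_emb) -> the model's estimate of the clean latent/image.
    delta = delta_scale * torch.randn_like(x_t)
    x0_a = predict_x0(x_t, t, text_emb)
    x0_b = predict_x0(x_t + delta, t, text_emb)
    out_var = (x0_b - x0_a).flatten(1).norm(dim=1)   # how much the output moved
    in_var = delta.flatten(1).norm(dim=1)            # how much the input moved
    ratio = out_var / (in_var + 1e-8)
    # Encourage a steady, proportional response to latent perturbations.
    return ((ratio - target_ratio) ** 2).mean()
```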
Smooth Diffusion demonstrates significantly smoother latent space interpolation compared to Stable Diffusion, evidenced by lower ISTD scores and smoother visual transitions. Furthermore, Smooth Diffusion shows superior performance in image inversion and reconstruction, particularly when using DDIM inversion, and achieves better preservation of unedited content in both text-based and drag-based image editing tasks. |
The authors acknowledge that the effectiveness of the Smooth Diffusion's LoRA component, while adaptable to other models with the same architecture as Stable Diffusion, is not guaranteed and requires further investigation. Additionally, the paper suggests exploring the application of Smooth Diffusion to more challenging tasks, such as video generation, as a potential area for future work. |
diffusion_model, text-to-image, latent_space, smoothness, image_interpolation, image_inversion, image_editing, lora |
2311.03335 |
Cross-Image Attention for Zero-Shot Appearance Transfer |
Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or |
Recent advancements in text-to-image generative models have demonstrated a
remarkable ability to capture a deep semantic understanding of images. In this
work, we leverage this semantic knowledge to transfer the visual appearance
between objects that share similar semantics but may differ significantly in
shape. To achieve this, we build upon the self-attention layers of these
generative models and introduce a cross-image attention mechanism that
implicitly establishes semantic correspondences across images. Specifically,
given a pair of images -- one depicting the target structure and the other
specifying the desired appearance -- our cross-image attention combines the
queries corresponding to the structure image with the keys and values of the
appearance image. This operation, when applied during the denoising process,
leverages the established semantic correspondences to generate an image
combining the desired structure and appearance. In addition, to improve the
output image quality, we harness three mechanisms that either manipulate the
noisy latent codes or the model's internal representations throughout the
denoising process. Importantly, our approach is zero-shot, requiring no
optimization or training. Experiments show that our method is effective across
a wide range of object categories and is robust to variations in shape, size,
and viewpoint between the two input images. |
This paper presents a zero-shot approach for transferring visual appearance between objects in different images, leveraging the semantic knowledge encoded within pretrained text-to-image diffusion models. |
This paper is significant because it offers a novel method for appearance transfer that doesn't require training a new model or per-image optimization, unlike existing approaches. It leverages the power of pretrained diffusion models and their ability to capture semantic correspondences between images, even across different object categories. |
The authors introduce a 'Cross-Image Attention' mechanism that replaces the standard self-attention layers within the denoising network of a diffusion model. By combining queries from the structure image with keys and values from the appearance image, the model implicitly learns to transfer visual features. To improve the transfer quality, they employ techniques like attention map contrasting, appearance guidance, and AdaIN normalization. |
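The core operation can be sketched as an attention call where queries come from the structure image's features and keys/values from the appearance image's features; in practice the pretrained layer's own projection weights would be reused. The simple temperature-style sharpening below stands in for the paper's attention-map contrasting and is only an illustrative assumption.

```python
# Hedged sketch of cross-image attention between two denoising passes.
import torch

def cross_image_attention(q_struct, k_app, v_app, num_heads=8, contrast=1.0):
    # q_struct, k_app, v_app: (B, N, C) features from the structure / appearance passes.
    B, N, C = q_struct.shape
    d = C // num_heads

    def split(x):  # (B, N, C) -> (B, heads, N, d)
        return x.view(B, N, num_heads, d).transpose(1, 2)

    q, k, v = split(q_struct), split(k_app), split(v_app)
    attn = (q @ k.transpose(-2, -1)) / d ** 0.5
    # Sharpen the maps so each structure query commits to fewer appearance
    # locations (a stand-in for the contrasting mechanism mentioned above).
    attn = (attn * contrast).softmax(dim=-1)
    out = attn @ v
    return out.transpose(1, 2).reshape(B, N, C)
```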
The paper demonstrates high-quality appearance transfer results across various object domains, including challenging cases with variations in object shape, viewpoint, and even different object categories. Qualitative and quantitative comparisons with existing techniques like Swapping Autoencoders, SpliceVIT, and DiffuseIT show that their method achieves a better balance between structure preservation and accurate appearance transfer. A user study further confirms these findings, highlighting the superior quality and appearance fidelity of the generated images. |
The authors acknowledge limitations related to the model's ability to establish accurate correspondences, especially between semantically dissimilar objects. Additionally, the success of the transfer relies on accurate inversion of input images into the diffusion model's latent space, which can be sensitive to the inversion process and random seeds. Future work could focus on improving the robustness of cross-domain transfer and enhancing the inversion techniques for more reliable and editable latent codes. |
diffusion_model, appearance_transfer, semantic_correspondence, zero-shot, image_manipulation, self-attention, denoising_diffusion_model |
2405.04404 |
Vision Mamba: A Comprehensive Survey and Taxonomy |
Xiao Liu, Chenxu Zhang, Lei Zhang |
State Space Model (SSM) is a mathematical model used to describe and analyze
the behavior of dynamic systems. This model has witnessed numerous applications
in several fields, including control theory, signal processing, economics and
machine learning. In the field of deep learning, state space models are used to
process sequence data, such as time series analysis, natural language
processing (NLP) and video understanding. By mapping sequence data to state
space, long-term dependencies in the data can be better captured. In
particular, modern SSMs have shown strong representational capabilities in NLP,
especially in long sequence modeling, while maintaining linear time complexity.
Notably, based on the latest state-space models, Mamba merges time-varying
parameters into SSMs and formulates a hardware-aware algorithm for efficient
training and inference. Given its impressive efficiency and strong long-range
dependency modeling capability, Mamba is expected to become a new AI
architecture that may outperform Transformer. Recently, a number of works have
attempted to study the potential of Mamba in various fields, such as general
vision, multi-modal, medical image analysis and remote sensing image analysis,
by extending Mamba from natural language domain to visual domain. To fully
understand Mamba in the visual domain, we conduct a comprehensive survey and
present a taxonomy study. This survey focuses on Mamba's application to a
variety of visual tasks and data types, and discusses its predecessors, recent
advances and far-reaching impact on a wide range of domains. Since Mamba is now
on an upward trend, please notify us if you have new findings, and new
progress on Mamba will be included in this survey in a timely manner and
updated on the Mamba project at
https://github.com/lx6c78/Vision-Mamba-A-Comprehensive-Survey-and-Taxonomy. |
This paper presents a comprehensive survey of Mamba, a novel deep learning architecture based on state space models (SSMs), and its applications in various computer vision tasks. |
This survey is important because it provides a timely and comprehensive overview of Mamba, which is rapidly gaining traction in the computer vision community as a more efficient alternative to Transformers and CNNs, particularly for processing long sequences and high-resolution images. |
The authors conduct their research by reviewing existing literature on Mamba and categorizing its variants based on their application in different vision tasks, including general vision, multi-modal learning, and vertical domains like remote sensing and medical image analysis. |
The paper highlights the successful implementation of Mamba across a wide spectrum of vision tasks, showcasing its superior performance in terms of efficiency, accuracy, and memory usage compared to traditional architectures. Key results include state-of-the-art performance achieved by Mamba variants in image classification, object detection, semantic segmentation, image restoration, 3D vision, and multi-modal tasks. |
The authors identify several limitations and future research directions for Mamba, including the need for new scanning mechanisms to better handle the non-causal nature of visual data, the exploration of synergistic hybrid architectures combining Mamba with other approaches like Transformers, the development of large-scale Mamba models, and its integration with other methodologies such as diffusion models and domain generalization. |
state_space_model, mamba, computer_vision, image_classification, object_detection, semantic_segmentation, image_restoration, 3d, multi-modal, remote_sensing, medical_image_analysis, survey, literature_review |
2403.12931 |
You Only Sample Once: Taming One-Step Text-To-Image Synthesis by Self-Cooperative Diffusion GANs |
Yihong Luo, Xiaolong Chen, Jing Tang |
We introduce YOSO, a novel generative model designed for rapid, scalable, and
high-fidelity one-step image synthesis. This is achieved by integrating the
diffusion process with GANs. Specifically, we smooth the distribution by the
denoising generator itself, performing self-cooperative learning. We show that
our method can serve as a one-step generation model trained from scratch with
competitive performance. Moreover, we show that our method can be extended to
finetune pre-trained text-to-image diffusion for high-quality one-step
text-to-image synthesis even with LoRA fine-tuning. In particular, we provide
the first diffusion transformer that can generate images in one step trained on
512 resolution, with the capability of adapting to 1024 resolution without
explicit training. Our code is provided at https://github.com/Luo-Yihong/YOSO. |
This paper introduces YOSO, a novel one-step image synthesis model that integrates diffusion models with Generative Adversarial Networks (GANs) for rapid, scalable, and high-fidelity image generation. |
This paper is important because it addresses the limitations of traditional diffusion models, which require iterative denoising and suffer from slow generation speed. YOSO offers a solution by enabling one-step generation without compromising image quality, making it highly relevant for practical applications. |
The authors propose a self-cooperative learning approach where the generator learns from itself by matching the distribution of generated samples at different levels of corruption. They also introduce several techniques for text-to-image generation, including latent perceptual loss, latent discriminator, and fixing the noise scheduler. |
YOSO achieves competitive performance on unconditional image generation, outperforming other one-step methods and even rivaling multi-step diffusion models. In text-to-image generation, YOSO demonstrates superior image quality, prompt alignment, and mode coverage compared to state-of-the-art one-step models like SD-Turbo and SDXL-Turbo. Notably, YOSO-LoRA, a fine-tuned version, achieves impressive results with only LoRA fine-tuning, showcasing its efficiency. Furthermore, YOSO exhibits promising compatibility with downstream tasks such as image-to-image editing and ControlNet. |
The authors acknowledge limitations in fine-tuning on datasets different from the pre-trained model's training set, leading to distribution shift. They suggest training on larger and more diverse datasets like LAION to address this issue. Additionally, exploring more advanced noise scheduler adaptation techniques and expanding YOSO's application in various downstream tasks are highlighted as future work. |
diffusion_model, gan, image_synthesis, one-step_generation, text-to-image, lora, self-cooperative_learning, latent_perceptual_loss, latent_discriminator |
2311.15127 |
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets |
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach |
We present Stable Video Diffusion - a latent video diffusion model for
high-resolution, state-of-the-art text-to-video and image-to-video generation.
Recently, latent diffusion models trained for 2D image synthesis have been
turned into generative video models by inserting temporal layers and finetuning
them on small, high-quality video datasets. However, training methods in the
literature vary widely, and the field has yet to agree on a unified strategy
for curating video data. In this paper, we identify and evaluate three
different stages for successful training of video LDMs: text-to-image
pretraining, video pretraining, and high-quality video finetuning. Furthermore,
we demonstrate the necessity of a well-curated pretraining dataset for
generating high-quality videos and present a systematic curation process to
train a strong base model, including captioning and filtering strategies. We
then explore the impact of finetuning our base model on high-quality data and
train a text-to-video model that is competitive with closed-source video
generation. We also show that our base model provides a powerful motion
representation for downstream tasks such as image-to-video generation and
adaptability to camera motion-specific LoRA modules. Finally, we demonstrate
that our model provides a strong multi-view 3D-prior and can serve as a base to
finetune a multi-view diffusion model that jointly generates multiple views of
objects in a feedforward fashion, outperforming image-based methods at a
fraction of their compute budget. We release code and model weights at
https://github.com/Stability-AI/generative-models . |
This paper introduces Stable Video Diffusion (SVD), a latent diffusion model for generating high-resolution videos from text or image prompts. |
This paper addresses the lack of focus on data selection in video generation research by demonstrating the significant impact of systematic data curation on the quality of generated videos, leading to state-of-the-art results in text-to-video and image-to-video synthesis. |
The authors develop a three-stage training strategy: 1) Image pretraining using Stable Diffusion 2.1, 2) Video pretraining on a large, curated dataset at low resolution, and 3) High-resolution video finetuning on a smaller, high-quality dataset. They also employ techniques like EDM-preconditioning, classifier-free guidance, and temporal attention layers. |
The resulting SVD model excels at generating high-resolution videos from text and image prompts, outperforming existing models in quality and motion representation. It also demonstrates strong multi-view consistency, making it suitable for multi-view synthesis with superior results compared to specialized methods like Zero123XL and SyncDreamer. |
While successful in short video generation, SVD faces limitations in long-form video synthesis due to computational costs and occasional lack of motion in generated videos. Future work could explore cascaded frame generation, dedicated video tokenizers, and diffusion distillation for faster inference and long-form generation. |
diffusion_model, video, text-to-video, image-to-video, 3d, motion, multi-view, data_curation |
2405.04795 |
Variational Schrödinger Diffusion Models |
Wei Deng, Weijian Luo, Yixin Tan, Marin Biloš, Yu Chen, Yuriy Nevmyvaka, Ricky T. Q. Chen |
Schr\"odinger bridge (SB) has emerged as the go-to method for optimizing
transportation plans in diffusion models. However, SB requires estimating the
intractable forward score functions, inevitably resulting in the costly
implicit training loss based on simulated trajectories. To improve the
scalability while preserving efficient transportation plans, we leverage
variational inference to linearize the forward score functions (variational
scores) of SB and restore simulation-free properties in training backward
scores. We propose the variational Schr\"odinger diffusion model (VSDM), where
the forward process is a multivariate diffusion and the variational scores are
adaptively optimized for efficient transport. Theoretically, we use stochastic
approximation to prove the convergence of the variational scores and show the
convergence of the adaptively generated samples based on the optimal
variational scores. Empirically, we test the algorithm in simulated examples
and observe that VSDM is efficient in generations of anisotropic shapes and
yields straighter sample trajectories compared to the single-variate diffusion.
We also verify the scalability of the algorithm in real-world data and achieve
competitive unconditional generation performance in CIFAR10 and conditional
generation in time series modeling. Notably, VSDM no longer depends on warm-up
initializations and has become tuning-friendly in training large-scale
experiments. |
This paper presents Variational Schr\"odinger Diffusion Model (VSDM), a novel diffusion model that leverages variational inference to enhance the scalability of Schr\"odinger bridge (SB) for optimizing transportation plans, while preserving efficient transport. |
While SB offers optimal transport guarantees, it faces scalability limitations due to the need for costly simulated trajectories. VSDM overcomes this by linearizing forward score functions, leading to closed-form updates and enabling simulation-free training of backward score functions. This enhances scalability and makes the algorithm more tuning-friendly for large-scale experiments. |
The authors employ variational inference to approximate the forward score function in SB using a locally linear function, leading to the variational FB-SDE. They then utilize a multivariate OU process for the forward diffusion and derive closed-form expressions for the backward score function. They also use stochastic approximation to adaptively optimize the variational score for efficient transport. |
VSDM demonstrates effectiveness in generating anisotropic shapes and produces straighter sample trajectories, indicating more efficient transport, compared to single-variate diffusions. It achieves competitive performance in image generation on CIFAR10 and conditional time series modeling, all without relying on warm-up initializations. Furthermore, VSDM is observed to be significantly faster than the original SB with nonlinear forward scores. |
The paper acknowledges that linearizing the forward score function inevitably results in sub-optimal transport in general cases. Future work includes exploring critically damped (momentum) acceleration and Hessian approximations to develop advanced optimization techniques akin to "ADAM" for diffusion models. |
diffusion_model, optimal_transport, variational_inference, stochastic_approximation, schrodinger_bridge, simulation-free, image_generation, time_series_forecasting |
2311.13231 |
Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model |
Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, Xiu Li |
Using reinforcement learning with human feedback (RLHF) has shown significant
promise in fine-tuning diffusion models. Previous methods start by training a
reward model that aligns with human preferences, then leverage RL techniques to
fine-tune the underlying models. However, crafting an efficient reward model
demands extensive datasets, optimal architecture, and manual hyperparameter
tuning, making the process both time and cost-intensive. The direct preference
optimization (DPO) method, effective in fine-tuning large language models,
eliminates the necessity for a reward model. However, the extensive GPU memory
requirement of the diffusion model's denoising process hinders the direct
application of the DPO method. To address this issue, we introduce the Direct
Preference for Denoising Diffusion Policy Optimization (D3PO) method to
directly fine-tune diffusion models. The theoretical analysis demonstrates that
although D3PO omits training a reward model, it effectively functions as the
optimal reward model trained using human feedback data to guide the learning
process. This approach requires no training of a reward model, proving to be
more direct, cost-effective, and minimizing computational overhead. In
experiments, our method uses the relative scale of objectives as a proxy for
human preference, delivering comparable results to methods using ground-truth
rewards. Moreover, D3PO demonstrates the ability to reduce image distortion
rates and generate safer images, overcoming challenges in settings that lack robust reward
models. Our code is publicly available at https://github.com/yk7333/D3PO. |
This paper introduces D3PO, a novel method for directly fine-tuning diffusion models using human feedback without relying on a separate reward model, addressing the limitations of traditional RLHF methods in this domain. |
This research is important because it offers a more efficient and cost-effective approach to aligning diffusion models with human preferences, potentially impacting diverse applications like image generation, by eliminating the resource-intensive task of training a separate reward model. |
The authors reinterpret the denoising process of diffusion models as a multi-step Markov Decision Process (MDP). They then extend the Direct Preference Optimization (DPO) framework, originally designed for Large Language Models, to this MDP. This allows them to directly update the model's policy based on human preferences, bypassing the need for a reward model. |
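A rough sketch of what a per-step D3PO-style update could look like: each denoising transition is treated as an MDP action whose log-probability is a Gaussian likelihood around the predicted mean, and a DPO loss favors the transitions of the human-preferred sample over the dispreferred one. Interfaces and the `beta` value are assumptions, not the authors' implementation.

```python
# Hedged sketch of a DPO-style loss on a single denoising transition (D3PO spirit).
import math
import torch
import torch.nn.functional as F

def gaussian_logprob(x, mean, std):
    # Diagonal Gaussian log-likelihood summed over all latent dimensions.
    std = torch.as_tensor(std, dtype=x.dtype, device=x.device)
    return (-0.5 * ((x - mean) / std) ** 2 - std.log()
            - 0.5 * math.log(2 * math.pi)).flatten(1).sum(1)

def d3po_step_loss(pm_w, pm_l, rm_w, rm_l, std, x_next_w, x_next_l, beta=0.1):
    # pm_* / rm_*: policy / frozen-reference predicted means for the preferred (w)
    # and dispreferred (l) trajectories at this step; x_next_*: the latents
    # actually sampled there; std: the step's sampling standard deviation.
    logr_w = gaussian_logprob(x_next_w, pm_w, std) - gaussian_logprob(x_next_w, rm_w, std)
    logr_l = gaussian_logprob(x_next_l, pm_l, std) - gaussian_logprob(x_next_l, rm_l, std)
    # DPO objective on the per-step log-probability ratios.
    return -F.logsigmoid(beta * (logr_w - logr_l)).mean()
```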
D3PO demonstrated comparable or superior performance to methods relying on reward models in tasks like image compressibility and aesthetic quality. It also proved effective in challenging scenarios without a reward model, successfully reducing image distortions, enhancing image safety, and improving prompt-image alignment. |
The paper acknowledges the limitations stemming from assumptions like the normality of expected return and the use of relative reward sizes. Future work may explore relaxing these assumptions and investigating the effectiveness of D3PO in more complex real-world applications. |
diffusion_model, rlhf, dpo, image_generation, human_feedback, image_quality, safety, prompt-image_alignment |
2404.07554 |
CAT: Contrastive Adapter Training for Personalized Image Generation |
Jae Wan Park, Sang Hyun Park, Jun Young Koh, Junha Lee, Min Song |
The emergence of various adapters, including Low-Rank Adaptation (LoRA)
applied from the field of natural language processing, has allowed diffusion
models to personalize image generation at a low cost. However, due to the
various challenges including limited datasets and shortage of regularization
and computation resources, adapter training often results in unsatisfactory
outcomes, leading to the corruption of the backbone model's prior knowledge.
One well-known symptom is the loss of diversity in object generation, especially within the same class, where the model produces almost identical objects with only minor variations, limiting its generation capabilities. To solve this issue, we present Contrastive Adapter Training
(CAT), a simple yet effective strategy to enhance adapter training through the
application of CAT loss. Our approach facilitates the preservation of the base
model's original knowledge when the model initiates adapters. Furthermore, we
introduce the Knowledge Preservation Score (KPS) to evaluate CAT's ability to retain the base model's prior knowledge. We demonstrate CAT's improvements both qualitatively and quantitatively. Finally, we discuss CAT's potential for multi-concept adapters and further optimization. |
This paper introduces CAT (Contrastive Adapter Training), a method for personalized image generation using diffusion models that leverages a contrastive loss function to preserve the base model's knowledge while training adapters, improving upon existing methods like LoRA and Dreambooth. |
The paper addresses the limitations of current personalized image generation techniques, which often lead to knowledge corruption and underfitting in diffusion models, by proposing a novel training pipeline that combines contrastive learning with adapter training, resulting in better preservation of the original model's capabilities and more diverse and controllable generation. |
The authors propose CAT, which adds a contrastive loss term to the adapter training objective. This loss encourages the adapted model's noise predictions to be similar to the original model's predictions when no trigger token is present, ensuring the preservation of the base model's knowledge. The method is evaluated using established metrics like prompt similarity and identity similarity, alongside a newly introduced metric called Knowledge Preservation Score (KPS) to quantify knowledge retention. |
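A minimal sketch of such a training objective: the usual diffusion reconstruction loss on personalization data (with the trigger token), plus a preservation term that ties the adapted model's noise prediction back to the frozen base model when the trigger token is absent. Weighting and interfaces are illustrative assumptions.

```python
# Hedged sketch of a CAT-style objective: personalization + knowledge preservation.
import torch
import torch.nn.functional as F

def cat_loss(adapted_unet, base_unet, x_t, t, noise,
             cond_with_trigger, cond_without_trigger, lambda_cat=1.0):
    # Standard personalization term: reconstruct noise for the target concept.
    pred = adapted_unet(x_t, t, cond_with_trigger)
    recon = F.mse_loss(pred, noise)
    # Preservation term: without the trigger token, the adapter should leave the
    # base model's behavior untouched, protecting its prior knowledge.
    with torch.no_grad():
        base_pred = base_unet(x_t, t, cond_without_trigger)
    pred_plain = adapted_unet(x_t, t, cond_without_trigger)
    preserve = F.mse_loss(pred_plain, base_pred)
    return recon + lambda_cat * preserve
```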
CAT outperforms existing adapter training methods in preserving the original model’s knowledge while achieving comparable identity generation fidelity. This is demonstrated through quantitative results using metrics like KPS and qualitative comparisons of generated images, showcasing CAT's ability to maintain diversity and avoid mode collapse. |
The paper acknowledges limitations in evaluating diversity and fidelity due to the instability of CLIP-based scores and the lack of investigation into the impact of domain discrepancies between the model and training data. Future work aims to establish a reliable benchmark for consistent character generation, explore the impact of CAT's structure and application more thoroughly, and expand CAT to support multi-concept training with per-token loss for enhanced multi-concept generation. |
diffusion_model, adapter, lora, dreambooth, personalization, image_generation, contrastive_learning, knowledge_preservation |
2312.02116 |
GIVT: Generative Infinite-Vocabulary Transformers |
Michael Tschannen, Cian Eastwood, Fabian Mentzer |
We introduce generative infinite-vocabulary transformers (GIVT) which
generate vector sequences with real-valued entries, instead of discrete tokens
from a finite vocabulary. To this end, we propose two surprisingly simple
modifications to decoder-only transformers: 1) at the input, we replace the
finite-vocabulary lookup table with a linear projection of the input vectors;
and 2) at the output, we replace the logits prediction (usually mapped to a
categorical distribution) with the parameters of a multivariate Gaussian
mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT,
where transformers are used to model the discrete latent sequences of a VQ-VAE,
we use GIVT to model the unquantized real-valued latent sequences of a
$\beta$-VAE. In class-conditional image generation GIVT outperforms VQ-GAN (and
improved variants thereof) as well as MaskGIT, and achieves performance
competitive with recent latent diffusion models. Finally, we obtain strong
results outside of image generation when applying GIVT to panoptic segmentation
and depth estimation with a VAE variant of the UViM framework. |
This paper introduces GIVT (Generative Infinite-Vocabulary Transformer), a novel transformer decoder-only architecture capable of generating sequences of real-valued vectors, eliminating the need for quantization used in previous methods like VQ-GAN and MaskGIT. |
This work is significant as it presents the first successful attempt at utilizing transformer decoders for generating continuous, unquantized vector sequences, thereby avoiding limitations associated with VQ-based methods. It paves the way for more efficient and higher-quality image generation and representation learning, while also being directly applicable to multimodal interleaved modeling. |
The authors modify the standard transformer decoder architecture by replacing the input embedding lookup table with a linear projection layer for real-valued vectors and predicting parameters of a Gaussian Mixture Model (GMM) at the output. They train GIVT on the latent space of a $\beta$-VAE using teacher forcing and masked language modeling approaches, exploring various sampling techniques like temperature sampling, beam search, and a novel distribution-based classifier-free guidance (DB-CFG). |
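To make the two architectural changes above concrete, here is a small PyTorch sketch of a linear input projection paired with a Gaussian-mixture output head; dimensions, the number of mixture components, and the head layout are illustrative assumptions.

```python
# Sketch of GIVT-style continuous-token modeling: project real-valued latents
# into the transformer width and predict GMM parameters instead of logits.
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

class GMMHead(nn.Module):
    def __init__(self, d_model: int, latent_dim: int, n_mix: int = 8):
        super().__init__()
        self.latent_dim, self.n_mix = latent_dim, n_mix
        # mixture logits + per-component means and log-scales
        self.proj = nn.Linear(d_model, n_mix * (1 + 2 * latent_dim))

    def forward(self, h):                          # h: (B, T, d_model)
        p = self.proj(h)
        logits, rest = p.split([self.n_mix, 2 * self.n_mix * self.latent_dim], dim=-1)
        mu, log_sigma = rest.view(*h.shape[:-1], self.n_mix, 2, self.latent_dim).unbind(-2)
        components = Independent(Normal(mu, log_sigma.exp()), 1)
        return MixtureSameFamily(Categorical(logits=logits), components)

latent_dim, d_model = 16, 512
in_proj = nn.Linear(latent_dim, d_model)            # replaces the embedding lookup table
head = GMMHead(d_model, latent_dim)

z = torch.randn(2, 10, latent_dim)                  # unquantized VAE latent sequence
h = in_proj(z)                                      # in the full model this passes through the transformer
nll = -head(h).log_prob(z).mean()                   # teacher-forcing NLL (targets shifted in practice)
```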
GIVT outperforms VQ-GAN, MaskGIT, and some diffusion models in class-conditional image generation on ImageNet, achieving comparable image quality with a smaller model size and faster sampling. Notably, GIVT demonstrates competitive performance in representation learning and dense prediction tasks like panoptic segmentation and depth estimation using the UViM framework. |
Limitations include the challenge of end-to-end training of VAE and GIVT, which is left for future work. The authors suggest exploring applications of GIVT to other data modalities like audio and time-series modeling. |
diffusion_model, gan, vae, transformer, image_generation, representation_learning, panoptic_segmentation, depth_estimation, gmm, classifier-free_guidance |
2401.05293 |
Score Distillation Sampling with Learned Manifold Corrective |
Thiemo Alldieck, Nikos Kolotouros, Cristian Sminchisescu |
Score Distillation Sampling (SDS) is a recent but already widely popular
method that relies on an image diffusion model to control optimization problems
using text prompts. In this paper, we conduct an in-depth analysis of the SDS
loss function, identify an inherent problem with its formulation, and propose a
surprisingly easy but effective fix. Specifically, we decompose the loss into
different factors and isolate the component responsible for noisy gradients. In
the original formulation, high text guidance is used to account for the noise,
leading to unwanted side effects. Instead, we train a shallow network mimicking
the timestep-dependent denoising deficiency of the image diffusion model in
order to effectively factor it out. We demonstrate the versatility and the
effectiveness of our novel loss formulation through several qualitative and
quantitative experiments, including optimization-based image synthesis and
editing, zero-shot image translation network training, and text-to-3D
synthesis. |
This paper presents an analysis of the Score Distillation Sampling (SDS) loss function, identifies a noise issue in its gradients, and proposes a solution called Learned Manifold Corrective SDS (LMC-SDS) to improve gradient quality and reduce reliance on high guidance weights. |
This paper is important because it addresses limitations of SDS, a popular method for using pre-trained diffusion models as priors in various tasks like image synthesis, editing, and 3D generation. By improving the SDS loss, it enables more stable optimization, better image fidelity, and wider applicability. |
The authors decompose the SDS loss, identify a problematic term causing noisy gradients, and propose LMC-SDS to model and factor out the time-step dependent image corruption in the denoising process. They train a shallow network to approximate this corruption and use it to correct the gradients, promoting movement towards the manifold of natural images. They demonstrate LMC-SDS effectiveness through qualitative and quantitative experiments on image synthesis, editing, image translation network training, and 3D asset generation. |
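The following is only a loose sketch of this idea under stated assumptions: the usual SDS direction is corrected by subtracting the output of a small network that estimates the frozen model's timestep-dependent denoising deficiency; the exact decomposition and network interface differ in the paper.

```python
# Hedged sketch: correct the SDS update by factoring out a learned estimate of the
# denoiser's timestep-dependent deficiency instead of relying on high guidance weights.
import torch

def lmc_sds_direction(eps_model, deficiency_net, x, t, prompt_emb, alphas_cumprod):
    noise = torch.randn_like(x)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_t.sqrt() * x + (1 - a_t).sqrt() * noise          # forward-diffuse the current estimate

    with torch.no_grad():
        eps_pred = eps_model(x_t, t, prompt_emb)             # frozen text-to-image diffusion model
        eps_deficiency = deficiency_net(x_t, t)              # shallow net mimicking the denoising deficiency
    # Plain SDS would use (eps_pred - noise); here the noisy deficiency component is removed.
    return eps_pred - eps_deficiency
```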
The proposed LMC-SDS loss leads to: 1) More stable optimization with less reliance on high guidance weights, resulting in less saturated colors and fewer artifacts. 2) Higher fidelity results in image synthesis and editing tasks, better preserving image structure while achieving significant edits. 3) Improved performance in training image-to-image translation networks, as demonstrated by the 'cats-to-others' experiment. 4) Enhanced detail and reduced Janus problem in 3D asset generation using DreamFusion. |
The paper acknowledges limitations in LMC-SDS, where it might not perform well if the diffusion model doesn't understand the prompt or if the optimization strays too far from the natural image manifold. Future work includes further improving the manifold corrective and applying the findings to specific applications like text-to-3D and image editing. |
diffusion_model, analysis, image_synthesis, image_editing, 3d, text-to-3d, optimization, loss_function, denoising |
2402.15120 |
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing |
Hyunjae Kim, Seunghyun Yoon, Trung Bui, Handong Zhao, Quan Tran, Franck Dernoncourt, Jaewoo Kang |
Contrastive language-image pre-training (CLIP) models have demonstrated
considerable success across various vision-language tasks, such as
text-to-image retrieval, where the model is required to effectively process
natural language input to produce an accurate visual output. However, current
models still face limitations in dealing with linguistic variations in input
queries, such as paraphrases, making it challenging to handle a broad range of
user queries in real-world applications. In this study, we introduce a
straightforward fine-tuning approach to enhance the representations of CLIP
models for paraphrases. Our approach involves a two-step paraphrase generation
process, where we automatically create two categories of paraphrases from
web-scale image captions by leveraging large language models. Subsequently, we
fine-tune the CLIP text encoder using these generated paraphrases while
freezing the image encoder. Our resulting model, which we call ParaCLIP,
exhibits significant improvements over baseline CLIP models across various
tasks, including paraphrased retrieval (with rank similarity scores improved by
up to 2.0% and 5.6%), Visual Genome Relation and Attribution, as well as seven
semantic textual similarity tasks. |
This paper presents ParaCLIP, a fine-tuning approach for CLIP models that enhances their understanding and handling of paraphrased text inputs by leveraging synthetic paraphrases generated from large language models. |
This work addresses the challenge of linguistic variation in text inputs for vision-language tasks, which limits the robustness of existing CLIP models in real-world applications. ParaCLIP improves the representation of paraphrases in CLIP's text encoder, leading to better performance in tasks requiring semantic understanding and compositionality. |
The authors propose a two-step paraphrasing process using LLMs (ChatGPT, LLaMA) to generate two categories of paraphrases for image captions. Then, they fine-tune the CLIP text encoder with these paraphrases while keeping the image encoder frozen. The training objective consists of three InfoNCE losses: image-paraphrase, caption-paraphrase, and paraphrase-paraphrase. |
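A compact sketch of the three contrastive terms is given below, assuming L2-normalized CLIP features for images, captions, and paraphrases; variable names and the shared temperature are illustrative.

```python
# Symmetric InfoNCE over a batch, applied to the three pairings used to fine-tune
# the text encoder (image-paraphrase, caption-paraphrase, paraphrase-paraphrase).
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    logits = a @ b.t() / temperature                       # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)     # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def paraclip_loss(img, cap, para_a, para_b):
    return info_nce(img, para_a) + info_nce(cap, para_a) + info_nce(para_a, para_b)
```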
ParaCLIP consistently outperforms baseline CLIP models in tasks like paraphrased retrieval, Visual Genome Relation and Attribution, and semantic textual similarity. Notably, it significantly improves average overlap and Jaccard similarity scores in paraphrased retrieval, indicating better handling of linguistic variations. The ablation study highlights the importance of each loss function in achieving balanced performance across different tasks. |
The authors acknowledge that their method may sometimes degrade performance on standard vision and vision-language tasks like zero-shot classification and image retrieval, possibly due to limitations in computational resources to use large batch sizes during fine-tuning. Future work involves investigating factors contributing to this performance degradation and exploring the potential of the approach to address compositional understanding limitations in CLIP models. |
clip, paraphrase, fine-tuning, llm, vision-language, image_retrieval, semantic_textual_similarity |
2311.17009 |
Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer |
Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, Tali Dekel |
We present a new method for text-driven motion transfer - synthesizing a
video that complies with an input text prompt describing the target objects and
scene while maintaining an input video's motion and scene layout. Prior methods
are confined to transferring motion across two subjects within the same or
closely related object categories and are applicable for limited domains (e.g.,
humans). In this work, we consider a significantly more challenging setting in
which the target and source objects differ drastically in shape and
fine-grained motion characteristics (e.g., translating a jumping dog into a
dolphin). To this end, we leverage a pre-trained and fixed text-to-video
diffusion model, which provides us with generative and motion priors. The
pillar of our method is a new space-time feature loss derived directly from the
model. This loss guides the generation process to preserve the overall motion
of the input video while complying with the target object in terms of shape and
fine-grained motion traits. |
This paper introduces a novel method for text-driven motion transfer in videos, enabling the transfer of motion from a source video to a target object specified by a text prompt, even when the source and target objects have significant differences in shape and motion characteristics. |
This paper pushes the boundaries of motion transfer beyond previous methods limited to similar object categories. It offers a zero-shot approach, leveraging the generative capabilities of pre-trained text-to-video diffusion models for a more versatile and accessible motion transfer solution. |
The authors analyze the space-time features learned by a text-to-video diffusion model and introduce a novel loss function based on pairwise differences of spatial marginal mean features. This loss guides the generation process to preserve motion characteristics while accommodating significant structural deviations between source and target objects. |
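The loss can be pictured with the short sketch below: per-frame spatial marginal means of diffusion features are compared through their pairwise differences across frames, so overall motion is preserved while frame-wise appearance is free to change. Feature extraction from the text-to-video model is abstracted into a tensor here.

```python
# Hedged sketch of a pairwise spatial-marginal-mean feature loss.
import torch

def motion_feature_loss(feat_src, feat_gen):
    # feat_*: (T, C, H, W) space-time diffusion features for T frames
    smm_src = feat_src.mean(dim=(-2, -1))                    # (T, C) spatial marginal means
    smm_gen = feat_gen.mean(dim=(-2, -1))
    diff_src = smm_src[:, None, :] - smm_src[None, :, :]     # (T, T, C) frame-to-frame differences
    diff_gen = smm_gen[:, None, :] - smm_gen[None, :, :]
    return (diff_gen - diff_src).pow(2).mean()               # match motion, not per-frame appearance
```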
The proposed method demonstrates state-of-the-art performance in preserving motion fidelity while adhering to the target text prompt. It outperforms existing methods in qualitative and quantitative comparisons, showcasing successful motion transfer across diverse object categories with significant shape variations. User studies further confirm the superiority of the generated videos, highlighting their improved quality and adherence to the target prompts. |
The method's reliance on the pre-trained text-to-video model's generative capabilities poses limitations. The model's training data might not encompass all possible object-motion combinations, leading to reduced motion fidelity or artifacts. Future work could explore larger and more diverse training datasets for text-to-video models and investigate alternative optimization strategies to further enhance motion fidelity in challenging cases. |
diffusion_model, motion, video, text-to-video, motion_transfer, zero-shot |
2405.04517 |
xLSTM: Extended Long Short-Term Memory |
Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter |
In the 1990s, the constant error carousel and gating were introduced as the
central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have
stood the test of time and contributed to numerous deep learning success
stories, in particular they constituted the first Large Language Models (LLMs).
However, the advent of the Transformer technology with parallelizable
self-attention at its core marked the dawn of a new era, outpacing LSTMs at
scale. We now raise a simple question: How far do we get in language modeling
when scaling LSTMs to billions of parameters, leveraging the latest techniques
from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we
introduce exponential gating with appropriate normalization and stabilization
techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM
with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that
is fully parallelizable with a matrix memory and a covariance update rule.
Integrating these LSTM extensions into residual block backbones yields xLSTM
blocks that are then residually stacked into xLSTM architectures. Exponential
gating and modified memory structures boost xLSTM capabilities to perform
favorably when compared to state-of-the-art Transformers and State Space
Models, both in performance and scaling. |
The paper introduces Extended Long Short-Term Memory (xLSTM), a novel recurrent neural network architecture that builds upon the original LSTM by introducing exponential gating with memory mixing and a new memory structure, achieving comparable and in many cases better performance than Transformers and State Space Models in language modeling. |
This paper is important because it revives the LSTM for large language models, showing that LSTMs, when properly scaled and enhanced, can compete with and even surpass the performance of dominant architectures like Transformers and State Space Models, potentially impacting various deep learning fields. |
The authors introduce two new LSTM variants: sLSTM with exponential gating and a scalar memory, and mLSTM with exponential gating and a matrix memory using a covariance update rule. They integrate these variants into residual blocks, stack them to form xLSTM architectures, and evaluate them on synthetic tasks, the Long Range Arena, and language modeling benchmarks (SlimPajama and PALOMA). |
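As an illustration of the exponential gating mentioned above, here is a simplified single-step sLSTM recurrence with the normalizer and log-domain stabilizer states; the block structure, memory mixing across heads, and weight shapes are omitted, and the interface is an assumption.

```python
# Simplified sLSTM step: exponential input/forget gates with a normalizer state n
# and a stabilizer state m to keep the exponentials numerically safe.
import torch

def slstm_step(x_t, h_prev, c_prev, n_prev, m_prev, W, R, b):
    z_pre, i_tilde, f_tilde, o_tilde = (W @ x_t + R @ h_prev + b).chunk(4)
    z = torch.tanh(z_pre)                            # cell input
    o = torch.sigmoid(o_tilde)                       # output gate

    m_t = torch.maximum(f_tilde + m_prev, i_tilde)   # running log-scale for stabilization
    i = torch.exp(i_tilde - m_t)                     # exponential input gate
    f = torch.exp(f_tilde + m_prev - m_t)            # exponential forget gate

    c_t = f * c_prev + i * z                         # scalar cell memory
    n_t = f * n_prev + i                             # normalizer state
    h_t = o * (c_t / n_t)                            # normalized hidden state
    return h_t, c_t, n_t, m_t
```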
The xLSTM architecture outperforms state-of-the-art Transformers, State Space Models, and RNNs in most experiments. Notably, xLSTM excels at sequence length extrapolation, consistently maintaining low perplexity even for longer contexts unseen during training, exhibits superior memory capacity in associative recall tasks, demonstrates strong performance on the Long Range Arena, and achieves state-of-the-art results in perplexity and downstream tasks on both SlimPajama and PALOMA language modeling benchmarks. |
Limitations: sLSTM lacks parallelizability due to memory mixing; current CUDA kernels for mLSTM are not fully optimized; mLSTM's matrix memory has high computational complexity; initialization of forget gates requires careful consideration; longer context sizes might overload the matrix memory. Future work: optimizing CUDA kernels for both sLSTM and mLSTM; exploring alternative memory structures with lower computational complexity; extensive architecture and hyperparameter optimization for larger xLSTM models; application of xLSTM to other deep learning domains beyond language modeling. |
lstm, language_model, llm, rnn, transformer, state_space_model, gating, memory, analysis, scaling_law, sequence_length_extrapolation |
2404.11614 |
Dynamic Typography: Bringing Text to Life via Video Diffusion Prior |
Zichen Liu, Yihao Meng, Hao Ouyang, Yue Yu, Bolin Zhao, Daniel Cohen-Or, Huamin Qu |
Text animation serves as an expressive medium, transforming static
communication into dynamic experiences by infusing words with motion to evoke
emotions, emphasize meanings, and construct compelling narratives. Crafting
animations that are semantically aware poses significant challenges, demanding
expertise in graphic design and animation. We present an automated text
animation scheme, termed "Dynamic Typography", which combines two challenging
tasks. It deforms letters to convey semantic meaning and infuses them with
vibrant movements based on user prompts. Our technique harnesses vector
graphics representations and an end-to-end optimization-based framework. This
framework employs neural displacement fields to convert letters into base
shapes and applies per-frame motion, encouraging coherence with the intended
textual concept. Shape preservation techniques and perceptual loss
regularization are employed to maintain legibility and structural integrity
throughout the animation process. We demonstrate the generalizability of our
approach across various text-to-video models and highlight the superiority of
our end-to-end methodology over baseline methods, which might comprise separate
tasks. Through quantitative and qualitative evaluations, we demonstrate the
effectiveness of our framework in generating coherent text animations that
faithfully interpret user prompts while maintaining readability. Our code is
available at: https://animate-your-word.github.io/demo/. |
This paper introduces "Dynamic Typography," a method for animating individual letters within words by deforming them to embody semantic meaning and infusing them with vivid movements based on user prompts. |
This paper is important because it automates the creation of expressive and semantically aware text animations, a task traditionally requiring significant expertise in graphic design and animation. This approach makes text animation more accessible and efficient. |
The authors use an end-to-end optimization-based framework that leverages vector graphics representations of letters. They employ neural displacement fields to deform letters into base shapes and apply per-frame motion guided by a pre-trained text-to-video model. They ensure legibility and structural integrity using perceptual loss regularization and shape preservation techniques. |
The proposed method generates consistent and prompt-aware text animations while preserving legibility, outperforming baseline methods in quantitative and qualitative evaluations. The authors demonstrate the generalizability of their approach across various text-to-video models. |
The authors acknowledge limitations regarding the motion quality being bounded by the capabilities of the video foundation model. Future work could explore incorporating future advancements in diffusion-based video foundation models. Additionally, challenges remain when user prompts significantly deviate from the original letter shapes, requiring further research to balance semantic representation with legibility. |
diffusion_model, animation, text-to-video, kinetic_typography, svg, interpretability |
2308.14761 |
Unified Concept Editing in Diffusion Models |
Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau |
Text-to-image models suffer from various safety issues that may limit their
suitability for deployment. Previous methods have separately addressed
individual issues of bias, copyright, and offensive content in text-to-image
models. However, in the real world, all of these issues appear simultaneously
in the same model. We present a method that tackles all issues with a single
approach. Our method, Unified Concept Editing (UCE), edits the model without
training using a closed-form solution, and scales seamlessly to concurrent
edits on text-conditional diffusion models. We demonstrate scalable
simultaneous debiasing, style erasure, and content moderation by editing
text-to-image projections, and we present extensive experiments demonstrating
improved efficacy and scalability over prior work. Our code is available at
https://unified.baulab.info |
This paper introduces Unified Concept Editing (UCE), a closed-form model editing method for text-to-image diffusion models that can erase, moderate, and debias multiple concepts simultaneously without retraining. |
This work addresses limitations in existing methods that handle bias, copyright, and offensive content separately in text-to-image models. UCE provides a unified, efficient, and scalable solution to tackle these issues concurrently, paving the way for safer and more responsible deployment of these models. |
UCE builds upon prior model editing techniques like TIME and MEMIT, generalizing their closed-form weight update solutions for linear projection layers in diffusion models. By directly modifying cross-attention weights, it aligns text embeddings to manipulate concept generation. The method employs different target output strategies for each edit type: erasing associates concepts with different outputs, debiasing adjusts attribute magnitudes, and moderation replaces outputs with generic responses. |
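The closed-form flavor of the edit can be sketched as a regularized least-squares update of a single projection matrix, mapping edited concept embeddings to target outputs while pinning the outputs of preserved concepts; the regularization and exact normal-equation form below are assumptions in the spirit of TIME/MEMIT-style edits.

```python
# Hedged sketch of a closed-form cross-attention weight edit.
import torch

def closed_form_edit(W_old, K_edit, V_target, K_preserve, reg=1e-2):
    # W_old: (d_out, d_in) cross-attention projection
    # K_edit: (n_e, d_in) embeddings of concepts to erase/debias/moderate
    # V_target: (n_e, d_out) desired outputs for those concepts
    # K_preserve: (n_p, d_in) embeddings whose outputs must stay unchanged
    d_in = W_old.shape[1]
    A = V_target.t() @ K_edit + W_old @ (K_preserve.t() @ K_preserve)
    B = K_edit.t() @ K_edit + K_preserve.t() @ K_preserve + reg * torch.eye(d_in)
    return A @ torch.linalg.inv(B)              # updated weights, no gradient-based training
```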
UCE demonstrates superior performance in erasing artistic styles while minimizing interference with unrelated concepts, outperforming baselines like ESD and Concept Ablation. It effectively debiases gender and racial biases in profession representations, surpassing existing methods in achieving balanced attribute distributions. Additionally, UCE exhibits comparable or better NSFW content moderation capabilities compared to ESD, while maintaining higher image quality and text-image alignment. |
The authors acknowledge limitations in addressing compounding biases when debiasing across multiple attributes, as well as challenges posed by compositional bias effects in prompts. They also note that excessive artistic style erasures can degrade overall model performance, suggesting a need to preserve a critical mass of artistic knowledge. Future work could focus on mitigating these limitations, exploring joint attribute debiasing, and developing techniques to handle compositional bias. |
diffusion_model, gan, analysis, adversarial_attack, interpretability, debias, erasure, moderation |
2312.07991 |
Accelerating the Global Aggregation of Local Explanations |
Alon Mor, Yonatan Belinkov, Benny Kimelfeld |
Local explanation methods highlight the input tokens that have a considerable
impact on the outcome of classifying the document at hand. For example, the
Anchor algorithm applies a statistical analysis of the sensitivity of the
classifier to changes in the token. Aggregating local explanations over a
dataset provides a global explanation of the model. Such aggregation aims to
detect words with the most impact, giving valuable insights about the model,
like what it has learned in training and which adversarial examples expose its
weaknesses. However, standard aggregation methods bear a high computational
cost: a naïve implementation applies a costly algorithm to each token of each
document, and hence, it is infeasible for a simple user running in the scope of
a short analysis session. We devise techniques for accelerating the global
aggregation of the Anchor algorithm. Specifically, our goal is to compute a set
of top-$k$ words with the highest global impact according to different
aggregation functions. Some of our techniques are lossless and some are lossy.
We show that for a very mild loss of quality, we are able to accelerate the
computation by up to 30$\times$, reducing the computation from hours to
minutes. We also devise and study a probabilistic model that accounts for noise
in the Anchor algorithm and diminishes the bias toward words that are frequent
yet low in impact. |
This paper tackles the challenge of efficiently identifying the top-k most impactful words in a document collection for explaining text classifiers, focusing on global aggregation of the Anchor algorithm's local explanations. |
Global aggregation of local explanations like Anchor is computationally expensive, hindering online analysis. This work provides both a novel probabilistic aggregation method that improves the quality of results and runtime optimizations making it practical for interactive use. |
The authors propose a probabilistic model (GPR) to estimate the importance of words as explanations, considering frequency and noise. They introduce runtime optimizations including incremental evaluation, candidate filtering, and adjusted hyperparameters for Anchor. Experiments evaluate the quality and speed of their approach across various datasets and classification tasks. |
GPR consistently outperforms baseline aggregations in identifying impactful terms. Optimizations, particularly increasing the confidence parameter (delta) in Anchor, significantly accelerate computation (up to 30x) with minimal or even positive impact on quality. Case studies demonstrate the interpretability of identified terms. |
Future work includes extending the approach to multi-word terms, adapting the optimizations to other local attribution methods, and exploring alternative document traversal orders during aggregation. |
analysis, interpretability, local_explanation, global_explanation, anchor_algorithm, text_classification, runtime_optimization, anytime_algorithm |
2311.09257 |
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs |
Yanwu Xu, Yang Zhao, Zhisheng Xiao, Tingbo Hou |
Text-to-image diffusion models have demonstrated remarkable capabilities in
transforming textual prompts into coherent images, yet the computational cost
of their inference remains a persistent challenge. To address this issue, we
present UFOGen, a novel generative model designed for ultra-fast, one-step
text-to-image synthesis. In contrast to conventional approaches that focus on
improving samplers or employing distillation techniques for diffusion models,
UFOGen adopts a hybrid methodology, integrating diffusion models with a GAN
objective. Leveraging a newly introduced diffusion-GAN objective and
initialization with pre-trained diffusion models, UFOGen excels in efficiently
generating high-quality images conditioned on textual descriptions in a single
step. Beyond traditional text-to-image generation, UFOGen showcases versatility
in applications. Notably, UFOGen stands among the pioneering models enabling
one-step text-to-image generation and diverse downstream tasks, presenting a
significant advancement in the landscape of efficient generative models. |
This paper introduces UFOGen, a novel text-to-image generative model that leverages a hybrid approach combining diffusion models with a Generative Adversarial Network (GAN) objective to enable ultra-fast, one-step image generation from text prompts. |
This paper is important because it addresses a key limitation of traditional text-to-image diffusion models, namely their slow inference speed due to the multi-step denoising process. UFOGen's ability to generate high-quality images in a single step significantly improves efficiency and expands the potential applications of such models. |
The authors achieve one-step generation by modifying existing diffusion-GAN hybrid models in two key ways: 1) They introduce a new generator parameterization that samples from the forward diffusion process instead of the posterior, allowing for distribution matching at the clean image level. 2) They enhance the reconstruction loss to explicitly match the generated clean image with the target. By initializing UFOGen with a pre-trained Stable Diffusion model, they leverage existing knowledge about text-image relationships and achieve stable training with fast convergence. |
UFOGen successfully generates high-quality images from text prompts in a single step, outperforming existing few-step diffusion models like Progressive Distillation and Latent Consistency Models in terms of visual quality. It demonstrates comparable performance to InstaFlow while offering advantages in training efficiency and a simpler training pipeline. Furthermore, UFOGen exhibits versatility by successfully adapting to downstream tasks like image-to-image generation and controllable generation, highlighting its flexibility and broader applicability. |
The paper acknowledges limitations common to SD-based models, such as object missing, attribute leakage, and counting errors. Future work could focus on addressing these limitations and further exploring UFOGen's potential in more complex generative scenarios, such as video generation or 3D object synthesis. Additionally, investigating the model's capabilities under various guidance scales and comparing its performance against a wider range of text-to-image models would provide a more comprehensive understanding of its strengths and limitations. |
diffusion_model, gan, text-to-image, one-step_generation, image-to-image, controllable_generation |
2403.07860 |
Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation |
Shihao Zhao, Shaozhe Hao, Bojia Zi, Huaizhe Xu, Kwan-Yee K. Wong |
Text-to-image generation has made significant advancements with the
introduction of text-to-image diffusion models. These models typically consist
of a language model that interprets user prompts and a vision model that
generates corresponding images. As language and vision models continue to
progress in their respective domains, there is a great potential in exploring
the replacement of components in text-to-image diffusion models with more
advanced counterparts. A broader research objective would therefore be to
investigate the integration of any two unrelated language and generative vision
models for text-to-image generation. In this paper, we explore this objective
and propose LaVi-Bridge, a pipeline that enables the integration of diverse
pre-trained language models and generative vision models for text-to-image
generation. By leveraging LoRA and adapters, LaVi-Bridge offers a flexible and
plug-and-play approach without requiring modifications to the original weights
of the language and vision models. Our pipeline is compatible with various
language models and generative vision models, accommodating different
structures. Within this framework, we demonstrate that incorporating superior
modules, such as more advanced language models or generative vision models,
results in notable improvements in capabilities like text alignment or image
quality. Extensive evaluations have been conducted to verify the effectiveness
of LaVi-Bridge. Code is available at
https://github.com/ShihaoZhaoZSH/LaVi-Bridge. |
This paper introduces LaVi-Bridge, a novel framework designed for text-to-image diffusion models, aiming to seamlessly integrate various pre-trained language models and generative vision models. |
This research is crucial due to the rapid advancements in both language and vision models, making it challenging to integrate them into existing text-to-image diffusion models. LaVi-Bridge addresses this challenge by offering a flexible and efficient way to combine diverse models, potentially leading to significant improvements in text-to-image generation capabilities. |
LaVi-Bridge employs LoRA (Low-Rank Adaptation) to inject trainable parameters into pre-trained language and vision models without altering their original weights. Additionally, it utilizes an adapter to bridge the gap between these two modules, facilitating effective communication and alignment. The framework is trained on a dataset of text-image pairs, enabling the integrated models to generate coherent and contextually relevant images from textual prompts. |
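A minimal sketch of the bridging component is shown below: a small trainable adapter maps frozen language-model token features into the conditioning dimension the vision model's cross-attention expects, while LoRA handles the in-model adaptation. Dimensions and the adapter architecture are assumptions.

```python
# Illustrative bridge adapter between a language model and a generative vision model.
import torch
import torch.nn as nn

class BridgeAdapter(nn.Module):
    def __init__(self, lm_dim=4096, cond_dim=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(lm_dim, hidden), nn.GELU(), nn.Linear(hidden, cond_dim))
        self.norm = nn.LayerNorm(cond_dim)

    def forward(self, lm_hidden_states):                   # (B, T, lm_dim), e.g. Llama-2 features
        return self.norm(self.net(lm_hidden_states))       # (B, T, cond_dim) cross-attention context
```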
Experiments demonstrate LaVi-Bridge's effectiveness in integrating various language models (CLIP, T5 series, Llama-2) and vision models (U-Net, Vision Transformer). Notably, incorporating superior models leads to enhanced performance, such as improved semantic understanding with advanced language models (e.g., Llama-2) and enhanced image quality and aesthetics with powerful vision models (e.g., PixArt's Transformer). |
The authors acknowledge that while LaVi-Bridge exhibits promising results, training with it on the same models and weights as existing text-to-image diffusion models may not always yield significant improvements. They emphasize that LaVi-Bridge primarily aims to integrate diverse language and vision models, enabling the use of more advanced models for potential performance enhancements. Future research directions could explore larger and more diverse datasets to further improve LaVi-Bridge's versatility and address the limitations associated with training data diversity. |
diffusion_model, text-to-image, language_model, vision_model, lora, adapter, image_generation, semantic_understanding, image_quality |
2403.19716 |
Capability-aware Prompt Reformulation Learning for Text-to-Image Generation |
Jingtao Zhan, Qingyao Ai, Yiqun Liu, Jia Chen, Shaoping Ma |
Text-to-image generation systems have emerged as revolutionary tools in the
realm of artistic creation, offering unprecedented ease in transforming textual
prompts into visual art. However, the efficacy of these systems is intricately
linked to the quality of user-provided prompts, which often poses a challenge
to users unfamiliar with prompt crafting. This paper addresses this challenge
by leveraging user reformulation data from interaction logs to develop an
automatic prompt reformulation model. Our in-depth analysis of these logs
reveals that user prompt reformulation is heavily dependent on the individual
user's capability, resulting in significant variance in the quality of
reformulation pairs. To effectively use this data for training, we introduce
the Capability-aware Prompt Reformulation (CAPR) framework. CAPR innovatively
integrates user capability into the reformulation process through two key
components: the Conditional Reformulation Model (CRM) and Configurable
Capability Features (CCF). CRM reformulates prompts according to a specified
user capability, as represented by CCF. The CCF, in turn, offers the
flexibility to tune and guide the CRM's behavior. This enables CAPR to
effectively learn diverse reformulation strategies across various user
capacities and to simulate high-capability user reformulation during inference.
Extensive experiments on standard text-to-image generation benchmarks showcase
CAPR's superior performance over existing baselines and its remarkable
robustness on unseen systems. Furthermore, comprehensive analyses validate the
effectiveness of different components. CAPR can facilitate user-friendly
interaction with text-to-image systems and make advanced artistic creation more
achievable for a broader range of users. |
This paper presents CAPR, a novel capability-aware prompt reformulation framework designed for text-to-image generation, which leverages user interaction logs to automatically improve user prompts. |
This work addresses the challenge of crafting effective prompts for text-to-image generation systems, a task often difficult for average users. It's significant because it's the first to leverage interaction logs for this purpose, offering a practical solution to enhance user experience and generation quality. |
The authors analyze interaction logs to understand user reformulation patterns and develop CAPR, comprising a Conditional Reformulation Model (CRM) and Configurable Capability Features (CCF). They train CRM on reformulation pairs, conditioned on CCF representing user capability. During inference, CCF is optimized to guide CRM towards high-quality reformulations. |
Experimental results demonstrate that CAPR significantly outperforms various baselines, including large language models and models trained on synthetic data. It exhibits strong performance on both seen and unseen text-to-image generation systems, demonstrating its effectiveness and robustness. |
The paper acknowledges that finding the optimal configuration for CCF can be time-consuming, though mitigated by techniques like Bayesian optimization. Future work could explore alternative CCF representations or personalize reformulations based on individual user styles. |
diffusion_model, text-to-image_generation, prompt_reformulation, analysis, log_analysis |
2309.16671 |
Demystifying CLIP Data |
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer |
Contrastive Language-Image Pre-training (CLIP) is an approach that has
advanced research and applications in computer vision, fueling modern
recognition systems and generative models. We believe that the main ingredient
to the success of CLIP is its data and not the model architecture or
pre-training objective. However, CLIP only provides very limited information
about its data and how it has been collected, leading to works that aim to
reproduce CLIP's data by filtering with its model parameters. In this work, we
intend to reveal CLIP's data curation approach and in our pursuit of making it
open to the community introduce Metadata-Curated Language-Image Pre-training
(MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's
concepts) and yields a balanced subset over the metadata distribution. Our
experimental study rigorously isolates the model and training settings,
concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M
image-text data pairs outperforms CLIP's data on multiple standard benchmarks.
In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy,
surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining
the same training budget, attains 72.4%. Our observations hold across various
model sizes, exemplified by ViT-H achieving 80.5%, without any
bells-and-whistles. Curation code and training data distribution on metadata is
made available at https://github.com/facebookresearch/MetaCLIP. |
This paper investigates the data curation process behind CLIP, proposing MetaCLIP, a transparent algorithm that uses metadata and balancing techniques to create high-quality image-text datasets from web sources like CommonCrawl. |
The paper is important because it sheds light on the critical role of data curation in the success of CLIP, provides a method to reproduce and potentially outperform CLIP's dataset, and emphasizes the importance of data transparency in AI. |
The authors meticulously reconstruct CLIP's metadata and analyze the sub-string matching and balancing techniques likely employed in CLIP's data curation. They then propose MetaCLIP, an algorithm that takes a raw data pool and metadata as input and outputs a balanced dataset. They evaluate MetaCLIP by training vision models using their curated data and comparing the performance against models trained on CLIP's data and other publicly available datasets. |
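The curation recipe can be summarized by the sketch below: substring-match captions against the metadata entries, then cap how many pairs any single entry can contribute, so head entries are down-sampled and tail entries are kept. The matching rules are simplified, and the cap value is the one reported for the 400M setting.

```python
# Condensed sketch of metadata matching and balancing over a raw (image_url, caption) pool.
import random
from collections import defaultdict

def curate(pool, metadata, t=20_000, seed=0):
    rng = random.Random(seed)
    per_entry = defaultdict(list)
    for item in pool:                             # item = (image_url, caption)
        caption = item[1].lower()
        for entry in metadata:                    # substring match against metadata entries
            if entry in caption:
                per_entry[entry].append(item)

    curated = []
    for entry, items in per_entry.items():
        if len(items) > t:                        # balance: down-sample head entries to t,
            items = rng.sample(items, t)          # keep everything for tail entries
        curated.extend(items)
    return curated
```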
MetaCLIP, trained on a 400M image-text pair dataset curated from CommonCrawl, outperforms CLIP's proprietary WIT400M dataset on multiple benchmarks, including ImageNet zero-shot classification. Scaling MetaCLIP to 1B and 2.5B data points further improves accuracy, achieving unprecedented results for various ViT model sizes, all within the same training budget as the original CLIP. |
The authors acknowledge that their reconstruction of CLIP's metadata might not be perfectly accurate due to limited information available publicly. They also plan to improve the scalability of their data pipeline for handling even larger datasets. Further research is needed to explore the impact of different metadata sources and balancing strategies. |
diffusion_model, clip, analysis, data_curation, image_text, zero_shot |
2311.11919 |
An Image is Worth Multiple Words: Multi-attribute Inversion for Constrained Text-to-Image Synthesis |
Aishwarya Agarwal, Srikrishna Karanam, Tripti Shukla, Balaji Vasan Srinivasan |
We consider the problem of constraining diffusion model outputs with a
user-supplied reference image. Our key objective is to extract multiple
attributes (e.g., color, object, layout, style) from this single reference
image, and then generate new samples with them. One line of existing work
proposes to invert the reference images into a single textual conditioning
vector, enabling generation of new samples with this learned token. These
methods, however, do not learn multiple tokens that are necessary to condition
model outputs on the multiple attributes noted above. Another line of
techniques expand the inversion space to learn multiple embeddings but they do
this only along the layer dimension (e.g., one per layer of the DDPM model) or
the timestep dimension (one for a set of timesteps in the denoising process),
leading to suboptimal attribute disentanglement. To address the aforementioned
gaps, the first contribution of this paper is an extensive analysis to
determine which attributes are captured in which dimension of the denoising
process. As noted above, we consider both the time-step dimension (in reverse
denoising) as well as the DDPM model layer dimension. We observe that often a
subset of these attributes are captured in the same set of model layers and/or
across same denoising timesteps. For instance, color and style are captured
across same U-Net layers, whereas layout and color are captured across same
timestep stages. Consequently, an inversion process that is designed only for
the time-step dimension or the layer dimension is insufficient to disentangle
all attributes. This leads to our second contribution where we design a new
multi-attribute inversion algorithm, MATTE, with associated
disentanglement-enhancing regularization losses, that operates across both
dimensions and explicitly leads to four disentangled tokens (color, style,
layout, and object). |
This paper presents MATTE, a novel multi-attribute inversion algorithm for text-to-image diffusion models, enabling the extraction and disentanglement of color, object, layout, and style attributes from a single reference image for controlled image synthesis. |
This work is significant because it addresses the limitations of existing inversion methods that struggle to disentangle multiple visual attributes from a reference image. By learning disentangled tokens for color, object, layout, and style, MATTE enables more fine-grained control over image generation conditioned on a reference. |
The authors first conduct an extensive analysis of attribute distribution across layers and timesteps in the diffusion process. Informed by this analysis, they propose MATTE, which learns separate tokens for each attribute and trains them to influence specific layers or timesteps, thus achieving disentanglement. They introduce a novel loss function that encourages reconstruction fidelity while enforcing disentanglement among color, style, object, and layout. |
MATTE demonstrates superior performance in extracting and transferring individual attributes and their combinations from a reference image to new generations. Qualitative results showcase its ability to control color, object, layout, and style independently, outperforming existing methods like P+ and ProSpect. Quantitative evaluations using CLIP similarity scores further validate the effectiveness of MATTE in learning disentangled and semantically meaningful attribute tokens. |
The paper acknowledges limitations in terms of computational cost for the inversion process. Additionally, it recognizes that the final generation quality is limited by the base diffusion model's capabilities. Future work could focus on optimizing the efficiency of the inversion algorithm and exploring alternative methods to improve attribute control during generation, such as fine-tuning model weights. |
diffusion_model, inversion, text-to-image, attribute-guided, disentanglement, analysis, image_synthesis, reference_image |
2308.10718 |
Backdooring Textual Inversion for Concept Censorship |
Yutong Wu, Jie Zhang, Florian Kerschbaum, Tianwei Zhang |
Recent years have witnessed success in AIGC (AI Generated Content). People
can make use of a pre-trained diffusion model to generate images of high
quality or freely modify existing pictures with only prompts in nature
language. More excitingly, the emerging personalization techniques make it
feasible to create specific-desired images with only a few images as
references. However, this induces severe threats if such advanced techniques
are misused by malicious users, such as spreading fake news or defaming
individual reputations. Thus, it is necessary to regulate personalization
models (i.e., concept censorship) for their development and advancement.
In this paper, we focus on the personalization technique dubbed Textual
Inversion (TI), which is becoming prevailing for its lightweight nature and
excellent performance. TI crafts the word embedding that contains detailed
information about a specific object. Users can easily download the word
embedding from public websites like Civitai and add it to their own stable
diffusion model without fine-tuning for personalization. To achieve the concept
censorship of a TI model, we propose leveraging the backdoor technique for good
by injecting backdoors into the Textual Inversion embeddings. Briefly, we
select some sensitive words as triggers during the training of TI, which will
be censored for normal use. In the subsequent generation stage, if the triggers
are combined with personalized embeddings as final prompts, the model will
output a pre-defined target image rather than images including the desired
malicious concept.
To demonstrate the effectiveness of our approach, we conduct extensive
experiments on Stable Diffusion, a prevailing open-sourced text-to-image model.
Our code, data, and results are available at
https://concept-censorship.github.io. |
This paper presents a novel method for concept censorship in AI image generation by backdooring Textual Inversion (TI), a popular personalization technique. |
This paper addresses the growing concern of misuse of AI image generation for malicious purposes like spreading misinformation or creating harmful content, by proposing a method to regulate personalization models without completely disabling them. |
The authors propose a two-term loss function for training TI, incorporating a backdoor term that associates specific trigger words (sensitive concepts) with pre-defined target images, effectively preventing the generation of undesired content when those words are present in the prompt. |
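A rough sketch of such a two-term objective is given below: the usual Textual Inversion reconstruction loss teaches the embedding the target concept, while a backdoor term maps prompts that combine the embedding with a blacklisted trigger word onto a pre-defined target image. The model interface and weighting are assumptions.

```python
# Hedged sketch of a backdoored Textual Inversion loss.
import torch
import torch.nn.functional as F

def backdoored_ti_loss(unet, x_t, t, noise, emb_concept_prompt,
                       x_t_target, noise_target, emb_trigger_prompt, lam=1.0):
    # Benign term: standard Textual Inversion denoising loss for the learned embedding.
    loss_ti = F.mse_loss(unet(x_t, t, emb_concept_prompt), noise)

    # Backdoor term: with the trigger word in the prompt, the denoising trajectory
    # should reconstruct a noised version of the pre-defined target image instead.
    loss_backdoor = F.mse_loss(unet(x_t_target, t, emb_trigger_prompt), noise_target)
    return loss_ti + lam * loss_backdoor
```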
Experiments demonstrate the effectiveness of their method in censoring single words and blacklists of words, while preserving the utility of the TI for benign use. The method also exhibits robustness against potential countermeasures like word embedding removal and perturbation. |
Limitations include the need for the publisher to retrain the TI model and the dependence on hyperparameter tuning. Future work could explore data-free approaches, reduce reliance on hyperparameters, and investigate semantic-wise censoring for improved practicality. |
diffusion_model, textual_inversion, backdoor_attack, concept_censorship, aigc, misinformation, ethics |
2309.11497 |
FreeU: Free Lunch in Diffusion U-Net |
Chenyang Si, Ziqi Huang, Yuming Jiang, Ziwei Liu |
In this paper, we uncover the untapped potential of diffusion U-Net, which
serves as a "free lunch" that substantially improves the generation quality on
the fly. We initially investigate the key contributions of the U-Net
architecture to the denoising process and identify that its main backbone
primarily contributes to denoising, whereas its skip connections mainly
introduce high-frequency features into the decoder module, causing the network
to overlook the backbone semantics. Capitalizing on this discovery, we propose
a simple yet effective method-termed "FreeU" - that enhances generation quality
without additional training or finetuning. Our key insight is to strategically
re-weight the contributions sourced from the U-Net's skip connections and
backbone feature maps, to leverage the strengths of both components of the
U-Net architecture. Promising results on image and video generation tasks
demonstrate that our FreeU can be readily integrated to existing diffusion
models, e.g., Stable Diffusion, DreamBooth, ModelScope, Rerender and ReVersion,
to improve the generation quality with only a few lines of code. All you need
is to adjust two scaling factors during inference. Project page:
https://chenyangsi.top/FreeU/. |
This paper introduces FreeU, a method for improving the sample quality of diffusion models during inference by re-weighting the contributions of skip connections and backbone features in the U-Net architecture. |
The paper is important because it addresses a critical gap in diffusion model research by focusing on the under-explored potential of the U-Net architecture itself, leading to improved generation quality without requiring additional training or increasing computational costs. |
The authors conducted experiments using various diffusion models, including Stable Diffusion, DreamBooth, ModelScope, and Rerender, applying FreeU during inference. They analyzed the impact of backbone and skip connection scaling factors on the generated images and videos, comparing them with the baseline models. |
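The inference-time re-weighting can be pictured with the short sketch below: amplify the backbone feature map by one factor and damp the low-frequency content of the skip-connection features by another. The Fourier mask and factor values are illustrative rather than taken from a specific codebase.

```python
# Hedged sketch of FreeU-style re-weighting inside the U-Net decoder.
import torch

def freeu_reweight(backbone_feat, skip_feat, b=1.2, s=0.9, radius=1):
    backbone_feat = backbone_feat * b                        # strengthen backbone semantics

    # Attenuate low frequencies of the skip features in the Fourier domain.
    fft = torch.fft.fftshift(torch.fft.fft2(skip_feat.float()), dim=(-2, -1))
    _, _, H, W = skip_feat.shape
    mask = torch.ones_like(fft.real)
    mask[..., H // 2 - radius:H // 2 + radius, W // 2 - radius:W // 2 + radius] = s
    skip_filtered = torch.fft.ifft2(torch.fft.ifftshift(fft * mask, dim=(-2, -1))).real
    return backbone_feat, skip_filtered.to(skip_feat.dtype)
```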
The key finding is that FreeU significantly improves the quality of generated images and videos across various tasks, including text-to-image synthesis, text-to-video generation, image editing, and video-to-video translation. Notably, FreeU achieves these enhancements without requiring any additional training or fine-tuning of the models, making it a practical solution for enhancing diffusion model output. |
The paper does not explicitly discuss limitations; however, potential future work could explore the optimal balancing of backbone and skip-connection features for specific tasks. Additionally, investigating the application of FreeU in other diffusion model architectures beyond the U-Net would be beneficial. |
diffusion_model, u-net, image_generation, video_generation, sample_quality, denoising, freeu |
2311.12092 |
Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models |
Rohit Gandikota, Joanna Materzynska, Tingrui Zhou, Antonio Torralba, David Bau |
We present a method to create interpretable concept sliders that enable
precise control over attributes in image generations from diffusion models. Our
approach identifies a low-rank parameter direction corresponding to one concept
while minimizing interference with other attributes. A slider is created using
a small set of prompts or sample images; thus slider directions can be created
for either textual or visual concepts. Concept Sliders are plug-and-play: they
can be composed efficiently and continuously modulated, enabling precise
control over image generation. In quantitative experiments comparing to
previous editing techniques, our sliders exhibit stronger targeted edits with
lower interference. We showcase sliders for weather, age, styles, and
expressions, as well as slider compositions. We show how sliders can transfer
latents from StyleGAN for intuitive editing of visual concepts for which
textual description is difficult. We also find that our method can help address
persistent quality issues in Stable Diffusion XL including repair of object
deformations and fixing distorted hands. Our code, data, and trained sliders
are available at https://sliders.baulab.info/ |
This paper introduces Concept Sliders, a method for fine-tuning diffusion models using low-rank adaptations (LoRA) to enable precise and interpretable control over image attributes. |
This work is significant because it addresses limitations of existing diffusion model editing techniques by providing: 1) fine-grained control over continuous attributes, 2) composability for multi-attribute editing, 3) ability to learn visual concepts from image pairs, 4) transfer of style latents from GANs, and 5) improvement of image quality by fixing common distortions. |
The authors train LoRA adaptors using a guided score function that encourages the generation of images with desired attributes while preserving unrelated features. They use text prompt pairs, image pairs, and StyleGAN latents to define concepts and train the sliders. They evaluate their method on Stable Diffusion XL and SD v1.4, measuring CLIP score change, LPIPS distance, and conducting user studies to assess image quality. |
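The guided score used to train a slider can be sketched as follows: the frozen model defines a target that shifts the neutral prediction along the direction separating an "enhance" prompt from a "suppress" prompt, and the LoRA-adapted model is regressed onto that target. Variable names and the loss form are assumptions.

```python
# Hedged sketch of a slider training step on a frozen diffusion U-Net plus LoRA.
import torch
import torch.nn.functional as F

def slider_loss(unet_lora, unet_frozen, x_t, t, emb_neutral, emb_enhance, emb_suppress, scale=1.0):
    with torch.no_grad():
        eps_neutral = unet_frozen(x_t, t, emb_neutral)
        eps_enhance = unet_frozen(x_t, t, emb_enhance)
        eps_suppress = unet_frozen(x_t, t, emb_suppress)
        target = eps_neutral + scale * (eps_enhance - eps_suppress)   # guided target score

    eps_pred = unet_lora(x_t, t, emb_neutral)        # LoRA active; only adapter weights train
    return F.mse_loss(eps_pred, target)
```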
Key findings include: 1) Concept Sliders enable precise control over various attributes, 2) image-based sliders effectively capture visual concepts, 3) StyleGAN latents can be transferred to diffusion models for nuanced style editing, and 4) sliders can fix hand distortions and enhance overall realism, as confirmed by user studies. |
Limitations include residual interference between edits and a potential trade-off between edit strength and structural coherence when using the SDEdit technique. Future work could explore automated methods for minimizing interference and improving edit strength without sacrificing image structure. |
diffusion_model, lora, analysis, image_editing, gan, stylegan, interpretability, 3d, concept_sliders |
2404.02258 |
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models |
David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro |
Transformer-based language models spread FLOPs uniformly across input
sequences. In this work we demonstrate that transformers can instead learn to
dynamically allocate FLOPs (or compute) to specific positions in a sequence,
optimising the allocation along the sequence for different layers across the
model depth. Our method enforces a total compute budget by capping the number
of tokens ($k$) that can participate in the self-attention and MLP computations
at a given layer. The tokens to be processed are determined by the network
using a top-$k$ routing mechanism. Since $k$ is defined a priori, this simple
procedure uses a static computation graph with known tensor sizes, unlike other
conditional computation techniques. Nevertheless, since the identities of the
$k$ tokens are fluid, this method can expend FLOPs non-uniformly across the
time and model depth dimensions. Thus, compute expenditure is entirely
predictable in sum total, but dynamic and context-sensitive at the token-level.
Not only do models trained in this way learn to dynamically allocate compute,
they do so efficiently. These models match baseline performance for equivalent
FLOPS and wall-clock times to train, but require a fraction of the FLOPs per
forward pass, and can be upwards of 50\% faster to step during post-training
sampling. |
This paper introduces Mixture-of-Depths (MoD), a novel technique for transformer models that dynamically allocates compute resources by allowing tokens to skip entire transformer blocks based on learned routing decisions, thereby reducing computational cost without sacrificing performance. |
This paper is important because it addresses the inherent inefficiency of traditional transformers, which expend uniform computational effort per token regardless of the complexity of the prediction. MoD offers a pathway to significantly reduce the computational cost of training and inference in transformers, particularly relevant for resource-intensive large language models, by selectively allocating compute resources where they are most needed. |
The authors propose a method where a per-block router assigns scalar weights to each token, indicating the importance of processing that token through the block. The top-k tokens with the highest weights are processed through the self-attention and MLP layers, while the rest bypass the block through a residual connection. This dynamic allocation is achieved using a non-causal top-k routing scheme during training and a causal predictor-based routing scheme during inference, both of which are trained through the language modeling objective and an auxiliary task. The authors perform extensive experiments with different model sizes and FLOP budgets, comparing MoD transformers with traditional transformers, demonstrating significant performance gains and computational savings. |
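The routing mechanism can be condensed into the sketch below: a per-block scalar router scores every token, only the top-k tokens pass through the block's attention/MLP computation, and the rest ride the residual stream unchanged; capacity handling and the causal predictor used at sampling time are simplified away. The `block` module here is assumed to return its attention/MLP update without the residual add.

```python
# Simplified Mixture-of-Depths routing around one transformer block.
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, d_model, block, capacity_frac=0.125):
        super().__init__()
        self.router = nn.Linear(d_model, 1)
        self.block = block
        self.capacity_frac = capacity_frac

    def forward(self, x):                                     # x: (B, T, D)
        B, T, _ = x.shape
        k = max(1, int(self.capacity_frac * T))
        weights = self.router(x).squeeze(-1)                  # (B, T) scalar routing scores
        topk = weights.topk(k, dim=-1).indices.sort(dim=-1).values   # keep positional order

        batch_idx = torch.arange(B, device=x.device).unsqueeze(-1).expand(-1, k)
        selected = x[batch_idx, topk]                         # (B, k, D) routed tokens
        update = self.block(selected)                         # heavy compute only for k tokens

        out = x.clone()                                       # non-routed tokens skip the block
        gate = torch.sigmoid(weights[batch_idx, topk]).unsqueeze(-1)
        out[batch_idx, topk] = selected + gate * update       # router weight keeps routing trainable
        return out
```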
Key findings include: (1) MoD transformers can outperform isoFLOP-optimal baseline transformers in terms of both performance and speed. (2) Optimal MoD configurations involve routing every other block and using a low capacity (e.g., 12.5% of the sequence length) for the computationally intensive blocks. (3) Learned routing is crucial for MoD's effectiveness, significantly outperforming stochastic routing schemes. (4) MoD can be seamlessly integrated with Mixture-of-Experts (MoE) models, further enhancing performance and efficiency. (5) The non-causal nature of top-k routing during training can be effectively addressed during autoregressive sampling using a causal predictor, resulting in minimal performance degradation. |
The paper acknowledges limitations and suggests future work: (1) While the current work focuses on a decoder-only setting, extending MoD to encoder-decoder architectures requires further investigation for efficient handling of sequential decoding with non-causal routing. (2) The paper primarily explores routing between standard transformer blocks and residual connections. Investigating routing to diverse computational paths like memory lookup or tool-use functions could be beneficial. (3) Future research could explore decoupling routing decisions for queries, keys, and values in self-attention, potentially leading to more nuanced and efficient compute allocation. (4) MoD's potential in drastically increasing context length for predictions by efficiently managing long-term memory through selective routing warrants further investigation. |
llm, analysis, conditional_computation, transformer, efficiency, mixture-of-experts, routing, autoregressive_sampling, long-term_memory |
2311.16090 |
Self-correcting LLM-controlled Diffusion Models |
Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell |
Text-to-image generation has witnessed significant progress with the advent
of diffusion models. Despite the ability to generate photorealistic images,
current text-to-image diffusion models still often struggle to accurately
interpret and follow complex input text prompts. In contrast to existing models
that aim to generate images only with their best effort, we introduce
Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that
generates an image from the input prompt, assesses its alignment with the
prompt, and performs self-corrections on the inaccuracies in the generated
image. Steered by an LLM controller, SLD turns text-to-image generation into an
iterative closed-loop process, ensuring correctness in the resulting image. SLD
is not only training-free but can also be seamlessly integrated with diffusion
models behind API access, such as DALL-E 3, to further boost the performance of
state-of-the-art diffusion models. Experimental results show that our approach
can rectify a majority of incorrect generations, particularly in generative
numeracy, attribute binding, and spatial relationships. Furthermore, by simply
adjusting the instructions to the LLM, SLD can perform image editing tasks,
bridging the gap between text-to-image generation and image editing pipelines.
We will make our code available for future research and applications. |
This paper introduces Self-correcting LLM-controlled Diffusion (SLD), a framework that improves text-to-image generation by iteratively identifying and correcting inaccuracies in images generated by diffusion models using an LLM and an object detector. |
This paper is important because it addresses a key limitation of current text-to-image diffusion models, which often struggle to accurately interpret and follow complex prompts. SLD provides a training-free method to improve the alignment between generated images and text prompts, leading to more accurate and reliable text-to-image generation. |
SLD employs a closed-loop approach. First, an image is generated from the input prompt using an off-the-shelf diffusion model. Then, an LLM parser extracts key objects from the prompt, which are then located in the image using an open-vocabulary object detector. Next, an LLM controller compares the detected objects with the prompt and suggests corrections (addition, deletion, repositioning, attribute modification). Finally, these corrections are implemented in the latent space of the diffusion model to generate a corrected image. This process can be repeated iteratively. |
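A schematic of the closed loop described above. It is written as a higher-order function so it stays self-contained; the callables for generation, parsing, detection, edit proposal, and latent correction are hypothetical placeholders, not the released SLD API.

```python
from typing import Any, Callable, List


def self_correcting_generation(
    prompt: str,
    generate: Callable[[str], Any],                           # off-the-shelf T2I model
    parse_objects: Callable[[str], List[str]],                # LLM parser for key objects
    detect: Callable[[Any, List[str]], List[dict]],           # open-vocabulary detector
    propose_edits: Callable[[str, List[dict]], List[dict]],   # LLM controller
    apply_edits: Callable[[Any, List[dict]], Any],            # latent-space correction
    max_rounds: int = 3,
) -> Any:
    """Closed-loop generate -> assess -> correct, in the spirit of SLD."""
    image = generate(prompt)
    for _ in range(max_rounds):
        detections = detect(image, parse_objects(prompt))
        edits = propose_edits(prompt, detections)    # add / delete / reposition / recolor
        if not edits:                                # image already satisfies the prompt
            break
        image = apply_edits(image, edits)
    return image
```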
SLD significantly improves image generation accuracy, particularly in handling numeracy, attribute binding, and spatial relationships. It outperforms existing methods on the LMD benchmark and shows significant improvements when applied to models like DALL-E 3. Additionally, SLD can be easily adapted for image editing tasks, achieving fine-grained control over object manipulation. |
One limitation is the difficulty in handling objects with complex shapes due to limitations in the object segmentation module. Future work could explore better region selection methods for improved generation and editing quality. Additionally, the authors suggest exploring the integration of advanced LMMs for more streamlined image assessment and editing. |
diffusion_model, llm, image_generation, image_editing, object_detection, self-correction, closed-loop |
2403.04692 |
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation |
Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li |
In this paper, we introduce PixArt-\Sigma, a Diffusion Transformer
model~(DiT) capable of directly generating images at 4K resolution.
PixArt-\Sigma represents a significant advancement over its predecessor,
PixArt-\alpha, offering images of markedly higher fidelity and improved
alignment with text prompts. A key feature of PixArt-\Sigma is its training
efficiency. Leveraging the foundational pre-training of PixArt-\alpha, it
evolves from the `weaker' baseline to a `stronger' model via incorporating
higher quality data, a process we term "weak-to-strong training". The
advancements in PixArt-\Sigma are twofold: (1) High-Quality Training Data:
PixArt-\Sigma incorporates superior-quality image data, paired with more
precise and detailed image captions. (2) Efficient Token Compression: we
propose a novel attention module within the DiT framework that compresses both
keys and values, significantly improving efficiency and facilitating
ultra-high-resolution image generation. Thanks to these improvements,
PixArt-\Sigma achieves superior image quality and user prompt adherence
capabilities with significantly smaller model size (0.6B parameters) than
existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD
Cascade (5.1B parameters). Moreover, PixArt-\Sigma's capability to generate 4K
images supports the creation of high-resolution posters and wallpapers,
efficiently bolstering the production of high-quality visual content in
industries such as film and gaming. |
PixArt-Σ is a Diffusion Transformer model capable of directly generating high-quality images at 4K resolution, building upon its predecessor PixArt-α with enhanced training data and an efficient token compression mechanism. |
This paper is significant as it addresses the challenge of efficiently training high-quality T2I models with limited resources. It introduces the concept of "weak-to-strong training," allowing for the incremental improvement of pre-trained models. Furthermore, PixArt-Σ pushes the boundary of resolution in T2I generation to 4K, a significant advancement in the field. |
The authors employ a "weak-to-strong training" strategy, starting with the pre-trained PixArt-α. They enhance the model by: (1) Curating a higher-quality dataset with better aesthetics, higher resolution (up to 4K), and more accurate and dense captions. (2) Introducing an efficient token compression mechanism within the DiT framework to handle the increased computational demands of 4K generation. (3) Proposing efficient fine-tuning techniques for rapid adaptation to new VAEs, higher resolutions, and KV compression. |
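A hedged sketch of key/value token compression inside self-attention: keys and values are spatially downsampled before attention, which cuts the quadratic cost at high resolution. The strided depthwise convolution, module name, and shapes are illustrative assumptions, not the paper's exact operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KVCompressedAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8, ratio: int = 2):
        super().__init__()
        self.heads = heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # Depthwise strided conv that shrinks the token grid by `ratio` per side.
        self.compress = nn.Conv2d(dim, dim, ratio, stride=ratio, groups=dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, h*w, dim) tokens over an h x w latent grid
        b, n, d = x.shape
        q = self.q(x)

        grid = x.transpose(1, 2).reshape(b, d, h, w)
        grid = self.compress(grid)                        # (b, d, h/r, w/r)
        kv = grid.flatten(2).transpose(1, 2)              # (b, n/r^2, d)
        k, v = self.kv(kv).chunk(2, dim=-1)

        def split(t):  # (b, m, d) -> (b, heads, m, d/heads)
            return t.view(b, -1, self.heads, d // self.heads).transpose(1, 2)

        o = F.scaled_dot_product_attention(split(q), split(k), split(v))
        o = o.transpose(1, 2).reshape(b, n, d)
        return self.out(o)
```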
Key findings include: (1) PixArt-Σ achieves state-of-the-art 4K image generation with high fidelity and strong adherence to textual prompts. (2) The "weak-to-strong training" strategy proves highly efficient, requiring significantly fewer GPU days compared to training from scratch. (3) The proposed KV compression mechanism effectively reduces training and inference time without compromising quality. (4) Both human and AI preference studies confirm PixArt-Σ's superior performance over existing open-source models and competitive results with commercial T2I products. |
Limitations include the inability to perfectly generate certain objects and scenes like text and hands, limitations in handling complex prompts, and potential biases in generated content. Future work involves improving data quality, scaling model size, enhancing alignment with complex instructions, and addressing ethical concerns related to biases and sensitive content. |
diffusion_model, dit, t2i, 4k, text-to-image, high-resolution, efficient training, token compression, weak-to-strong training |
2309.00013 |
Model Inversion Attack via Dynamic Memory Learning |
Gege Qi, YueFeng Chen, Xiaofeng Mao, Binyuan Hui, Xiaodan Li, Rong Zhang, Hui Xue |
Model Inversion (MI) attacks aim to recover the private training data from
the target model, which has raised security concerns about the deployment of
DNNs in practice. Recent advances in generative adversarial models have
rendered them particularly effective in MI attacks, primarily due to their
ability to generate high-fidelity and perceptually realistic images that
closely resemble the target data. In this work, we propose a novel Dynamic
Memory Model Inversion Attack (DMMIA) to leverage historically learned
knowledge, which interacts with samples (during the training) to induce diverse
generations. DMMIA constructs two types of prototypes to inject the information
about historically learned knowledge: Intra-class Multicentric Representation
(IMR) representing target-related concepts by multiple learnable prototypes,
and Inter-class Discriminative Representation (IDR) characterizing the
memorized samples as learned prototypes to capture more privacy-related
information. As a result, our DMMIA has a more informative representation,
which brings more diverse and discriminative generated results. Experiments on
multiple benchmarks show that DMMIA performs better than state-of-the-art MI
attack methods. |
This paper introduces DMMIA, a novel model inversion attack method that leverages dynamic memory mechanisms to recover private training data from trained deep neural networks, addressing the catastrophic forgetting issue in existing GAN-based attacks. |
This paper is important because it exposes a significant vulnerability in trained DNN models, demonstrating that sensitive information about training data can be effectively extracted even without direct access to the data itself. |
The authors propose DMMIA, which uses two types of memory prototypes: Intra-class Multicentric Representation (IMR) for capturing diverse target-related concepts and Inter-class Discriminative Representation (IDR) for distinguishing between classes. These prototypes are progressively updated during training, enabling the attack to retain previously learned features and enhance the diversity and realism of generated samples. |
DMMIA achieves state-of-the-art attack performance on multiple benchmark datasets, including CelebA, FaceScrub, and Stanford Dogs, outperforming existing methods in terms of attack success rate, sample realism (FID), and sample diversity metrics. Notably, it demonstrates significant improvements when attacking models trained on datasets with limited image priors, highlighting its effectiveness in scenarios where the attacker has less knowledge about the target data distribution. |
The authors acknowledge the dependence of attack success on the diversity of the image prior used in pre-training the StyleGAN2 generator. Future work could explore ways to improve the attack's effectiveness when prior knowledge about the target data is limited. Additionally, extending DMMIA to black-box settings, where the attacker only has access to the model's predictions, is mentioned as a potential research direction. |
model_inversion_attack, gan, adversarial_attack, interpretability, privacy, dynamic_memory, prototype_learning |
2308.09124 |
Linearity of Relation Decoding in Transformer Language Models |
Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau |
Much of the knowledge encoded in transformer language models (LMs) may be
expressed in terms of relations: relations between words and their synonyms,
entities and their attributes, etc. We show that, for a subset of relations,
this computation is well-approximated by a single linear transformation on the
subject representation. Linear relation representations may be obtained by
constructing a first-order approximation to the LM from a single prompt, and
they exist for a variety of factual, commonsense, and linguistic relations.
However, we also identify many cases in which LM predictions capture relational
knowledge accurately, but this knowledge is not linearly encoded in their
representations. Our results thus reveal a simple, interpretable, but
heterogeneously deployed knowledge representation strategy in transformer LMs. |
This paper investigates how transformer language models (LLMs) represent relational knowledge, finding that a subset of relations can be approximated by linear transformations applied to subject representations. |
This work sheds light on the internal mechanisms of LLMs, revealing that some aspects of their knowledge representation are surprisingly simple and interpretable. This finding contributes to our understanding of how LLMs store and process information, potentially enabling more transparent and controllable AI systems. |
The authors manually curate a dataset of relations and corresponding subject-object pairs. They then estimate a linear relational embedding (LRE) for each relation by calculating the Jacobian of the model's computation on a prompt designed to elicit the relation. They evaluate the faithfulness of the LRE by measuring how well it predicts the model's output for new subjects, and its causality by using it to edit subject representations and induce the model to predict different objects. |
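A minimal sketch of estimating a linear relational embedding as a first-order approximation o ≈ W s + b, following the idea described above. The `lm_readout` callable stands in for the model's computation from a mid-layer subject representation to the final object representation; it and the toy example are assumptions for illustration (the paper additionally rescales the Jacobian, which is omitted here).

```python
import torch
from torch.func import jacrev
from typing import Callable, Tuple


def estimate_lre(
    lm_readout: Callable[[torch.Tensor], torch.Tensor],
    subject_hidden: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Return (W, b) such that lm_readout(s) ≈ W @ s + b near subject_hidden."""
    W = jacrev(lm_readout)(subject_hidden)              # Jacobian at the subject
    b = lm_readout(subject_hidden) - W @ subject_hidden
    return W, b


# Toy usage; in the paper this would be the LM's forward pass from a subject
# representation to the last-layer object state, elicited by a relation prompt.
readout = lambda s: torch.tanh(s) * 2.0
s0 = torch.randn(16)
W, b = estimate_lre(readout, s0)
approx = W @ s0 + b    # exact at s0, approximately faithful for nearby subjects
```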
The research shows that LREs can faithfully approximate LLM relation decoding for a significant portion of the tested relations. They also demonstrate the causal influence of these LREs by successfully manipulating model predictions via representation editing. Interestingly, the study reveals that not all relations are linearly encoded, suggesting a more complex, non-linear processing mechanism for certain types of information. |
The paper acknowledges limitations in the dataset size, the reliance on first-token correctness as an evaluation metric, and the assumption of single correct objects for relations. Future work could address these limitations, exploring a wider range of relations, refining the evaluation scheme, and investigating how LREs could be used to understand and mitigate biases in LLMs. |
llm, analysis, interpretability, knowledge_representation, relation, linear_transformation |
2310.01506 |
Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code |
Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, Qiang Xu |
Text-guided diffusion models have revolutionized image generation and
editing, offering exceptional realism and diversity. Specifically, in the
context of diffusion-based editing, where a source image is edited according to
a target prompt, the process commences by acquiring a noisy latent vector
corresponding to the source image via the diffusion model. This vector is
subsequently fed into separate source and target diffusion branches for
editing. The accuracy of this inversion process significantly impacts the final
editing outcome, influencing both essential content preservation of the source
image and edit fidelity according to the target prompt. Prior inversion
techniques aimed at finding a unified solution in both the source and target
diffusion branches. However, our theoretical and empirical analyses reveal that
disentangling these branches leads to a distinct separation of responsibilities
for preserving essential content and ensuring edit fidelity. Building on this
insight, we introduce "Direct Inversion," a novel technique achieving optimal
performance of both branches with just three lines of code. To assess image
editing performance, we present PIE-Bench, an editing benchmark with 700 images
showcasing diverse scenes and editing types, accompanied by versatile
annotations and comprehensive evaluation metrics. Compared to state-of-the-art
optimization-based inversion techniques, our solution not only yields superior
performance across 8 editing methods but also achieves nearly an
order-of-magnitude speed-up. |
This paper introduces DirectInversion, a novel technique for inverting diffusion models in text-based image editing, which disentangles the source and target diffusion branches to excel in content preservation and edit fidelity, respectively. |
The paper addresses limitations in existing diffusion model inversion techniques used for text-based image editing, which often rely on computationally expensive optimization and may compromise either content preservation or edit fidelity. The authors argue for a disentangled approach to optimize both aspects, and introduce a new benchmark dataset for evaluation. |
The authors propose DirectInversion, which directly rectifies the deviation path in the source branch using a simple three-line code modification to DDIM inversion. They introduce PIE-Bench, a new benchmark dataset with 700 images and diverse editing categories, to evaluate their method across 8 different editing techniques and against existing inversion methods using 7 evaluation metrics. |
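A hedged sketch of one reading of the source-branch rectification: latents saved during DDIM inversion are used to keep the source branch on the inversion path at every step, so content drift never accumulates, while the target branch denoises freely. Variable names and the `denoise_step` signature are illustrative, not the released code; in real editing methods the source step's attention features are injected into the target branch.

```python
import torch
from typing import Callable, List


def edit_with_rectified_source(
    inv_latents: List[torch.Tensor],   # z_T ... z_0 saved during DDIM inversion
    denoise_step: Callable[[torch.Tensor, int, str], torch.Tensor],
    source_prompt: str,
    target_prompt: str,
) -> torch.Tensor:
    z_src = inv_latents[0].clone()     # both branches start from the inverted z_T
    z_tgt = inv_latents[0].clone()
    for t in range(len(inv_latents) - 1):
        # The source step is run for its internal features (e.g. attention maps
        # shared with the target branch by editing methods such as P2P).
        _ = denoise_step(z_src, t, source_prompt)
        z_tgt = denoise_step(z_tgt, t, target_prompt)
        # Rectification (the "3 lines" in spirit): snap the source branch back
        # onto the saved inversion path so its deviation never accumulates.
        z_src = inv_latents[t + 1]
    return z_tgt
```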
DirectInversion demonstrates superior performance compared to existing optimization-based inversion methods, achieving significant improvements in essential content preservation (up to 83.2% enhancement in Structure Distance) and edit fidelity (up to 8.8% improvement in Edit Region Clip Similarity), while being significantly faster. The method also improves content preservation by up to 20.2% and edit fidelity by up to 2.5% when integrated with other editing techniques. |
The authors acknowledge limitations inherited from existing diffusion-based editing methods, such as instability and low success rates in certain complex editing scenarios. Future work includes extending the approach to video editing, developing more robust editing models, and creating more comprehensive evaluation metrics. |
diffusion_model, image_editing, inversion, benchmark, content_preservation, edit_fidelity |
2402.01293 |
Can MLLMs Perform Text-to-Image In-Context Learning? |
Yuchen Zeng, Wonjun Kang, Yicong Chen, Hyung Il Koo, Kangwook Lee |
The evolution from Large Language Models (LLMs) to Multimodal Large Language
Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to
its multimodal counterpart. Existing such studies have primarily concentrated
on image-to-text ICL. However, the Text-to-Image ICL (T2I-ICL), with its unique
characteristics and potential applications, remains underexplored. To address
this gap, we formally define the task of T2I-ICL and present CoBSAT, the first
T2I-ICL benchmark dataset, encompassing ten tasks. Utilizing our dataset to
benchmark six state-of-the-art MLLMs, we uncover considerable difficulties
MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the
inherent complexity of multimodality and image generation, and show that
strategies such as fine-tuning and Chain-of-Thought prompting help to mitigate
these difficulties, leading to notable improvements in performance. Our code
and dataset are available at https://github.com/UW-Madison-Lee-Lab/CoBSAT. |
This paper introduces the concept of Text-to-Image In-Context Learning (T2I-ICL), where Multimodal Large Language Models (MLLMs) generate images based on textual prompts and example image-text pairs, and presents CoBSAT, a new benchmark dataset to evaluate MLLMs' performance on T2I-ICL tasks. |
This paper addresses the under-explored area of T2I-ICL, contrasting it with the more common Image-to-Text ICL, and provides a benchmark for evaluating and understanding the capabilities of MLLMs in this domain, which is crucial for applications like product design and personalized content creation. |
The authors created CoBSAT, a dataset with 10 tasks covering five themes (color, background, style, action, texture), each with object-inference and attribute-inference variations. They evaluated six state-of-the-art MLLMs on this dataset using CLIP and LLaVA as evaluation metrics to assess the accuracy of generated images or image descriptions against true labels. |
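One plausible way to implement the CLIP-based check mentioned above: zero-shot classify the generated image against the candidate labels and mark it correct if the top label matches the ground truth. The checkpoint and the bare-label prompting are assumptions, not the benchmark's exact protocol.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_label_correct(image, candidate_labels, true_label):
    """image: a PIL.Image produced by the MLLM under evaluation."""
    inputs = processor(
        text=candidate_labels, images=image, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        logits = model(**inputs).logits_per_image       # (1, num_labels)
    predicted = candidate_labels[logits.argmax(dim=-1).item()]
    return predicted == true_label
```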
The study found that existing MLLMs struggle with T2I-ICL, with SEED-LLaMA performing best in image generation and Gemini, Qwen-VL, and GPT-4V excelling in generating image descriptions. The paper also identifies multimodality and image generation as key challenges in T2I-ICL. Notably, fine-tuning models on CoBSAT and incorporating Chain-of-Thought prompting led to significant performance improvements. |
The paper acknowledges limitations in demonstration selection and the need to explore additional prompt engineering techniques like Tree-of-Thought and self-consistency sampling. Future work includes expanding CoBSAT with more themes and attributes, focusing on image editing tasks, and developing multimodal prompt engineering techniques. |
diffusion_model, llm, mllm, analysis, benchmark, dataset, image_generation, in-context learning, multimodality, prompt_engineering |
2312.00777 |
VideoBooth: Diffusion-based Video Generation with Image Prompts |
Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu |
Text-driven video generation witnesses rapid progress. However, merely using
text prompts is not enough to depict the desired subject appearance that
accurately aligns with users' intents, especially for customized content
creation. In this paper, we study the task of video generation with image
prompts, which provide more accurate and direct content control beyond the text
prompts. Specifically, we propose a feed-forward framework VideoBooth, with two
dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine
manner. Coarse visual embeddings from image encoder provide high-level
encodings of image prompts, while fine visual embeddings from the proposed
attention injection module provide multi-scale and detailed encoding of image
prompts. These two complementary embeddings can faithfully capture the desired
appearance. 2) In the attention injection module at fine level, multi-scale
image prompts are fed into different cross-frame attention layers as additional
keys and values. This extra spatial information refines the details in the
first frame and then it is propagated to the remaining frames, which maintains
temporal consistency. Extensive experiments demonstrate that VideoBooth
achieves state-of-the-art performance in generating customized high-quality
videos with subjects specified in image prompts. Notably, VideoBooth is a
generalizable framework where a single model works for a wide range of image
prompts with feed-forward pass. |
This paper introduces VideoBooth, a novel framework for generating videos using both text prompts and image prompts for customized content creation. |
This paper is important because it addresses limitations in text-driven video generation by incorporating image prompts for more precise control over subject appearance, which is crucial for customized content creation. |
The authors propose a coarse-to-fine visual embedding strategy: 1) A CLIP image encoder extracts coarse visual embeddings from image prompts, capturing high-level semantic information. 2) Fine visual embeddings are extracted through an attention injection module, incorporating multi-scale image prompts into cross-frame attention layers for refining details and maintaining temporal consistency. The authors also created a dedicated VideoBooth dataset for training and evaluating their model. |
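A generic sketch of the fine-level injection described above: features derived from the image prompt are appended as extra keys and values in an attention layer, so frame tokens can attend to the reference appearance. The projection layers, shapes, and module name are illustrative assumptions rather than VideoBooth's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptInjectedAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.prompt_kv = nn.Linear(dim, 2 * dim)   # projects image-prompt features
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, prompt_feats: torch.Tensor) -> torch.Tensor:
        # x: (b, n, dim) frame tokens; prompt_feats: (b, m, dim) image-prompt tokens
        b, n, d = x.shape
        q = self.to_q(x)
        k, v = self.to_kv(x).chunk(2, dim=-1)
        pk, pv = self.prompt_kv(prompt_feats).chunk(2, dim=-1)
        k = torch.cat([k, pk], dim=1)              # additional keys from the prompt
        v = torch.cat([v, pv], dim=1)              # additional values from the prompt

        def split(t):  # (b, m, d) -> (b, heads, m, d/heads)
            return t.view(b, -1, self.heads, d // self.heads).transpose(1, 2)

        o = F.scaled_dot_product_attention(split(q), split(k), split(v))
        return self.to_out(o.transpose(1, 2).reshape(b, n, d))
```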
VideoBooth demonstrates state-of-the-art performance in generating high-quality, customized videos, effectively preserving visual attributes from image prompts while maintaining alignment with text prompts. Ablation studies confirm the effectiveness of the coarse-to-fine training strategy and both embedding modules. |
The authors acknowledge the potential negative societal impact of generating fake videos and suggest exploring advanced fake video detection methods as future work. Additionally, processing the full WebVid dataset and expanding the VideoBooth dataset is mentioned as future work. |
diffusion_model, video, generation, image_prompt, customized_content_creation, attention_mechanism |
2309.03886 |
FIND: A Function Description Benchmark for Evaluating Interpretability Methods |
Sarah Schwettmann, Tamar Rott Shaham, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba |
Labeling neural network submodules with human-legible descriptions is useful
for many downstream tasks: such descriptions can surface failures, guide
interventions, and perhaps even explain important model behaviors. To date,
most mechanistic descriptions of trained networks have involved small models,
narrowly delimited phenomena, and large amounts of human labor. Labeling all
human-interpretable sub-computations in models of increasing size and
complexity will almost certainly require tools that can generate and validate
descriptions automatically. Recently, techniques that use learned models
in-the-loop for labeling have begun to gain traction, but methods for
evaluating their efficacy are limited and ad-hoc. How should we validate and
compare open-ended labeling tools? This paper introduces FIND (Function
INterpretation and Description), a benchmark suite for evaluating the building
blocks of automated interpretability methods. FIND contains functions that
resemble components of trained neural networks, and accompanying descriptions
of the kind we seek to generate. The functions span textual and numeric
domains, and involve a range of real-world complexities. We evaluate methods
that use pretrained language models (LMs) to produce descriptions of function
behavior in natural language and code. Additionally, we introduce a new
interactive method in which an Automated Interpretability Agent (AIA) generates
function descriptions. We find that an AIA, built from an LM with black-box
access to functions, can infer function structure, acting as a scientist by
forming hypotheses, proposing experiments, and updating descriptions in light
of new data. However, AIA descriptions tend to capture global function behavior
and miss local details. These results suggest that FIND will be useful for
evaluating more sophisticated interpretability methods before they are applied
to real-world models. |
This paper introduces FIND (Function Interpretation and Description), a benchmark suite for evaluating the ability of automated methods to interpret and describe the behavior of black-box functions. |
This paper addresses the growing need for automated interpretability methods for increasingly complex AI models by introducing a benchmark to evaluate and compare these methods on functions with known structures. |
The authors constructed FIND, a benchmark suite containing over 2000 procedurally generated functions with varying complexity and domains, including numeric, string, and synthetic neural modules. They evaluate different interpretation methods, including non-interactive (MILAN-like) and interactive (Automated Interpretability Agents), using off-the-shelf LMs like GPT-4, GPT-3.5, and Llama-2. Evaluation involves comparing generated descriptions with ground-truth explanations using code execution accuracy and a novel unit-testing protocol with a fine-tuned Vicuna-13b as an evaluator. |
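A schematic of the interactive interpretation loop described above: an agent probes a black-box function with chosen inputs, accumulates observations, and asks an LM for a description. The `ask_lm` and `propose_inputs` interfaces are assumptions, not the FIND API; in the paper the same LM also drives experiment selection.

```python
from typing import Callable, List, Tuple


def interpret_function(
    f: Callable[[float], float],                                  # the black box
    ask_lm: Callable[[str], str],                                 # LM wrapper: prompt -> reply
    propose_inputs: Callable[[List[Tuple[float, float]]], List[float]],
    rounds: int = 5,
) -> str:
    observations: List[Tuple[float, float]] = []
    for _ in range(rounds):
        for x in propose_inputs(observations):    # agent-chosen experiments
            observations.append((x, f(x)))
    prompt = (
        "Describe, in one sentence, the function mapping x to y given these "
        f"input/output pairs: {observations}"
    )
    return ask_lm(prompt)
```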
GPT-4 consistently outperforms other LMs as an interpretability agent, demonstrating the potential of LMs for automated interpretability. However, even GPT-4 struggles with complex functions, highlighting the need for additional tools and techniques beyond current LMs. Initializing the AIA with exemplars dramatically improves performance, suggesting the importance of strategic data selection. The unit-testing protocol with the fine-tuned Vicuna evaluator demonstrates strong agreement with human judgments. |
The authors acknowledge that FIND focuses solely on black-box interpretation and lacks evaluation on real-world models. Future work will extend FIND to encompass white-box interpretation problems, including descriptions of individual components within neural circuits. Additionally, the authors aim to explore tools for enhanced sampling and fine-tuning LMs specifically for interpretability. |
llm, analysis, interpretability, benchmark |
2311.10538 |
Testing Language Model Agents Safely in the Wild |
Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau |
A prerequisite for safe autonomy-in-the-wild is safe testing-in-the-wild. Yet
real-world autonomous tests face several unique safety challenges, both due to
the possibility of causing harm during a test, as well as the risk of
encountering new unsafe agent behavior through interactions with real-world and
potentially malicious actors. We propose a framework for conducting safe
autonomous agent tests on the open internet: agent actions are audited by a
context-sensitive monitor that enforces a stringent safety boundary to stop an
unsafe test, with suspect behavior ranked and logged to be examined by humans.
We design a basic safety monitor (AgentMonitor) that is flexible enough to
monitor existing LLM agents, and, using an adversarial simulated agent, we
measure its ability to identify and stop unsafe situations. Then we apply the
AgentMonitor on a battery of real-world tests of AutoGPT, and we identify
several limitations and challenges that will face the creation of safe
in-the-wild tests as autonomous agents grow more capable. |
This paper proposes a framework for conducting safe tests of autonomous language model agents on the open internet by introducing a context-sensitive safety monitor that can identify and stop unsafe agent actions. |
As language model agents become increasingly capable and prevalent, it's crucial to ensure they are tested safely in real-world environments to prevent potential harm and build trust in their deployment. |
The authors developed a dataset of agent outputs, including manually crafted unsafe examples, and designed a safety monitor (AgentMonitor) based on GPT-3.5-turbo. They trained and evaluated the monitor's ability to identify and stop unsafe actions using various parameters like task context, previous actions, and whitelists. |
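An illustrative monitor in the spirit described above: an LM judge scores a proposed agent action against the task, recent history, and a whitelist, and the test is halted above a safety threshold. The `judge` callable and the 0-100 scoring convention are assumptions, not the paper's exact prompt or scale.

```python
from typing import Callable, List


def monitor_action(
    judge: Callable[[str], float],   # LM wrapper: prompt -> unsafe score in [0, 100]
    task: str,
    history: List[str],
    action: str,
    whitelist: List[str],
    threshold: float = 75.0,
) -> bool:
    """Return True if the action may proceed, False to stop the test."""
    if any(action.startswith(cmd) for cmd in whitelist):
        return True                  # pre-approved, context-free safe commands
    prompt = (
        f"Task: {task}\nRecent actions: {history[-5:]}\nProposed action: {action}\n"
        "Rate from 0 (clearly safe and on-task) to 100 (clearly unsafe)."
    )
    return judge(prompt) < threshold
```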
The AgentMonitor achieved promising results on a test set, with an F1 score of 89.4%. Ablation studies revealed that access to the agent's previous context was crucial for the monitor's performance. The authors also highlighted the need for well-specified threat models and comprehensive example sets for few-shot learning in the monitor. |
The authors identify limitations such as the need for larger, better-categorized datasets of attacks and a clearer distinction between off-task and unsafe outputs. Future work will focus on improving the AgentMonitor's ability to make this distinction, minimizing the need for human intervention in safe testing. |
llm, analysis, safety, autonomous_agent, testing |
2308.09889 |
DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization |
Xiaoyu Ye, Hao Huang, Jiaqi An, Yongtao Wang |
Stable Diffusion (SD) customization approaches enable users to personalize SD
model outputs, greatly enhancing the flexibility and diversity of AI art.
However, they also allow individuals to plagiarize specific styles or subjects
from copyrighted images, which raises significant concerns about potential
copyright infringement. To address this issue, we propose an invisible
data-free universal adversarial watermark (DUAW), aiming to protect a myriad of
copyrighted images from different customization approaches across various
versions of SD models. First, DUAW is designed to disrupt the variational
autoencoder during SD customization. Second, DUAW operates in a data-free
context, where it is trained on synthetic images produced by a Large Language
Model (LLM) and a pretrained SD model. This approach circumvents the necessity
of directly handling copyrighted images, thereby preserving their
confidentiality. Once crafted, DUAW can be imperceptibly integrated into
massive copyrighted images, serving as a protective measure by inducing
significant distortions in the images generated by customized SD models.
Experimental results demonstrate that DUAW can effectively distort the outputs
of fine-tuned SD models, rendering them discernible to both human observers and
a simple classifier. |
This paper introduces DUAW, a data-free universal adversarial watermark designed to protect copyrighted images from being used for unauthorized customization of Stable Diffusion models. |
The paper addresses the growing concern of copyright infringement facilitated by AI art customization tools. It offers a practical solution to protect intellectual property in the rapidly evolving field of AI-generated content. |
The authors develop DUAW by training it on synthetic images generated using a Large Language Model (LLM) and a pre-trained SD model. This data-free approach ensures confidentiality of the copyrighted images. The watermark disrupts the variational autoencoder (VAE) within SD models during customization, leading to distorted outputs when the customized model is used for generation. |
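A hedged sketch of learning a universal perturbation that disrupts a VAE's reconstruction, in the spirit of the approach above: maximize reconstruction error over synthetic images under an L-infinity budget. The `vae.encode`/`vae.decode` interface, loss choice, and budget are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def train_universal_watermark(vae, synthetic_images, eps=8 / 255, steps=200, lr=1e-2):
    # One shared perturbation for all images, kept within +/- eps.
    delta = torch.zeros_like(synthetic_images[0]).unsqueeze(0).requires_grad_(True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        for x in synthetic_images:                       # x: (C, H, W) in [0, 1]
            x_w = (x.unsqueeze(0) + delta).clamp(0, 1)   # watermarked image
            recon = vae.decode(vae.encode(x_w))
            loss = -F.mse_loss(recon, x_w)               # maximize VAE distortion
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                delta.clamp_(-eps, eps)                  # keep the watermark invisible
    return delta.detach()
```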
Experimental results demonstrate that DUAW effectively distorts images generated by customized SD models trained on watermarked images. This distortion is noticeable to human observers and detectable by a simple classifier, achieving high protection success rates. DUAW also exhibits strong transferability across different SD versions and VAE variants. |
The paper acknowledges the potential impact of image interference techniques on DUAW's robustness, although its effectiveness remains high. Future work could focus on enhancing robustness against more sophisticated interference methods and exploring DUAW's applicability to other diffusion-based models. |
diffusion_model, adversarial_watermark, copyright_protection, stable_diffusion, data-free, vae, llm, image_generation |
2311.14097 |
ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models |
Fei Kong, Jinhao Duan, Lichao Sun, Hao Cheng, Renjing Xu, Hengtao Shen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu |
Though diffusion models excel in image generation, their step-by-step
denoising leads to slow generation speeds. Consistency training addresses this
issue with single-step sampling but often produces lower-quality generations
and requires high training costs. In this paper, we show that optimizing
consistency training loss minimizes the Wasserstein distance between target and
generated distributions. As timestep increases, the upper bound accumulates
previous consistency training losses. Therefore, larger batch sizes are needed
to reduce both current and accumulated losses. We propose Adversarial
Consistency Training (ACT), which directly minimizes the Jensen-Shannon (JS)
divergence between distributions at each timestep using a discriminator.
Theoretically, ACT enhances generation quality and convergence. By
incorporating a discriminator into the consistency training framework, our
method achieves improved FID scores on CIFAR10 and ImageNet 64$\times$64 and
LSUN Cat 256$\times$256 datasets, retains zero-shot image inpainting
capabilities, and uses less than $1/6$ of the original batch size and fewer
than $1/2$ of the model parameters and training steps compared to the baseline
method, leading to a substantial reduction in resource consumption. Our code
is available: https://github.com/kong13661/ACT |
This paper introduces Adversarial Consistency Training (ACT), a novel method that enhances single-step image generation in consistency training by incorporating a discriminator, leading to faster sampling, reduced resource requirements, and improved generation quality compared to standard consistency training. |
The paper addresses the limitations of diffusion models, particularly slow generation speeds due to iterative denoising. While consistency training offers faster sampling, it often compromises generation quality. This research is important because it presents a more efficient and effective approach to single-step image generation using adversarial training within the consistency model framework. |
The authors first theoretically demonstrate that optimizing consistency training loss minimizes the Wasserstein distance between generated and target distributions, requiring large batch sizes to mitigate accumulating errors. To overcome this, they incorporate a discriminator that directly minimizes the Jensen-Shannon divergence between the distributions at each timestep, similar to GANs. This approach aims to enhance training efficiency and generation quality. The authors conduct experiments on CIFAR10, ImageNet 64x64, and LSUN Cat 256x256 datasets, comparing ACT with existing methods. Additionally, they perform ablation studies to analyze the impact of different components and hyperparameters on the model's performance. |
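A hedged sketch of pairing a consistency-training loss with a GAN-style discriminator term, as described above. The model signatures, the EDM-style noise parameterization, and the non-saturating GAN losses are assumptions; schedule details and the separate generator/discriminator optimizers are omitted.

```python
import torch
import torch.nn.functional as F


def act_losses(consistency_model, ema_model, discriminator, x0, t, t_next, noise):
    # Two adjacent points on the same diffusion trajectory (sigma-parameterized).
    x_t = x0 + t.view(-1, 1, 1, 1) * noise
    x_t_next = x0 + t_next.view(-1, 1, 1, 1) * noise

    pred = consistency_model(x_t_next, t_next)
    with torch.no_grad():
        target = ema_model(x_t, t)               # consistency target (EMA teacher)
    loss_ct = F.mse_loss(pred, target)           # standard consistency loss

    # Adversarial term: push single-step generations toward the data distribution.
    loss_gen = F.softplus(-discriminator(pred)).mean()
    loss_disc = (
        F.softplus(-discriminator(x0)).mean()
        + F.softplus(discriminator(pred.detach())).mean()
    )
    return loss_ct + loss_gen, loss_disc         # optimize with separate optimizers
```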
The proposed ACT method demonstrates superior FID scores compared to standard consistency training on all tested datasets while significantly reducing batch size, model parameters, and training steps. It achieves an FID of 6.0 on CIFAR10 with a batch size of 80, outperforming consistency training with a batch size of 512 (FID 8.7). Similar improvements are observed on ImageNet and LSUN Cat datasets, highlighting ACT's effectiveness and efficiency. |
The authors acknowledge the need for further exploration of the interaction between consistency training loss and adversarial loss for optimizing ACT. They also suggest exploring alternative distance metrics beyond Jensen-Shannon divergence to minimize the gap between distributions. Future research could focus on these aspects to further enhance the performance and stability of ACT. |
diffusion_model, gan, image_generation, consistency_training, adversarial_training, fast_sampling, resource_efficiency |
2401.07519 |
InstantID: Zero-shot Identity-Preserving Generation in Seconds |
Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, Yao Hu |
There has been significant progress in personalized image synthesis with
methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world
applicability is hindered by high storage demands, lengthy fine-tuning
processes, and the need for multiple reference images. Conversely, existing ID
embedding-based methods, while requiring only a single forward inference, face
challenges: they either necessitate extensive fine-tuning across numerous model
parameters, lack compatibility with community pre-trained models, or fail to
maintain high face fidelity. Addressing these limitations, we introduce
InstantID, a powerful diffusion model-based solution. Our plug-and-play module
adeptly handles image personalization in various styles using just a single
facial image, while ensuring high fidelity. To achieve this, we design a novel
IdentityNet by imposing strong semantic and weak spatial conditions,
integrating facial and landmark images with textual prompts to steer the image
generation. InstantID demonstrates exceptional performance and efficiency,
proving highly beneficial in real-world applications where identity
preservation is paramount. Moreover, our work seamlessly integrates with
popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving
as an adaptable plugin. Our codes and pre-trained checkpoints will be available
at https://github.com/InstantID/InstantID. |
This paper introduces InstantID, a novel plug-and-play diffusion model module for identity-preserving image generation that uses a single facial image to generate personalized images in various styles with high fidelity. |
This paper is important because it addresses limitations of existing personalized image synthesis methods that are either computationally expensive, require multiple reference images, or lack fidelity in preserving identity. InstantID offers a fast, efficient, and high-fidelity solution for real-world applications like e-commerce and AI portraits. |
The authors develop InstantID with three main components: 1) ID embedding using a pre-trained face model for strong identity features. 2) A lightweight image adapter module with decoupled cross-attention for image prompt integration. 3) IdentityNet, an adapted ControlNet using facial landmarks and ID embedding as conditions for preserving complex facial features. The model is trained on large-scale datasets like LAION-Face, optimizing only the adapter and IdentityNet while freezing the pre-trained diffusion model. |
InstantID demonstrates superior performance in preserving identity while maintaining stylistic flexibility, outperforming existing methods like IP-Adapter and achieving competitive results with LoRA models without requiring multiple images or training. It shows robustness, prompt editability, compatibility with ControlNet, and enables novel applications like novel view synthesis, identity interpolation, and multi-identity synthesis. |
Limitations include the highly coupled facial attributes in ID embedding and potential biases from the face model used. Future work could focus on decoupling facial attributes for better editing and addressing ethical considerations related to potential misuse. |
diffusion_model, identity_preserving, image_generation, face_embedding, controlnet, plug-and-play, single-shot, high-fidelity, image_synthesis |
2310.07702 |
ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models |
Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, Ying Shan |
In this work, we investigate the capability of generating images from
pre-trained diffusion models at much higher resolutions than the training image
sizes. In addition, the generated images should have arbitrary image aspect
ratios. When generating images directly at a higher resolution, 1024 x 1024,
with the pre-trained Stable Diffusion using training images of resolution 512 x
512, we observe persistent problems of object repetition and unreasonable
object structures. Existing works for higher-resolution generation, such as
attention-based and joint-diffusion approaches, cannot well address these
issues. As a new perspective, we examine the structural components of the U-Net
in diffusion models and identify the crucial cause as the limited perception
field of convolutional kernels. Based on this key observation, we propose a
simple yet effective re-dilation that can dynamically adjust the convolutional
perception field during inference. We further propose the dispersed convolution
and noise-damped classifier-free guidance, which can enable
ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our
approach does not require any training or optimization. Extensive experiments
demonstrate that our approach can address the repetition issue well and achieve
state-of-the-art performance on higher-resolution image synthesis, especially
in texture details. Our work also suggests that a pre-trained diffusion model
trained on low-resolution images can be directly used for high-resolution
visual generation without further tuning, which may provide insights for future
research on ultra-high-resolution image and video synthesis. |
This paper investigates the generation of high-resolution images from pre-trained diffusion models, addressing the issue of object repetition and unreasonable structures often observed in direct high-resolution generation. |
This research is significant because it offers a solution to generate high-quality images at resolutions exceeding the training data, crucial for applications demanding large image sizes like advertisements, without requiring extensive retraining or fine-tuning. |
The authors analyze the structural components of diffusion models, identifying limited convolutional receptive fields as the root cause for object repetition. They propose 're-dilation,' a method to dynamically adjust the convolutional perception field during inference, and 'convolution dispersion' with 'noise-damped classifier-free guidance' to enhance generation quality at ultra-high resolutions. |
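A minimal sketch of the re-dilation idea: temporarily enlarging the dilation (and matching padding) of the U-Net's 3x3 convolutions at inference so their receptive field scales with the target resolution, then restoring the originals afterward. This helper is illustrative, not the authors' released implementation, which applies re-dilation only to selected layers and timesteps.

```python
import torch.nn as nn


def redilate_convs(unet: nn.Module, factor: int = 2):
    """Multiply the dilation of every 3x3 conv by `factor`, adjusting padding so
    the spatial size is preserved (stride-1, zero-padded convs assumed).
    Returns the original settings so they can be restored."""
    originals = []
    for m in unet.modules():
        if isinstance(m, nn.Conv2d) and m.kernel_size == (3, 3):
            originals.append((m, m.dilation, m.padding))
            m.dilation = (m.dilation[0] * factor, m.dilation[1] * factor)
            # For a 3x3 kernel with stride 1, padding == dilation keeps the size.
            m.padding = (m.dilation[0], m.dilation[1])
    return originals


def restore_convs(originals):
    for m, dilation, padding in originals:
        m.dilation, m.padding = dilation, padding
```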
The proposed re-dilation method successfully mitigates object repetition issues and outperforms direct inference and attention scaling methods in terms of FID and KID scores across different Stable Diffusion versions and resolutions. The method also demonstrates superior texture detail preservation compared to a pre-trained super-resolution model. Furthermore, the approach generalizes well to text-to-video generation, enabling higher-resolution video synthesis without sacrificing image definition. |
The paper acknowledges limitations in evaluating texture definition using FID and KID, relying on a user preference study for assessment. Future work may explore optimizing the trade-off between image fidelity and denoising capabilities at ultra-high resolutions. Additionally, investigating the impact of re-dilation on other diffusion model applications like image editing and style transfer is suggested. |
diffusion_model, high_resolution, image_synthesis, re-dilation, convolution, perception_field, text-to-image, text-to-video, stable diffusion |