Table of Papers

Columns: arXiv ID | Title | Authors | Abstract | What | Why | How | Result | Limitations & Future Work (LF) | Tags
2310.03739 Aligning Text-to-Image Diffusion Models with Reward Backpropagation Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, Katerina Fragkiadaki Text-to-image diffusion models have recently emerged at the forefront of image generation, powered by very large-scale unsupervised or weakly supervised text-to-image training datasets. Due to their unsupervised training, controlling their behavior in downstream tasks, such as maximizing human-perceived image quality, image-text alignment, or ethical image generation, is difficult. Recent works finetune diffusion models to downstream reward functions using vanilla reinforcement learning, notorious for the high variance of the gradient estimators. In this paper, we propose AlignProp, a method that aligns diffusion models to downstream reward functions using end-to-end backpropagation of the reward gradient through the denoising process. While naive implementation of such backpropagation would require prohibitive memory resources for storing the partial derivatives of modern text-to-image models, AlignProp finetunes low-rank adapter weight modules and uses gradient checkpointing, to render its memory usage viable. We test AlignProp in finetuning diffusion models to various objectives, such as image-text semantic alignment, aesthetics, compressibility and controllability of the number of objects present, as well as their combinations. We show AlignProp achieves higher rewards in fewer training steps than alternatives, while being conceptually simpler, making it a straightforward choice for optimizing diffusion models for differentiable reward functions of interest. Code and Visualization results are available at https://align-prop.github.io/. This paper introduces AlignProp, a novel method for aligning text-to-image diffusion models with specific reward functions using end-to-end backpropagation through the denoising process, overcoming memory constraints with techniques like LoRA and gradient checkpointing. This work is important because it provides a more efficient and effective way to adapt pre-trained diffusion models for specific downstream tasks that require optimizing for objectives like aesthetics, semantic alignment, or ethical image generation, which are difficult to achieve with standard training methods. The authors frame denoising inference as a differentiable recurrent policy and train it using end-to-end backpropagation of gradients from a reward function. To handle memory issues, they fine-tune low-rank adapter (LoRA) weights and employ gradient checkpointing. To prevent overfitting to the reward function, they introduce randomized truncated backpropagation through time. AlignProp achieves higher reward scores and converges faster than reinforcement learning baselines like DDPO. It also demonstrates better generalization to new prompts and is preferred by human evaluators for fidelity and image-text alignment. The paper shows that mixing weights of models finetuned on different reward functions allows for interpolation between these objectives. The authors acknowledge the limitation of potential over-optimization when the reward function is imperfect and suggest that mitigating this risk is an area for future work. Additionally, extending AlignProp to diffusion-based language models for improved alignment with human feedback is another promising direction. diffusion_model, alignment, image_generation, reward_function, backpropagation, lora, gradient_checkpointing, text-to-image, human_evaluation, generalization
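Illustrative sketch (AlignProp, 2310.03739): a minimal, assumption-level rendering of randomized truncated backpropagation of a reward through the sampling loop. The tiny MLP "denoiser", linear "reward", step size, and dimensions are toy stand-ins, not the paper's Stable Diffusion + LoRA + gradient-checkpointing setup.

```python
import torch
import torch.nn as nn

denoiser = nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 16))
reward_model = nn.Linear(16, 1)                 # stand-in for a differentiable reward
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

T = 20                                          # denoising steps
for step in range(100):
    x = torch.randn(8, 16)                      # initial noise
    k = torch.randint(1, T + 1, ()).item()      # randomized truncation length
    for t in range(T):
        if t < T - k:
            with torch.no_grad():               # early steps stay out of the graph
                x = x - 0.05 * denoiser(x)
        else:                                   # backprop only through the last k steps
            x = x - 0.05 * denoiser(x)
    loss = -reward_model(x).mean()              # maximize reward on the final sample
    opt.zero_grad(); loss.backward(); opt.step()
```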
2404.18928 Stylus: Automatic Adapter Selection for Diffusion Models Michael Luo, Justin Wong, Brandon Trabucco, Yanping Huang, Joseph E. Gonzalez, Zhifeng Chen, Ruslan Salakhutdinov, Ion Stoica Beyond scaling base models with more data or parameters, fine-tuned adapters provide an alternative way to generate high fidelity, custom images at reduced costs. As such, adapters have been widely adopted by open-source communities, accumulating a database of over 100K adapters-most of which are highly customized with insufficient descriptions. This paper explores the problem of matching the prompt to a set of relevant adapters, built on recent work that highlight the performance gains of composing adapters. We introduce Stylus, which efficiently selects and automatically composes task-specific adapters based on a prompt's keywords. Stylus outlines a three-stage approach that first summarizes adapters with improved descriptions and embeddings, retrieves relevant adapters, and then further assembles adapters based on prompts' keywords by checking how well they fit the prompt. To evaluate Stylus, we developed StylusDocs, a curated dataset featuring 75K adapters with pre-computed adapter embeddings. In our evaluation on popular Stable Diffusion checkpoints, Stylus achieves greater CLIP-FID Pareto efficiency and is twice as preferred, with humans and multimodal models as evaluators, over the base model. See stylus-diffusion.github.io for more. This paper introduces Stylus, a system designed to automatically select and compose fine-tuned adapters for Stable Diffusion models based on user prompts to enhance image quality and diversity. This paper addresses the challenge of leveraging the vast and growing number of publicly available adapters for Stable Diffusion, which are often poorly documented and require manual selection. Stylus automates this process, making it easier for users to generate high-quality images by automatically identifying and combining relevant adapters based on the prompt, leading to improvements in visual fidelity, textual alignment, and image diversity. Stylus utilizes a three-stage framework: 1) **Refiner**: Employs a VLM to process adapter model cards and generate improved textual descriptions and embeddings for each adapter. 2) **Retriever**: Retrieves candidate adapters relevant to the user prompt by calculating cosine similarity scores between the prompt embedding and adapter embeddings. 3) **Composer**: Segments the prompt into keywords representing distinct tasks and assigns relevant adapters to each task using a long-context LLM, effectively filtering irrelevant adapters. Additionally, a masking strategy ensures diversity by applying different adapter combinations for a single prompt. Stylus demonstrates significant improvements over baseline Stable Diffusion models and alternative retrieval methods. Key results include: - Achieves a higher preference score (2:1) compared to baseline models in human evaluations. - Demonstrates better CLIP/FID Pareto efficiency, indicating superior visual fidelity and textual alignment. - Generates more diverse images per prompt, as evidenced by quantitative metrics (dFId) and VLM-based assessments. - Proves effective for various image-to-image tasks, including image translation and inpainting. The paper acknowledges limitations and suggests areas for future work: - **Task Blocking**: Composer may not fully prevent adapters from overriding existing concepts within the prompt. 
- **Task Diversity**: Merging adapters may reduce diversity in generating instances of a single task. - **Low-quality Adapters**: Blacklisting low-quality adapters is challenging, and some might still be selected. - **Retrieval Errors**: Refiner and Composer may introduce errors, leading to suboptimal adapter choices. Future work could explore: - Developing more robust solutions to address task blocking and diversity. - Improving the accuracy and efficiency of the Refiner and Composer components. - Investigating alternative masking schemes for enhanced diversity. diffusion_model, adapter, llm, analysis, image_generation, retrieval, lora
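Illustrative sketch (Stylus, 2404.18928): the retriever stage as described, ranking adapters by cosine similarity between a prompt embedding and precomputed adapter embeddings. The random embeddings and the top-k value are placeholders for the StylusDocs features fed to the composer.

```python
import torch
import torch.nn.functional as F

num_adapters, dim = 75_000, 768
adapter_emb = F.normalize(torch.randn(num_adapters, dim), dim=-1)
prompt_emb = F.normalize(torch.randn(dim), dim=-1)

scores = adapter_emb @ prompt_emb          # cosine similarity on unit vectors
topk = torch.topk(scores, k=50)            # candidate adapters passed to the composer
print(topk.indices[:10], topk.values[:10])
```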
2308.14328 Reinforcement Learning for Generative AI: A Survey Yuanjiang Cao, Quan Z. Sheng, Julian McAuley, Lina Yao Deep Generative AI has been a long-standing essential topic in the machine learning community, which can impact a number of application areas like text generation and computer vision. The major paradigm to train a generative model is maximum likelihood estimation, which pushes the learner to capture and approximate the target data distribution by decreasing the divergence between the model distribution and the target distribution. This formulation successfully establishes the objective of generative tasks, while it is incapable of satisfying all the requirements that a user might expect from a generative model. Reinforcement learning, serving as a competitive option to inject new training signals by creating new objectives that exploit novel signals, has demonstrated its power and flexibility to incorporate human inductive bias from multiple angles, such as adversarial learning, hand-designed rules and learned reward model to build a performant model. Thereby, reinforcement learning has become a trending research field and has stretched the limits of generative AI in both model design and application. It is reasonable to summarize and conclude advances in recent years with a comprehensive review. Although there are surveys in different application areas recently, this survey aims to shed light on a high-level review that spans a range of application areas. We provide a rigorous taxonomy in this area and make sufficient coverage on various models and applications. Notably, we also surveyed the fast-developing large language model area. We conclude this survey by showing the potential directions that might tackle the limit of current models and expand the frontiers for generative AI. This paper presents a comprehensive survey of how reinforcement learning (RL) is used in generative AI, analyzing its benefits, challenges, and applications across various domains. This survey is important because it provides a structured overview of a rapidly developing field that bridges reinforcement learning and generative AI, offering insights for both newcomers and experienced researchers to understand current progress and future directions. The authors reviewed a wide range of papers published in top conferences and journals, categorizing them based on how RL is used in generative tasks. They focused on applications involving sequential data generation, such as text, code, and molecules. The survey highlights that RL is beneficial for handling non-differentiable objectives, introducing new training signals, improving sampling in energy-based models, and automating neural architecture search. The authors also identify challenges like peaked distributions, exploration-exploitation trade-offs, sparse rewards, long-term credit assignment, and generalization. The paper points out several future research avenues, including reward function design for multi-objective optimization, model enhancement and control with RL, more sophisticated human preference modeling, addressing sample efficiency and generalization issues, incorporating novel RL algorithms, and understanding the implications of LLMs and foundation models. reinforcement_learning, generative_ai, survey, text_generation, code_generation, molecule_design, natural_language_processing, computer_vision, neural_architecture_search, diffusion_model
2312.09168 DiffusionLight: Light Probes for Free by Painting a Chrome Ball Pakkapon Phongthawee, Worameth Chinchuthakun, Nontaphat Sinsunthithet, Amit Raj, Varun Jampani, Pramook Khungurn, Supasorn Suwajanakorn We present a simple yet effective technique to estimate lighting in a single input image. Current techniques rely heavily on HDR panorama datasets to train neural networks to regress an input with limited field-of-view to a full environment map. However, these approaches often struggle with real-world, uncontrolled settings due to the limited diversity and size of their datasets. To address this problem, we leverage diffusion models trained on billions of standard images to render a chrome ball into the input image. Despite its simplicity, this task remains challenging: the diffusion models often insert incorrect or inconsistent objects and cannot readily generate images in HDR format. Our research uncovers a surprising relationship between the appearance of chrome balls and the initial diffusion noise map, which we utilize to consistently generate high-quality chrome balls. We further fine-tune an LDR diffusion model (Stable Diffusion XL) with LoRA, enabling it to perform exposure bracketing for HDR light estimation. Our method produces convincing light estimates across diverse settings and demonstrates superior generalization to in-the-wild scenarios. This paper introduces a novel technique, DiffusionLight, for estimating high dynamic range (HDR) lighting from a single image by leveraging pre-trained text-to-image diffusion models to inpaint a chrome ball into the scene and subsequently unwrapping its reflection to obtain an environment map. The paper addresses the limitations of current lighting estimation methods that rely on limited HDR panorama datasets, resulting in poor generalization to real-world, uncontrolled settings. By harnessing the vast image prior of diffusion models trained on billions of standard images, DiffusionLight demonstrates superior generalization and handles diverse in-the-wild scenarios effectively. The authors utilize a depth-conditioned Stable Diffusion XL model to inpaint chrome balls, addressing the challenge of generating high-quality reflections. They introduce an iterative inpainting algorithm to locate suitable initial noise maps for consistent ball generation. For HDR prediction, they fine-tune the model with LoRA to perform exposure bracketing, generating multiple LDR chrome balls at varying exposures which are then merged to produce a linearized HDR output. DiffusionLight achieves competitive results on standard benchmarks (Laval Indoor and Poly Haven), outperforming StyleLight in terms of Angular Error and Normalized RMSE. Notably, it exhibits strong generalization to in-the-wild images where existing methods struggle. The ablation study confirms the contribution of both the iterative inpainting algorithm and LoRA fine-tuning for improved performance. The paper acknowledges limitations such as the assumption of orthographic projection due to unknown camera parameters, occasional failure to reflect environments in overhead images, and the current slow processing time due to diffusion sampling. Future work includes addressing perspective projection, handling overhead views, and exploring faster sampling-efficient diffusion models. diffusion_model, light_estimation, hdr, inpainting, lora, environment_map, generalization, in-the-wild
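Illustrative sketch (DiffusionLight, 2312.09168): merging exposure-bracketed LDR chrome-ball crops into a relative linear HDR image. In the paper the brackets come from a LoRA-tuned SDXL; here the arrays, gamma, exposure values, and mid-tone weighting are generic assumptions about a standard merge.

```python
import numpy as np

def merge_ldr_to_hdr(ldr_images, exposures, gamma=2.4, eps=1e-6):
    """ldr_images: list of HxWx3 arrays in [0,1]; exposures: relative exposure scales."""
    num = np.zeros_like(ldr_images[0], dtype=np.float64)
    den = np.zeros_like(ldr_images[0], dtype=np.float64)
    for img, ev in zip(ldr_images, exposures):
        lin = np.clip(img, 0, 1) ** gamma          # undo display gamma
        w = 1.0 - np.abs(img - 0.5) * 2.0          # trust mid-tones, down-weight clipped pixels
        num += w * lin / ev
        den += w
    return num / (den + eps)                       # relative linear HDR radiance

ldr = [np.random.rand(64, 64, 3) for _ in range(3)]
hdr = merge_ldr_to_hdr(ldr, exposures=[1.0, 0.25, 0.0625])
```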
2312.09187 Vision-Language Models as a Source of Rewards Kate Baumli, Satinder Baveja, Feryal Behbahani, Harris Chan, Gheorghe Comanici, Sebastian Flennerhag, Maxime Gazeau, Kristian Holsheimer, Dan Horgan, Michael Laskin, Clare Lyle, Hussain Masoom, Kay McKinney, Volodymyr Mnih, Alexander Neitz, Fabio Pardo, Jack Parker-Holder, John Quan, Tim Rocktäschel, Himanshu Sahni, Tom Schaul, Yannick Schroecker, Stephen Spencer, Richie Steigerwald, Luyu Wang, Lei Zhang Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models, or VLMs, as sources of rewards for reinforcement learning agents. We show how rewards for visual achievement of a variety of language goals can be derived from the CLIP family of models, and used to train RL agents that can achieve a variety of language goals. We showcase this approach in two distinct visual domains and present a scaling trend showing how larger VLMs lead to more accurate rewards for visual goal achievement, which in turn produces more capable RL agents. This paper investigates the use of off-the-shelf vision-language models (VLMs), specifically CLIP, as reward functions for reinforcement learning agents in visual environments, enabling them to achieve language-specified goals. This paper is important because it addresses a key challenge in building generalist RL agents: the need for numerous, manually-designed reward functions. Using VLMs as reward generators has the potential to significantly improve the scalability and efficiency of training agents that can perform diverse tasks in complex environments. The authors propose a method to derive a binary reward signal from CLIP by: (1) computing the probability of goal achievement based on cosine similarity between image and text embeddings, and (2) thresholding this probability. They then use this reward to train RL agents in two visual domains: Playhouse and AndroidEnv, evaluating the agent's performance on achieving various language-specified goals. The key findings suggest that maximizing the VLM-derived reward leads to an improvement in ground truth reward, indicating the effectiveness of VLMs as reward functions. The authors also show that larger VLMs lead to more accurate rewards and subsequently better agent performance. Furthermore, they demonstrate the importance of prompt engineering in improving the performance of the VLM reward model. The paper acknowledges limitations regarding the potential for reward hacking, which was not observed within the scope of their experiments. Future work could explore generalizing negative sampling from generative distributions, such as LLMs. Additionally, exploring the impact of VLM advancements on training generalist agents without domain-specific fine-tuning is suggested. diffusion_model, llm, rl, vision-language model, reward function, clip, playhouse, androidenv, prompt engineering
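Illustrative sketch (VLMs as rewards, 2312.09187): deriving a binary reward from CLIP by softmaxing the goal caption against negative captions and thresholding, as the summary describes. The checkpoint name, threshold, and negative set are illustrative assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def vlm_reward(frame: Image.Image, goal: str, negatives: list[str], tau: float = 0.5):
    inputs = proc(text=[goal] + negatives, images=frame,
                  return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image      # scaled image-text cosine similarities
    p_goal = logits.softmax(dim=-1)[0, 0].item()       # probability mass on the goal text
    return 1.0 if p_goal > tau else 0.0                # thresholded binary reward
```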
2308.10916 Diffusion Model as Representation Learner Xingyi Yang, Xinchao Wang Diffusion Probabilistic Models (DPMs) have recently demonstrated impressive results on various generative tasks.Despite its promises, the learned representations of pre-trained DPMs, however, have not been fully understood. In this paper, we conduct an in-depth investigation of the representation power of DPMs, and propose a novel knowledge transfer method that leverages the knowledge acquired by generative DPMs for recognition tasks. Our study begins by examining the feature space of DPMs, revealing that DPMs are inherently denoising autoencoders that balance the representation learning with regularizing model capacity. To this end, we introduce a novel knowledge transfer paradigm named RepFusion. Our paradigm extracts representations at different time steps from off-the-shelf DPMs and dynamically employs them as supervision for student networks, in which the optimal time is determined through reinforcement learning. We evaluate our approach on several image classification, semantic segmentation, and landmark detection benchmarks, and demonstrate that it outperforms state-of-the-art methods. Our results uncover the potential of DPMs as a powerful tool for representation learning and provide insights into the usefulness of generative models beyond sample generation. The code is available at \url{https://github.com/Adamdad/Repfusion}. This paper investigates the potential of Diffusion Probabilistic Models (DPMs) for representation learning and proposes RepFusion, a novel knowledge transfer method that leverages pre-trained DPMs to enhance performance in recognition tasks like image classification and semantic segmentation. This paper is important because it explores the under-utilized representation learning capability of DPMs, going beyond their traditional generative applications. It offers a new perspective on leveraging pre-trained generative models for improved performance in discriminative tasks. The authors first establish a theoretical connection between DPMs and denoising autoencoders, demonstrating the time-dependent nature of DPM latent space. They then introduce RepFusion, which uses reinforcement learning to dynamically select optimal time steps for distilling knowledge from a pre-trained DPM into a student network. This student network is then fine-tuned for specific recognition tasks. RepFusion consistently outperforms baseline models and other self-supervised learning methods on various benchmarks, including CIFAR-10, Tiny-ImageNet, CelebAMask-HQ, and WFLW. Notably, it shows significant improvements in semantic segmentation, particularly in challenging scenarios with large pose variations and occlusions. The paper acknowledges the limitations of existing work on utilizing DPMs for representation learning, such as complex model modifications. As future work, the authors suggest exploring the time-step selection strategy further. Additionally, they highlight the need for a deeper understanding of the relationship between the chosen time step and the specific downstream task. diffusion_model, representation_learning, knowledge_distillation, semantic_segmentation, image_classification, landmark_detection, reinforcement_learning, analysis
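Illustrative sketch (RepFusion, 2308.10916): distilling features from a frozen teacher at a time step chosen by a learned policy updated with REINFORCE. All networks are toy stand-ins, and using the negative distillation error as the reward is purely for illustration; the paper derives its reward from the downstream task.

```python
import torch, torch.nn as nn, torch.nn.functional as F

T = 1000
teacher = nn.Linear(32, 128)                       # stand-in: DPM feature extractor at (x, t)
student = nn.Linear(32, 128)
policy_logits = nn.Parameter(torch.zeros(T))       # categorical policy over time steps
opt = torch.optim.Adam(list(student.parameters()) + [policy_logits], lr=1e-3)

for it in range(200):
    x = torch.randn(16, 32)
    dist = torch.distributions.Categorical(logits=policy_logits)
    t = dist.sample()                              # pick a supervision time step
    with torch.no_grad():
        target = teacher(x + 0.01 * t * torch.randn_like(x))  # teacher feature at step t
    distill = F.mse_loss(student(x), target)
    reward = -distill.detach()                     # illustrative reward signal
    loss = distill + (-dist.log_prob(t) * reward)  # distillation + REINFORCE term
    opt.zero_grad(); loss.backward(); opt.step()
```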
2310.03502 Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, there are diffusion-based models that have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky1, a novel exploration of latent diffusion architecture, combining the principles of the image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality. This paper introduces Kandinsky, a novel text-to-image generation model based on latent diffusion architecture, combining image prior models with latent diffusion techniques, and demonstrates its capabilities through various generation modes and state-of-the-art performance on image generation quality. This paper is important because it presents a novel approach to text-to-image generation using a combination of image prior and latent diffusion, achieves state-of-the-art performance on image generation quality, and provides a fully open-source implementation of the model and user-friendly tools like a web application and Telegram bot, making it accessible for various applications. The authors developed Kandinsky by training an image prior model to map text embeddings to image embeddings of CLIP and utilizing a modified MoVQ implementation as the image autoencoder component. They conducted experiments on the COCO-30K dataset using FID-CLIP curves and human evaluation to assess the performance of different configurations, including various image prior setups and the effect of latent quantization. Kandinsky achieved a FID score of 8.03 on the COCO-30K dataset, making it the top open-source performer in terms of measurable image generation quality. The study found that a simple linear mapping for image prior yielded the best FID score, suggesting a potential linear relationship between visual and textual embedding spaces. Additionally, quantization of latent codes in MoVQ slightly improved image quality. Limitations mentioned include the need for further research to enhance the semantic coherence between text and generated images and improve FID scores and image quality based on human evaluation. 
Future work will focus on exploring newer image encoders, developing more efficient UNet architectures, improving text prompt understanding, generating higher-resolution images, and investigating new features like local image editing and addressing the potential for generating harmful content. diffusion_model, text-to-image, image_generation, image_prior, latent_diffusion, movq, clip, fid, open-source, web_application, telegram_bot
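Illustrative sketch (Kandinsky, 2310.03502): the linear image-prior idea the results point to, i.e. a single linear map trained to regress CLIP image embeddings from CLIP text embeddings. Random tensors stand in for real paired CLIP features, and the dimensions and loss are assumptions.

```python
import torch, torch.nn as nn

text_dim, image_dim = 768, 768
prior = nn.Linear(text_dim, image_dim)
opt = torch.optim.Adam(prior.parameters(), lr=1e-4)

for step in range(1000):
    text_emb = torch.randn(64, text_dim)      # CLIP text embeddings (placeholder)
    image_emb = torch.randn(64, image_dim)    # paired CLIP image embeddings (placeholder)
    loss = nn.functional.mse_loss(prior(text_emb), image_emb)
    opt.zero_grad(); loss.backward(); opt.step()

# At inference, prior(text_emb) conditions the latent diffusion decoder.
```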
2401.01335 Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents. Codes are available at https://github.com/uclaml/SPIN. This paper proposes a new fine-tuning method called Self-Play fine-tuning (SPIN) for Large Language Models (LLMs) that leverages a self-play mechanism to improve a model's performance without requiring additional human-annotated data. This paper is important because it offers a way to enhance LLM performance without the need for expensive and time-consuming data annotation beyond the initial fine-tuning dataset. It provides a theoretical analysis of the method's convergence and demonstrates its empirical effectiveness on various benchmark datasets. The authors propose a self-play mechanism where an LLM acts as both the main player and the opponent. The main player is trained to distinguish between responses generated by the opponent (an older version of the LLM) and human-annotated data. This iterative process refines the LLM's ability to generate responses aligned with the target data distribution. The paper shows SPIN significantly improves LLM performance on benchmarks like HuggingFace Open LLM Leaderboard and MT-Bench. Notably, SPIN outperforms methods like Direct Preference Optimization (DPO), which requires additional preference data, and achieves comparable results even at iteration 0. The paper also demonstrates the importance of iterative training and analyzes the impact of training data size. The paper acknowledges a limitation in that the fixed target data distribution, derived from humans, limits the potential performance. Future work could explore dynamically changing target distributions to push LLM capabilities beyond human-level. Additionally, the authors suggest exploring methods to reduce the volume of synthetic data needed for training. llm, fine-tuning, self-play, sft, dpo, analysis, benchmark
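Illustrative sketch (SPIN, 2401.01335): the self-play objective as described, a logistic loss on the weighted log-likelihood ratio between the current model and the previous-iteration opponent, contrasting human responses with self-generated ones (a DPO-like form). The λ value and the placeholder log-probabilities are assumptions.

```python
import torch
import torch.nn.functional as F

def spin_loss(logp_cur_human, logp_old_human, logp_cur_self, logp_old_self, lam=0.1):
    """Each argument: summed token log-probs of a response under the indicated model."""
    margin = lam * ((logp_cur_human - logp_old_human)
                    - (logp_cur_self - logp_old_self))
    return -F.logsigmoid(margin).mean()

# Placeholder log-probabilities for a batch of 4 prompts.
loss = spin_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```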
2312.06663 CAD: Photorealistic 3D Generation via Adversarial Distillation Ziyu Wan, Despoina Paschalidou, Ian Huang, Hongyu Liu, Bokui Shen, Xiaoyu Xiang, Jing Liao, Leonidas Guibas The increased demand for 3D data in AR/VR, robotics and gaming applications, gave rise to powerful generative pipelines capable of synthesizing high-quality 3D objects. Most of these models rely on the Score Distillation Sampling (SDS) algorithm to optimize a 3D representation such that the rendered image maintains a high likelihood as evaluated by a pre-trained diffusion model. However, finding a correct mode in the high-dimensional distribution produced by the diffusion model is challenging and often leads to issues such as over-saturation, over-smoothing, and Janus-like artifacts. In this paper, we propose a novel learning paradigm for 3D synthesis that utilizes pre-trained diffusion models. Instead of focusing on mode-seeking, our method directly models the distribution discrepancy between multi-view renderings and diffusion priors in an adversarial manner, which unlocks the generation of high-fidelity and photorealistic 3D content, conditioned on a single image and prompt. Moreover, by harnessing the latent space of GANs and expressive diffusion model priors, our method facilitates a wide variety of 3D applications including single-view reconstruction, high diversity generation and continuous 3D interpolation in the open domain. The experiments demonstrate the superiority of our pipeline compared to previous works in terms of generation quality and diversity. This paper introduces Consistent Adversarial Distillation (CAD), a novel method for synthesizing high-quality, photorealistic 3D objects from a single image and text prompt by leveraging pre-trained 2D diffusion models and addressing limitations of existing score distillation methods. This work is important because it overcomes limitations of previous 3D generation techniques, such as over-saturation, over-smoothing, and limited diversity, by directly modeling the distribution of a pre-trained diffusion model through adversarial learning, leading to higher quality and more diverse 3D object synthesis. The authors propose a framework that uses a StyleGAN2-based generator to model the 3D distribution of objects, trained adversarially against a discriminator to match the distribution of a pre-trained 2D diffusion model. To ensure multi-view consistency and high-fidelity generation, they employ a two-stage training process with 2D and 3D upsampling branches, a camera pose pruning strategy for filtering inconsistent samples, and a distribution refinement step using additional diffusion models. CAD generates high-fidelity 3D objects with photorealistic textures and fewer artifacts compared to existing methods like DreamFusion, ProlificDreamer, Magic123, and Zero-1-to-3. It also demonstrates superior performance in quantitative metrics like CLIP similarity score and qualitative evaluations including a user study, highlighting its ability to produce diverse and realistic 3D objects. The authors acknowledge limitations in optimization speed due to volumetric rendering and suggest exploring efficient rendering techniques like Gaussian Splatting. They also propose future work on enabling multi-conditional generation and extending CAD to handle scene-level synthesis. diffusion_model, gan, 3d, single-view reconstruction, photorealistic, adversarial_distillation
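Illustrative sketch (CAD, 2312.06663): the adversarial core only, compressed to a generic non-saturating GAN step in which the "real" batch stands in for diffusion-refined renderings and the "fake" batch for the 3D generator's multi-view renderings. Every module is a toy stand-in; the paper's two-stage training, pose pruning, and refinement are not modeled here.

```python
import torch, torch.nn as nn, torch.nn.functional as F

generator = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 3 * 32 * 32))
discriminator = nn.Sequential(nn.Linear(3 * 32 * 32, 256), nn.ReLU(), nn.Linear(256, 1))
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def diffusion_prior_samples(n):                 # placeholder for diffusion-refined views
    return torch.randn(n, 3 * 32 * 32)

for it in range(100):
    z = torch.randn(8, 64)
    fake = generator(z)                         # rendered views, flattened
    real = diffusion_prior_samples(8)
    d_loss = F.softplus(discriminator(fake.detach())).mean() + \
             F.softplus(-discriminator(real)).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    g_loss = F.softplus(-discriminator(fake)).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```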
2405.00760 Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models Xiaoshi Wu, Yiming Hao, Manyuan Zhang, Keqiang Sun, Zhaoyang Huang, Guanglu Song, Yu Liu, Hongsheng Li Optimizing a text-to-image diffusion model with a given reward function is an important but underexplored research area. In this study, we propose Deep Reward Tuning (DRTune), an algorithm that directly supervises the final output image of a text-to-image diffusion model and back-propagates through the iterative sampling process to the input noise. We find that training earlier steps in the sampling process is crucial for low-level rewards, and deep supervision can be achieved efficiently and effectively by stopping the gradient of the denoising network input. DRTune is extensively evaluated on various reward models. It consistently outperforms other algorithms, particularly for low-level control signals, where all shallow supervision methods fail. Additionally, we fine-tune Stable Diffusion XL 1.0 (SDXL 1.0) model via DRTune to optimize Human Preference Score v2.1, resulting in the Favorable Diffusion XL 1.0 (FDXL 1.0) model. FDXL 1.0 significantly enhances image quality compared to SDXL 1.0 and reaches comparable quality compared with Midjourney v5.2. This paper presents DRTune, a novel algorithm for efficiently fine-tuning text-to-image diffusion models using deep reward supervision, enabling optimization based on various reward functions like image aesthetics and symmetry. This research is important because it addresses the challenge of optimizing diffusion models with complex reward functions, particularly those requiring deep supervision, which is crucial for controlling global image properties and improving generated image quality. The authors propose DRTune, which employs two key techniques: 1) stopping gradients at denoising network inputs to prevent gradient explosion during back-propagation, and 2) training a strategically sampled subset of denoising steps to improve training efficiency. They compare DRTune with existing reward training methods on a variety of reward functions, including aesthetic scores, CLIPScore, PickScore, symmetry, compressibility, and objectness. DRTune consistently outperforms baseline methods in optimizing various reward functions, particularly those demanding deep supervision for global image properties like symmetry. Additionally, the authors demonstrate the practical application of DRTune by fine-tuning Stable Diffusion XL 1.0 (SDXL 1.0) with the Human Preference Score v2.1 reward, creating Favorable Diffusion XL 1.0 (FDXL 1.0), which exhibits significantly improved image quality compared to SDXL 1.0 and even achieves comparable quality with Midjourney v5.2. The authors acknowledge the limitations of reward-based training, specifically the risk of reward hacking, where models might prioritize optimizing the reward function at the expense of overall image quality. They suggest exploring regularization techniques to mitigate this issue. Additionally, they recognize the potential negative social impact of advanced generative models, such as the creation of highly plausible misinformation and the amplification of biases present in the training data. Future work could focus on developing more robust reward functions and exploring methods to mitigate potential biases in training data. diffusion_model, reward, drtune, stable diffusion, image_generation, optimization, deep_learning, text-to-image
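Illustrative sketch (DRTune, 2405.00760): the two techniques the summary names, i.e. stopping the gradient at the denoising network's input and keeping only a sampled subset of steps in the graph. The toy denoiser, reward, step size, and subset size are assumptions, not the SDXL setup.

```python
import torch, torch.nn as nn

denoiser = nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 16))
reward = nn.Linear(16, 1)                        # stand-in differentiable reward
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

T, K = 50, 5                                     # total steps, trained subset size
for it in range(100):
    x = torch.randn(8, 16)
    trained = set(torch.randperm(T)[:K].tolist())   # steps kept in the graph
    for t in range(T):
        eps = denoiser(x.detach())               # stop-grad at the network input
        if t not in trained:
            eps = eps.detach()                   # drop untrained steps from the graph
        x = x - 0.05 * eps
    loss = -reward(x).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```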
2404.05674 MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, Xiao Yang In this paper, we present MoMA: an open-vocabulary, training-free personalized image model that boasts flexible zero-shot capabilities. As foundational text-to-image models rapidly evolve, the demand for robust image-to-image translation grows. Addressing this need, MoMA specializes in subject-driven personalized image generation. Utilizing an open-source, Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as both a feature extractor and a generator. This approach effectively synergizes reference image and text prompt information to produce valuable image features, facilitating an image diffusion model. To better leverage the generated features, we further introduce a novel self-attention shortcut method that efficiently transfers image features to an image diffusion model, improving the resemblance of the target object in generated images. Remarkably, as a tuning-free plug-and-play module, our model requires only a single reference image and outperforms existing methods in generating images with high detail fidelity, enhanced identity-preservation and prompt faithfulness. Our work is open-source, thereby providing universal access to these advancements. This paper introduces MoMA, an open-vocabulary and training-free personalized image generation model that excels in producing high-fidelity images with preserved object identity while adhering to text prompts. This paper addresses the limitations of existing personalized image generation methods that require extensive tuning, are confined to specific domains, or lack detail fidelity. MoMA offers a more efficient and versatile approach to personalization by leveraging the power of MLLMs, making it more accessible and applicable to a wider range of image generation tasks. MoMA employs a multi-modal LLM adapter with fine-grained feature transfer. It utilizes a generative multi-modal decoder to extract and modify image features from a reference image based on the target text prompt. It also extracts object features from a white-background version of the reference image using the UNet's self-attention layers. These features are then injected into a pre-trained UNet during image generation. The model is pre-trained in two stages: first, the multi-modal decoder is trained to generate contextualized image embeddings, and then the decoupled attention modules in the UNet are optimized. MoMA demonstrates superior detail accuracy and faithfulness to the target object across varied backgrounds in recontextualization tasks. For texture modification, it effectively alters the texture while preserving other visual features. Notably, it achieves this without per-instance tuning, making it efficient and readily applicable. Experiments show MoMA outperforms existing tuning-free methods both qualitatively and quantitatively in terms of detail fidelity, identity preservation, and prompt adherence. It also demonstrates generalizability by successfully integrating with various community-trained diffusion models. The paper acknowledges limitations in generating images with rare subjects or those containing text, where details might be lost. Future work could explore techniques to improve the model's ability to handle such cases. 
Additionally, the paper highlights the potential misuse of the model for creating deceptive content and suggests careful consideration and implementation of safeguards before widespread deployment. diffusion_model, mllm, personalization, image_generation, open-vocabulary, tuning-free, image-to-image, self-attention
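Illustrative sketch (MoMA, 2404.05674): one plausible reading of the decoupled-attention fusion described above, where a separate, weighted cross-attention over injected reference-image features is added to the text cross-attention. Shapes, the learnable fusion weight, and the feature sources are assumptions.

```python
import torch
import torch.nn as nn

class DecoupledAttention(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scale = nn.Parameter(torch.tensor(1.0))   # strength of injected image features

    def forward(self, hidden, text_ctx, image_ctx):
        out_text, _ = self.text_attn(hidden, text_ctx, text_ctx)
        out_image, _ = self.image_attn(hidden, image_ctx, image_ctx)
        return out_text + self.scale * out_image

block = DecoupledAttention()
h = torch.randn(2, 64, 320)          # UNet tokens
t = torch.randn(2, 77, 320)          # text context
i = torch.randn(2, 16, 320)          # reference-image features from the MLLM adapter
out = block(h, t, i)
```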
2309.07254 Mitigate Replication and Copying in Diffusion Models with Generalized Caption and Dual Fusion Enhancement Chenghao Li, Dake Chen, Yuke Zhang, Peter A. Beerel While diffusion models demonstrate a remarkable capability for generating high-quality images, their tendency to `replicate' training data raises privacy concerns. Although recent research suggests that this replication may stem from the insufficient generalization of training data captions and duplication of training images, effective mitigation strategies remain elusive. To address this gap, our paper first introduces a generality score that measures the caption generality and employ large language model (LLM) to generalize training captions. Subsequently, we leverage generalized captions and propose a novel dual fusion enhancement approach to mitigate the replication of diffusion models. Our empirical results demonstrate that our proposed methods can significantly reduce replication by 43.5% compared to the original diffusion model while maintaining the diversity and quality of generations. Code is available at https://github.com/HowardLi0816/dual-fusion-diffusion. This paper tackles the privacy issue of data replication in diffusion models by proposing a method to quantify caption generality and a novel dual fusion enhancement training approach. This paper is significant as it addresses the growing privacy concerns regarding diffusion models replicating training data, which is crucial for the responsible development and deployment of such models. The authors introduce a "generality score" to measure caption generality and utilize LLMs to generate more general captions. They then propose a dual fusion enhancement approach that fuses specific object features with the original image in latent space and combines corresponding label embeddings with the caption. They evaluate their methods by fine-tuning Stable Diffusion v2.1 on a subset of LAION-2B and measuring replication score and FID. The proposed method significantly reduces replication by 43.5% compared to the baseline and outperforms other mitigation strategies while maintaining comparable generation quality and diversity. The paper also shows that using generalized captions generated by LLMs effectively reduces replication. The paper acknowledges a trade-off between reducing replication and maintaining image generation quality. Future work includes exploring the use of the generality score to guide caption generalization and iteratively enhance caption generality. diffusion_model, privacy, data_replication, llm, caption_generation, generality, fusion, stable diffusion
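Illustrative sketch (dual fusion, 2309.07254): the fusion step as the summary states it, blending an object latent into the image latent and the object-label embedding into the caption embedding before computing the diffusion loss. The blend form, weights, and tensor shapes are assumptions.

```python
import torch

alpha, beta = 0.3, 0.3                         # fusion strengths (assumed)
image_latent = torch.randn(4, 4, 64, 64)       # VAE latent of the training image
object_latent = torch.randn(4, 4, 64, 64)      # VAE latent of a cropped object
caption_emb = torch.randn(4, 77, 768)          # text-encoder output for the generalized caption
label_emb = torch.randn(4, 77, 768)            # text-encoder output for the object label

fused_latent = (1 - alpha) * image_latent + alpha * object_latent
fused_text = (1 - beta) * caption_emb + beta * label_emb
# (fused_latent, fused_text) replace the original pair in the diffusion training loss.
```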
2309.05793 PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models Li Chen, Mengyi Zhao, Yiheng Liu, Mingxu Ding, Yangyang Song, Shizun Wang, Xu Wang, Hao Yang, Jing Liu, Kang Du, Min Zheng Personalized text-to-image generation has emerged as a powerful and sought-after tool, empowering users to create customized images based on their specific concepts and prompts. However, existing approaches to personalization encounter multiple challenges, including long tuning times, large storage requirements, the necessity for multiple input images per identity, and limitations in preserving identity and editability. To address these obstacles, we present PhotoVerse, an innovative methodology that incorporates a dual-branch conditioning mechanism in both text and image domains, providing effective control over the image generation process. Furthermore, we introduce facial identity loss as a novel component to enhance the preservation of identity during training. Remarkably, our proposed PhotoVerse eliminates the need for test time tuning and relies solely on a single facial photo of the target identity, significantly reducing the resource cost associated with image generation. After a single training phase, our approach enables generating high-quality images within only a few seconds. Moreover, our method can produce diverse images that encompass various scenes and styles. The extensive evaluation demonstrates the superior performance of our approach, which achieves the dual objectives of preserving identity and facilitating editability. Project page: https://photoverse2d.github.io/ This paper introduces PhotoVerse, a novel method for personalized text-to-image generation that uses a dual-branch conditioning mechanism to enable fast generation and high-quality images using only a single reference image of the target identity. This paper addresses the limitations of existing personalized text-to-image generation methods, such as long tuning times, large storage requirements, and the need for multiple input images. It offers a faster and more user-friendly approach for incorporating specific individuals into diverse scenes with high fidelity. The paper proposes a dual-branch conditioning mechanism that combines improved identity textual embeddings and spatial concept cues through dual-modality adapters in both text and image domains. The method utilizes a pre-trained Stable Diffusion model and incorporates a novel facial identity loss component during training to enhance identity preservation. The approach employs lightweight adapters and fine-tunes only the cross-attention module of the UNet, resulting in fast and efficient personalization without the need for test-time tuning. PhotoVerse demonstrates superior performance in preserving identity attributes while enabling image editing, stylization, and new scene generation. It achieves high identity similarity across diverse ethnicities and produces high-quality images with sharp details and natural aesthetics. The method eliminates the need for test-time tuning and generates images in just a few seconds using a single reference image, significantly improving efficiency compared to existing methods. The authors acknowledge potential bias in pre-trained large models as a limitation. Future work could involve exploring methods to mitigate this bias and further enhance the generalization capabilities of the model. 
Additionally, incorporating control mechanisms for pose and composition could provide users with more fine-grained control over image generation. diffusion_model, text-to-image, personalization, identity_preservation, fast_generation, single_image, dual-branch_conditioning, adapter, facial_identity_loss, image_editing, stylization
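Illustrative sketch (PhotoVerse, 2309.05793): a facial identity loss in the form of one minus cosine similarity between face-recognition embeddings of the generated and reference faces. The `face_encoder` here is a toy stand-in for a frozen recognition network, and the input resolution is an assumption.

```python
import torch
import torch.nn.functional as F

face_encoder = torch.nn.Sequential(               # stand-in for a frozen face-recognition model
    torch.nn.Flatten(), torch.nn.Linear(3 * 112 * 112, 512))

def identity_loss(generated_face, reference_face):
    e_gen = F.normalize(face_encoder(generated_face), dim=-1)
    e_ref = F.normalize(face_encoder(reference_face), dim=-1)
    return (1.0 - (e_gen * e_ref).sum(dim=-1)).mean()

loss = identity_loss(torch.randn(2, 3, 112, 112), torch.randn(2, 3, 112, 112))
```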
2312.08578 A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, Adriana Romero-Soriano Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest quality of available curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly-aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 8012 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond with significant improvement on our sDCI based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human annotated dense image captioning dataset, we hope to enable the development of new benchmarks or fine-tuning recipes for the next generation of VLMs to come. The paper introduces the Densely Captioned Images (DCI) dataset, a collection of 8012 natural images with human-annotated, mask-aligned descriptions averaging over 1000 words each, enabling the evaluation of vision-language models' understanding of fine-grained image details. This paper is important because it addresses the limitations of existing vision-language datasets that rely on short, loosely-aligned captions, hindering the development and evaluation of models capable of deep visual-linguistic understanding. The introduction of DCI with its dense and aligned captions provides a valuable resource for benchmarking and advancing vision-language models. The authors first preprocessed images from the SA-1B dataset using the Segment Anything Model (SAM) to extract hierarchical submasks. Then, they employed a multi-stage crowdsourcing approach with qualification tasks and iterative feedback to ensure high-quality annotations. To fit existing model limitations, they used LLaMA2 to generate summarized captions and negatives within CLIP's token limit, resulting in the summarized DCI (sDCI) dataset. Finally, they evaluated several state-of-the-art VLMs on sDCI using novel benchmark tasks like Subcrop-Caption Matching (SCM) and negatives-based tests. The results show that existing VLMs, even those trained with negatives or dense captions, struggle to accurately match captions to corresponding subregions within an image, highlighting limitations in fine-grained understanding. Additionally, fine-tuning CLIP on sDCI significantly improved performance on benchmarks like ARO and VL-Checklist, outperforming models trained on significantly larger but loosely-aligned datasets like DAC. These findings underscore the importance of dense and aligned image-text pairs for effective VLM training. The authors acknowledge limitations in using LLM-generated summaries, which may not capture all the nuances of the full annotations, and the limited text context length of current VLMs. 
They suggest future work exploring models with larger context windows to leverage the full DCI dataset, and investigating techniques like bitext mining to expand the dataset further. clip, vlm, dataset, dense_captioning, benchmark, fine-tuning, evaluation
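Illustrative sketch (DCI, 2312.08578): the Subcrop-Caption Matching idea, scoring how often the highest-similarity caption is the correct one for each subcrop. The embeddings here are random placeholders for CLIP-style features of paired subcrops and captions.

```python
import torch
import torch.nn.functional as F

def scm_accuracy(image_features, text_features):
    """image_features, text_features: (N, D) embeddings of N paired subcrops/captions."""
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    sims = img @ txt.T                            # (N, N) similarity matrix
    pred = sims.argmax(dim=-1)                    # best-matching caption per subcrop
    return (pred == torch.arange(len(img))).float().mean().item()

acc = scm_accuracy(torch.randn(8, 512), torch.randn(8, 512))
```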
2403.12143 Graph Neural Networks for Learning Equivariant Representations of Neural Networks Miltiadis Kofinas, Boris Knyazev, Yan Zhang, Yunlu Chen, Gertjan J. Burghouts, Efstratios Gavves, Cees G. M. Snoek, David W. Zhang Neural networks that process the parameters of other neural networks find applications in domains as diverse as classifying implicit neural representations, generating neural network weights, and predicting generalization errors. However, existing approaches either overlook the inherent permutation symmetry in the neural network or rely on intricate weight-sharing patterns to achieve equivariance, while ignoring the impact of the network architecture itself. In this work, we propose to represent neural networks as computational graphs of parameters, which allows us to harness powerful graph neural networks and transformers that preserve permutation symmetry. Consequently, our approach enables a single model to encode neural computational graphs with diverse architectures. We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations, predicting generalization performance, and learning to optimize, while consistently outperforming state-of-the-art methods. The source code is open-sourced at https://github.com/mkofinas/neural-graphs. This paper introduces a novel approach to representing neural networks as computational graphs of parameters called 'neural graphs', which allows for leveraging powerful graph neural networks and transformers while preserving permutation symmetry. This research is important because it addresses limitations of existing methods that process neural network parameters, such as overlooking inherent permutation symmetry or relying on complex weight-sharing patterns. By representing neural networks as graphs, this approach allows a single model to learn from diverse architectures and opens up new possibilities for applications like neural network analysis, generation, and optimization. The authors represent neural networks as graphs by mapping neurons to nodes and connections to edges, with weights and biases as edge and node features respectively. This representation is then used as input to graph neural networks (GNNs) or transformers, adapted to incorporate inductive biases from the neural graph structure. They validate their approach on various tasks including implicit neural representation classification and editing, predicting generalization performance of CNNs, and learning to optimize. The proposed neural graph approach consistently outperforms state-of-the-art methods on tasks like INR classification and style editing, showing significant improvement over previous methods like DWSNet and NFN. It also demonstrates superior performance in predicting CNN generalization, especially when dealing with diverse architectures where accounting for both parameters and architecture is crucial. Furthermore, the method shows promise in the field of learning to optimize, achieving strong performance on both validation and test tasks. The authors acknowledge limitations in terms of architectural diversity explored, focusing mainly on MLPs and CNNs. Future work could investigate the representation of other architectures like transformers. Additionally, the strong performance on INRs is currently limited to 2D images, and extending it to handle 3D representations like neural radiance fields is an area for further exploration. 
neural_network, graph_neural_network, transformer, representation_learning, permutation_symmetry, implicit_neural_representation, generalization, learning_to_optimize
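Illustrative sketch (neural graphs, 2403.12143): turning an MLP into a graph with one node per neuron (bias as node feature) and one edge per connection (weight as edge feature), which a GNN or transformer can then process. The exact layout and feature conventions are simplified assumptions.

```python
import torch
import torch.nn as nn

def mlp_to_graph(mlp: nn.Sequential):
    linears = [m for m in mlp if isinstance(m, nn.Linear)]
    sizes = [linears[0].in_features] + [l.out_features for l in linears]
    offsets = [0]
    for s in sizes[:-1]:
        offsets.append(offsets[-1] + s)
    node_feat = torch.zeros(sum(sizes), 1)                 # bias as node feature (inputs get zero)
    edges, edge_feat = [], []
    for i, lin in enumerate(linears):
        w, b = lin.weight.detach(), lin.bias.detach()
        node_feat[offsets[i + 1]:offsets[i + 1] + lin.out_features, 0] = b
        for out in range(lin.out_features):
            for inp in range(lin.in_features):
                edges.append((offsets[i] + inp, offsets[i + 1] + out))
                edge_feat.append(w[out, inp])              # weight as edge feature
    return node_feat, torch.tensor(edges).T, torch.stack(edge_feat).unsqueeze(-1)

nodes, edge_index, edge_attr = mlp_to_graph(
    nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1)))
```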
2405.02730 U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers Yuchuan Tian, Zhijun Tu, Hanting Chen, Jie Hu, Chao Xu, Yunhe Wang Diffusion Transformers (DiTs) introduce the transformer architecture to diffusion tasks for latent-space image generation. With an isotropic architecture that chains a series of transformer blocks, DiTs demonstrate competitive performance and good scalability; but meanwhile, the abandonment of U-Net by DiTs and their following improvements is worth rethinking. To this end, we conduct a simple toy experiment by comparing a U-Net architectured DiT with an isotropic one. It turns out that the U-Net architecture only gain a slight advantage amid the U-Net inductive bias, indicating potential redundancies within the U-Net-style DiT. Inspired by the discovery that U-Net backbone features are low-frequency-dominated, we perform token downsampling on the query-key-value tuple for self-attention and bring further improvements despite a considerable amount of reduction in computation. Based on self-attention with downsampled tokens, we propose a series of U-shaped DiTs (U-DiTs) in the paper and conduct extensive experiments to demonstrate the extraordinary performance of U-DiT models. The proposed U-DiT could outperform DiT-XL/2 with only 1/6 of its computation cost. Codes are available at https://github.com/YuchuanTian/U-DiT. This paper introduces U-DiT, a U-Net architecture diffusion transformer model for latent-space image generation that leverages token downsampling in self-attention to improve performance and reduce computational cost compared to isotropic DiT models. The paper is important as it challenges the prevailing use of isotropic architectures in diffusion transformers by demonstrating the potential of U-Net architecture combined with a novel downsampled self-attention mechanism, leading to state-of-the-art performance with reduced computational costs. The authors conducted a toy experiment comparing a simple U-Net DiT with an isotropic DiT and found that while U-Net offered some benefits, it was underutilized. They then introduced downsampled self-attention, reducing redundancy by focusing on low-frequency components in the U-Net backbone. They scaled this model up, creating U-DiT and evaluating it against existing DiT models on ImageNet 256x256, measuring FID, sFID, IS, precision, and recall. U-DiT significantly outperforms isotropic DiTs, achieving better FID scores with fewer FLOPs. For example, U-DiT-B surpasses DiT-XL/2 in performance with only 1/6th of the computational cost. This highlights the efficacy of the U-Net architecture and downsampled self-attention for efficient and high-quality image generation. The authors acknowledge limitations in exploring the full potential of U-DiTs due to computational resource constraints and a tight schedule, suggesting further scaling of model size and extending training iterations as future work. diffusion_model, transformer, u-net, image_generation, latent_space, self-attention, downsampling, computational_efficiency
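Illustrative sketch (U-DiT, 2405.02730): one plausible form of token-downsampled self-attention, where the token grid is pixel-unshuffled into four spatially downsampled groups, attention runs within each group, and the groups are shuffled back. Shapes, grouping, and ordering are simplified assumptions about the mechanism, not the released implementation.

```python
import torch
import torch.nn as nn

class DownsampledSelfAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, h, w):
        b, n, c = x.shape                                   # n == h * w
        grid = x.transpose(1, 2).reshape(b, c, h, w)
        groups = nn.functional.pixel_unshuffle(grid, 2)     # (b, 4c, h/2, w/2)
        groups = groups.reshape(b, c, 4, (h // 2) * (w // 2))
        outs = []
        for g in range(4):                                  # attention per downsampled token set
            tokens = groups[:, :, g].transpose(1, 2)        # (b, hw/4, c)
            out, _ = self.attn(tokens, tokens, tokens)
            outs.append(out.transpose(1, 2))
        merged = torch.stack(outs, dim=2).reshape(b, 4 * c, h // 2, w // 2)
        merged = nn.functional.pixel_shuffle(merged, 2)     # back to (b, c, h, w)
        return merged.flatten(2).transpose(1, 2)

layer = DownsampledSelfAttention()
y = layer(torch.randn(2, 16 * 16, 256), 16, 16)
```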
2403.18978 TextCraftor: Your Text Encoder Can be Image Quality Controller Yanyu Li, Xian Liu, Anil Kag, Ju Hu, Yerlan Idelbayev, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov, Jian Ren Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation, enabling significant advancements in areas like image editing and video synthesis. Despite their formidable capabilities, these models are not without their limitations. It is still challenging to synthesize an image that aligns well with the input text, and multiple runs with carefully crafted prompts are required to achieve satisfactory results. To mitigate these limitations, numerous studies have endeavored to fine-tune the pre-trained diffusion models, i.e., UNet, utilizing various technologies. Yet, amidst these efforts, a pivotal question of text-to-image diffusion model training has remained largely unexplored: Is it possible and feasible to fine-tune the text encoder to improve the performance of text-to-image diffusion models? Our findings reveal that, instead of replacing the CLIP text encoder used in Stable Diffusion with other large language models, we can enhance it through our proposed fine-tuning approach, TextCraftor, leading to substantial improvements in quantitative benchmarks and human assessments. Interestingly, our technique also empowers controllable image generation through the interpolation of different text encoders fine-tuned with various rewards. We also demonstrate that TextCraftor is orthogonal to UNet finetuning, and can be combined to further improve generative quality. This paper introduces TextCraftor, a novel method for enhancing text-to-image diffusion models by fine-tuning the text encoder using reward functions, leading to improved image quality and text-image alignment. This paper is important because it addresses the limitations of existing text-to-image diffusion models in generating images that accurately reflect input text prompts. It offers a more efficient alternative to replacing the entire text encoder or relying on manual prompt engineering, which are computationally expensive or require human effort. The authors propose two techniques: 1) Directly fine-tuning with reward: This involves using a reward model to directly assess the quality of images generated from noisy latents. 2) Prompt-based fine-tuning: This addresses limitations of the first technique by using the denoising process to obtain a more accurate final image for reward prediction. They utilize various reward functions like aesthetic scores, text-image alignment scores, and CLIP similarity to guide the fine-tuning process. TextCraftor significantly improves image quality and text-image alignment compared to baseline models like SDv1.5 and SDv2.0, even outperforming larger models like SDXL Base 0.9 and DeepFloyd-XL in some aspects. It achieves better quantitative scores on Parti-Prompts and HPSv2 benchmarks, and human evaluations confirm the superiority of generated images. TextCraftor also enables controllable image generation through interpolation of different fine-tuned text encoders. The authors acknowledge limitations in reward models and the potential for mode collapse. They suggest exploring encoding reward function styles into text encoder tokens as future work. diffusion_model, text-to-image, image_generation, text_encoder, fine-tuning, reward_function, controllable_generation
2312.05239 SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation Thuan Hoang Nguyen, Anh Tran Despite their ability to generate high-resolution and diverse images from text prompts, text-to-image diffusion models often suffer from slow iterative sampling processes. Model distillation is one of the most effective directions to accelerate these models. However, previous distillation methods fail to retain the generation quality while requiring a significant amount of images for training, either from real data or synthetically generated by the teacher model. In response to this limitation, we present a novel image-free distillation scheme named $\textbf{SwiftBrush}$. Drawing inspiration from text-to-3D synthesis, in which a 3D neural radiance field that aligns with the input prompt can be obtained from a 2D text-to-image diffusion prior via a specialized loss without the use of any 3D data ground-truth, our approach re-purposes that same loss for distilling a pretrained multi-step text-to-image model to a student network that can generate high-fidelity images with just a single inference step. In spite of its simplicity, our model stands as one of the first one-step text-to-image generators that can produce images of comparable quality to Stable Diffusion without reliance on any training image data. Remarkably, SwiftBrush achieves an FID score of $\textbf{16.67}$ and a CLIP score of $\textbf{0.29}$ on the COCO-30K benchmark, achieving competitive results or even substantially surpassing existing state-of-the-art distillation techniques. This paper presents SwiftBrush, a novel image-free distillation method for text-to-image diffusion models that enables single-step, high-fidelity image generation without relying on any training image data. This paper is important because it addresses the slow inference speed of traditional text-to-image diffusion models by enabling single-step generation while maintaining high fidelity, which is crucial for deployment on consumer devices and broader accessibility. The authors draw inspiration from text-to-3D synthesis techniques and adapt Variational Score Distillation (VSD) for text-to-image generation. They employ a pretrained text-to-image teacher model and an additional trainable LoRA teacher model to guide the learning of a student model that can generate images from text prompts in a single step. The student model is trained without using any image data, relying solely on text captions and a specialized loss function. SwiftBrush achieves promising zero-shot results on benchmarks like COCO 2014 and Human Preference Score v2, surpassing existing one-step image generation methods in quality while being more efficient and requiring significantly less training time. Notably, SwiftBrush achieves an FID score of 16.67 and a CLIP score of 0.29 on the COCO-30K benchmark without using any training image data. The authors acknowledge that SwiftBrush, while efficient, may produce slightly lower quality images compared to multi-step teacher models. Future work could focus on extending SwiftBrush to support few-step generation, exploring single-teacher distillation, and integrating techniques like DreamBooth, ControlNet, or InstructPix2Pix for enhanced control and application. diffusion_model, distillation, text-to-image, one-step generation, image-free, gan, nerf, sds, vsd, lora
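Since the row above describes an image-free VSD-style distillation, here is a hedged sketch of one training step: the one-step student generates a fake image from noise and text, the image is re-noised, and the gradient direction is the difference between the frozen teacher's and the trainable LoRA teacher's noise predictions. All callables, the timestep range, and the weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, lora_teacher, text_emb, alphas_cumprod, opt_s, opt_l):
    z = torch.randn(text_emb.shape[0], 4, 64, 64, device=text_emb.device)
    x0 = student(z, text_emb)                                # one-step generation
    t = torch.randint(20, 980, (x0.shape[0],), device=x0.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise              # re-noise the fake sample

    with torch.no_grad():
        eps_teacher = teacher(xt, t, text_emb)               # frozen multi-step teacher
        eps_lora = lora_teacher(xt, t, text_emb)             # trainable LoRA teacher
    grad = eps_teacher - eps_lora                            # VSD gradient direction
    loss_student = (grad * x0).mean()                        # routes grad into the student
    opt_s.zero_grad(); loss_student.backward(); opt_s.step()

    # the LoRA teacher is trained with an ordinary denoising loss on student samples
    eps_pred = lora_teacher(xt.detach(), t, text_emb)
    loss_lora = F.mse_loss(eps_pred, noise)
    opt_l.zero_grad(); loss_lora.backward(); opt_l.step()
```

No real images appear anywhere in the loop, which is what makes the scheme "image-free".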
2404.18861 A Survey on Vision Mamba: Models, Applications and Challenges Rui Xu, Shu Yang, Yihui Wang, Bo Du, Hao Chen Mamba, a recent selective structured state space model, performs excellently on long sequence modeling tasks. Mamba mitigates the modeling constraints of convolutional neural networks and offers advanced modeling capabilities similar to those of Transformers, through global receptive fields and dynamic weighting. Crucially, it achieves this without incurring the quadratic computational complexity typically associated with Transformers. Due to its advantages over the former two mainstream foundation models, Mamba exhibits great potential to be a visual foundation model. Researchers are actively applying Mamba to various computer vision tasks, leading to numerous emerging works. To help keep pace with the rapid advancements in computer vision, this paper aims to provide a comprehensive review of visual Mamba approaches. This paper begins by delineating the formulation of the original Mamba model. Subsequently, our review of visual Mamba delves into several representative backbone networks to elucidate the core insights of the visual Mamba. We then categorize related works using different modalities, including image, video, point cloud, multi-modal, and others. Specifically, for image applications, we further organize them into distinct tasks to facilitate a more structured discussion. Finally, we discuss the challenges and future research directions for visual Mamba, providing insights for future research in this quickly evolving area. A comprehensive list of visual Mamba models reviewed in this work is available at https://github.com/Ruixxxx/Awesome-Vision-Mamba-Models. This paper presents a comprehensive survey of Vision Mamba, a novel and efficient neural network architecture for visual tasks, exploring its underlying principles, diverse applications across various visual domains, and outlining future research directions. This survey is important because it examines the rapid advancements and growing influence of Vision Mamba in the computer vision field, providing a timely and valuable resource for researchers to understand the core concepts, explore applications, and contribute to its ongoing development. The authors provide a structured analysis of Vision Mamba by first introducing its foundational principles, followed by in-depth examinations of representative backbone networks and categorizing its applications based on visual modalities such as image, video, multi-modal data, and point clouds. The paper concludes by critically analyzing the challenges and outlining future research directions. The paper highlights Vision Mamba's effectiveness in various computer vision tasks, including classification, segmentation, generation, and restoration, across diverse domains like medical imaging, remote sensing, and video understanding. It also provides insights into how different visual Mamba models address the unique characteristics of visual data and discusses their performance compared to traditional convolutional neural networks and Transformers. The paper identifies key limitations of Vision Mamba, including stability issues when scaling to large datasets, challenges in adapting causal scanning mechanisms to non-causal visual data, potential loss of spatial information during 1D scanning, information redundancy and increased computational demands due to multi-directional scanning, and the need for enhanced interpretability, generalization ability, and robustness. Future research directions include developing more efficient scanning techniques and fusion methods, optimizing computational efficiency, and exploring applications in data-efficient learning, high-resolution data analysis, multi-modal learning, and in-context learning. diffusion_model, llm, analysis, literature_review, 3d, motion, video, interpretability, vision_transformer, state_space_model
2310.12036 A General Theoretical Paradigm to Understand Learning from Human Preferences Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an approach that bypasses the second approximation and learns a policy directly from collected data without the reward modelling stage. However, this method still heavily relies on the first approximation. In this paper we try to gain a deeper theoretical understanding of these practical algorithms. In particular we derive a new general objective called $\Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential pitfalls. We then consider another special case for $\Psi$PO by setting $\Psi$ simply to Identity, for which we can derive an efficient optimisation procedure, prove performance guarantees and demonstrate its empirical superiority to DPO on some illustrative examples. This paper presents a theoretical framework, called Ψ-preference optimization (ΨPO), for learning from human preferences, unifying existing methods like RLHF and DPO and highlighting potential pitfalls. The paper addresses the lack of theoretical understanding of current preference learning methods, despite their practical success, particularly in aligning large language models with human preferences. The authors introduce ΨPO as a general objective function, analyze specific cases like RLHF and DPO, identify potential overfitting issues, and propose a simplified variant, Identity-PO (IPO), with a computationally efficient algorithm. The paper shows that ΨPO generalizes RLHF and DPO, both vulnerable to overfitting due to their reliance on the Bradley-Terry model. The proposed IPO method, using the identity mapping in ΨPO, avoids overfitting by directly optimizing regularized total preferences. Experiments on illustrative bandit examples demonstrate IPO's improved stability and adherence to the reference policy compared to DPO. While the paper provides a theoretical analysis and illustrative examples, future work should focus on scaling up IPO to more complex scenarios, such as training large language models on human preference data, to assess its real-world effectiveness. rlhf, dpo, llm, analysis, preference learning, overfitting, regularization, bandit, optimization
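For reference, a compact restatement (in my own notation, following the setup described above) of the ΨPO objective and its identity special case IPO; τ is the KL-regularization strength, p* the true pairwise preference probability, μ a behavior policy, and (y_w, y_l) a preferred/dispreferred completion pair.

```latex
% \Psi-preference optimisation: maximise a non-decreasing transform \Psi of the
% pairwise preference probability, regularised towards a reference policy.
\max_{\pi}\;
\mathbb{E}_{x\sim\rho,\; y\sim\pi(\cdot\mid x),\; y'\sim\mu(\cdot\mid x)}
  \big[\Psi\big(p^{*}(y \succ y' \mid x)\big)\big]
  \;-\; \tau\, D_{\mathrm{KL}}\!\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)

% Setting \Psi to the identity yields IPO, whose empirical loss on preference
% pairs (y_w \succ y_l) takes a simple regression form:
\mathcal{L}_{\mathrm{IPO}}(\pi) =
\mathbb{E}_{(x,\,y_w,\,y_l)}
\Big[\Big(
  \log\frac{\pi(y_w\mid x)\,\pi_{\mathrm{ref}}(y_l\mid x)}
           {\pi(y_l\mid x)\,\pi_{\mathrm{ref}}(y_w\mid x)}
  \;-\;\frac{1}{2\tau}\Big)^{2}\Big]
```

RLHF under the Bradley-Terry model corresponds to choosing Ψ as the logit function, which is what makes it prone to the overfitting behavior discussed in the row above.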
2401.06805 Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract reasoning ability is the goal of next-generation AI. Recent advancements in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. Particularly, various MLLMs, each with distinct model architectures, training data, and training stages, have been evaluated across a broad range of MLLM benchmarks. These studies have, to varying degrees, revealed different aspects of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs have not been systematically investigated. In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions. We believe our survey establishes a solid base and sheds light on this important topic, multimodal reasoning. This paper surveys the current state of multimodal reasoning in Multimodal Large Language Models (MLLMs), exploring their architectures, training methods, and performance on various reasoning tasks. This paper is important because it provides a comprehensive overview of the rapidly developing field of MLLMs, focusing specifically on their reasoning abilities which are crucial for achieving artificial general intelligence. The authors reviewed existing literature on MLLMs, analyzed their architectures, training datasets, and performance on various reasoning benchmarks, and categorized the applications of these models. The paper highlights that while MLLMs have shown impressive capabilities in multimodal tasks, their reasoning abilities still lag behind proprietary models like GPT-4V. The authors identified key factors contributing to the superior performance of some MLLMs, including unfreezing the language model during training, improving visual representations, and utilizing multi-task supervised learning. The paper points out limitations in current MLLM architectures, training efficiency, long-context support, instruction fine-tuning data, and evaluation benchmarks. It suggests future research directions, including developing more robust architectures, efficient training methods, long-context support mechanisms, improved instruction datasets, and more comprehensive evaluation benchmarks. mllm, llm, multimodal_reasoning, instruction_tuning, in-context_learning, analysis, literature_review, embodied_ai, tool_usage
2404.19227 Espresso: Robust Concept Filtering in Text-to-Image Models Anudeep Das, Vasisht Duddu, Rui Zhang, N. Asokan Diffusion-based text-to-image (T2I) models generate high-fidelity images for given textual prompts. They are trained on large datasets scraped from the Internet, potentially containing unacceptable concepts (e.g., copyright infringing or unsafe). Retraining T2I models after filtering out unacceptable concepts in the training data is inefficient and degrades utility. Hence, there is a need for concept removal techniques (CRTs) which are effective in removing unacceptable concepts, utility-preserving on acceptable concepts, and robust against evasion with adversarial prompts. None of the prior filtering and fine-tuning CRTs satisfy all these requirements simultaneously. We introduce Espresso, the first robust concept filter based on Contrastive Language-Image Pre-Training (CLIP). It identifies unacceptable concepts by projecting the generated image's embedding onto the vector connecting unacceptable and acceptable concepts in the joint text-image embedding space. This ensures robustness by restricting the adversary to adding noise only along this vector, in the direction of the acceptable concept. Further fine-tuning Espresso to separate embeddings of acceptable and unacceptable concepts, while preserving their pairing with image embeddings, ensures both effectiveness and utility. We evaluate Espresso on eleven concepts to show that it is effective (~5% CLIP accuracy on unacceptable concepts), utility-preserving (~93% normalized CLIP score on acceptable concepts), and robust (~4% CLIP accuracy on adversarial prompts for unacceptable concepts). Finally, we present theoretical bounds for the certified robustness of Espresso against adversarial prompts, and an empirical analysis. This paper introduces Espresso, a robust concept filtering technique for text-to-image (TTI) models that uses Contrastive Language-Image Pre-Training (CLIP) to identify and suppress the generation of unacceptable concepts in images. This paper addresses the crucial need for robust and utility-preserving concept removal techniques in TTI models. The presence of unacceptable concepts (e.g., copyrighted material, inappropriate content) in TTI outputs poses significant ethical and legal challenges. Existing methods either compromise utility for effectiveness or lack robustness against adversarial prompts. Espresso offers a novel approach that balances all three requirements, making it a valuable contribution to the field of safe and responsible TTI generation. The authors developed Espresso by modifying CLIP's classification objective to consider the cosine similarity of a generated image's embedding to both acceptable and unacceptable concept embeddings. This projection onto a lower-dimensional vector connecting the concepts enhances robustness. Further, they fine-tune Espresso to separate embeddings of acceptable and unacceptable concepts while preserving their pairing with image embeddings, ensuring effectiveness and utility. They evaluate Espresso's performance on eleven concepts, comparing it to six state-of-the-art fine-tuning concept removal techniques and one filtering technique. They also present theoretical bounds for certified robustness and empirical analysis. Espresso demonstrates effectiveness in suppressing unacceptable concepts, achieving a low CLIP accuracy on unacceptable prompts. It maintains high utility on acceptable prompts, showing comparable normalized CLIP scores to other techniques. Importantly, Espresso exhibits strong robustness against various adversarial attacks, including Typo+, PEZ+, CCE/CCE+, and RingBell+, outperforming existing techniques. The empirical evaluation of certified robustness further supports Espresso's resilience to adversarial noise in image embeddings. The paper acknowledges the limitations of the current certified robustness bound, which is loose compared to the distance between acceptable and unacceptable images. Future work involves tightening this bound and exploring adversarial training to further enhance robustness. Additionally, the paper suggests extending Espresso to handle multiple concept filtering simultaneously and optimizing it for filtering artistic styles, which currently poses a challenge due to the similarity of concept embeddings. diffusion_model, tti, clip, analysis, adversarial_attack, interpretability, robustness, concept_filtering, safety
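A minimal sketch of the filtering decision described above: the generated image's CLIP embedding is compared against an acceptable and an unacceptable concept embedding, and the image is suppressed when it sits closer to the unacceptable one. The temperature and threshold values are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def filter_image(image_emb, acc_text_emb, unacc_text_emb, temperature=0.07, threshold=0.5):
    img = F.normalize(image_emb, dim=-1)
    acc = F.normalize(acc_text_emb, dim=-1)
    unacc = F.normalize(unacc_text_emb, dim=-1)
    # two-way classification along the acceptable/unacceptable direction
    logits = torch.stack([img @ acc, img @ unacc], dim=-1) / temperature
    p_unacceptable = logits.softmax(dim=-1)[..., 1]
    return p_unacceptable > threshold      # True -> suppress the generation
```

Because the decision depends only on where the embedding falls between the two concept anchors, an adversary is restricted to pushing the embedding along that single direction, which is the intuition behind the robustness claim.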
2403.17377 Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, Seungryong Kim Recent studies have demonstrated that diffusion models are capable of generating high-quality samples, but their quality heavily depends on sampling guidance techniques, such as classifier guidance (CG) and classifier-free guidance (CFG). These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration. In this paper, we propose a novel sampling guidance, called Perturbed-Attention Guidance (PAG), which improves diffusion sample quality across both unconditional and conditional settings, achieving this without requiring additional training or the integration of external modules. PAG is designed to progressively enhance the structure of samples throughout the denoising process. It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, by considering the self-attention mechanisms' ability to capture structural information, and guiding the denoising process away from these degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios. Moreover, PAG significantly improves the baseline performance in various downstream tasks where existing guidances such as CG or CFG cannot be fully utilized, including ControlNet with empty prompts and image restoration such as inpainting and deblurring. This paper introduces Perturbed-Attention Guidance (PAG), a novel sampling guidance method for diffusion models that enhances sample quality by perturbing self-attention maps during the denoising process, eliminating the need for additional training or external modules. The paper addresses the limitations of existing guidance techniques like Classifier Guidance (CG) and Classifier-Free Guidance (CFG), which often lack applicability in unconditional generation or specific downstream tasks. PAG offers a more versatile approach, enhancing sample quality in both conditional and unconditional scenarios without requiring extra training or external components. PAG leverages the observation that self-attention maps in diffusion U-Nets capture structural information. The method perturbs these maps by replacing them with identity matrices, creating intermediate samples with degraded structures. This 'undesirable' path guides the denoising process towards generating samples with superior structural coherence and realism. PAG significantly improves sample quality in both ADM and Stable Diffusion, evident in enhanced FID and IS scores, particularly in unconditional generation where CFG is inapplicable. PAG also complements CFG, leading to further quality improvements when used in conjunction. The method's efficacy extends to downstream tasks like image restoration (PSLD) and spatially conditioned generation (ControlNet), demonstrating its versatility. The authors acknowledge limitations such as potential over-saturation at high guidance scales and the computational overhead of two forward passes per generation step. Future work could focus on mitigating these limitations by exploring techniques for efficient guidance computation and hyperparameter optimization. diffusion_model, guidance, self-attention, unconditional_generation, image_restoration, controlnet, sample_quality, pag
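The guidance rule described above is simple to state in code: a second forward pass uses the same weights but replaces selected self-attention maps with the identity (each token attends only to itself, so the attention output reduces to the value tokens), and the final prediction is pushed away from that structurally degraded branch. Function names below are placeholders for illustration.

```python
import torch

def identity_self_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # softmax(QK^T) replaced by the identity matrix -> the output is simply V
    return v

def pag_epsilon(unet, unet_perturbed, x_t, t, cond, scale=3.0):
    eps = unet(x_t, t, cond)                 # ordinary prediction
    eps_ptb = unet_perturbed(x_t, t, cond)   # same weights, identity self-attention in chosen blocks
    return eps + scale * (eps - eps_ptb)     # guide the sample away from the degraded branch
```

The update has the same shape as classifier-free guidance but needs no text condition, which is why it also works in unconditional generation and can be combined additively with CFG.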
2311.16973 DemoFusion: Democratising High-Resolution Image Generation With No $$$ Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, Zhanyu Ma High-resolution image generation with Generative Artificial Intelligence (GenAI) has immense potential but, due to the enormous capital investment required for training, it is increasingly centralised to a few large corporations, and hidden behind paywalls. This paper aims to democratise high-resolution GenAI by advancing the frontier of high-resolution generation while remaining accessible to a broad audience. We demonstrate that existing Latent Diffusion Models (LDMs) possess untapped potential for higher-resolution image generation. Our novel DemoFusion framework seamlessly extends open-source GenAI models, employing Progressive Upscaling, Skip Residual, and Dilated Sampling mechanisms to achieve higher-resolution image generation. The progressive nature of DemoFusion requires more passes, but the intermediate results can serve as "previews", facilitating rapid prompt iteration. This paper introduces DemoFusion, a method for generating high-resolution images from pre-trained Latent Diffusion Models (LDMs) like SDXL without requiring additional training. This paper is important because it addresses the increasing centralization and paywalling of high-resolution image generation by enabling access to this technology using consumer-grade hardware and open-source models. The authors propose DemoFusion, which extends MultiDiffusion with three key mechanisms: Progressive Upscaling for iteratively enhancing image resolution, Skip Residual for maintaining global consistency, and Dilated Sampling for increasing global semantic coherence during image generation. DemoFusion generates high-resolution images with better quality and coherence compared to baselines like MultiDiffusion and SDXL+BSRGAN, as evidenced by qualitative and quantitative comparisons using metrics such as FID, IS, and CLIP score. Limitations include longer inference time due to progressive upscaling and dependence on the underlying LDM's performance. Future work could involve training LDMs specifically for DemoFusion or exploring more efficient inference strategies. diffusion_model, image_generation, high_resolution, sdxl, progressive_upscaling, skip_residual, dilated_sampling
2311.18828 One-step Diffusion with Distribution Matching Distillation Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, Taesung Park Diffusion models generate high-quality images but require dozens of forward passes. We introduce Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator with minimal impact on image quality. We enforce the one-step image generator match the diffusion model at distribution level, by minimizing an approximate KL divergence whose gradient can be expressed as the difference between 2 score functions, one of the target distribution and the other of the synthetic distribution being produced by our one-step generator. The score functions are parameterized as two diffusion models trained separately on each distribution. Combined with a simple regression loss matching the large-scale structure of the multi-step diffusion outputs, our method outperforms all published few-step diffusion approaches, reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot COCO-30k, comparable to Stable Diffusion but orders of magnitude faster. Utilizing FP16 inference, our model generates images at 20 FPS on modern hardware. This paper introduces Distribution Matching Distillation (DMD), a method for converting a diffusion model into a one-step image generator with minimal quality loss by minimizing the KL divergence between real and generated image distributions using a pair of diffusion models. This paper is important because it addresses the slow sampling speed of diffusion models, enabling near-real-time image generation with quality comparable to traditional multi-step methods. The authors train a one-step generator with a distribution matching loss, estimated from scores derived from two diffusion models, and a regression loss based on a pre-computed dataset of noise-image pairs from the original diffusion model. DMD outperforms existing diffusion distillation techniques, achieving FIDs of 2.62 on ImageNet 64x64 and 11.49 on zero-shot COCO-30k, comparable to Stable Diffusion but significantly faster (20 FPS). Limitations include a minor quality gap compared to multi-step diffusion and challenges in generating text and fine details. Future work involves distilling more advanced models and exploring variable guidance scales. diffusion_model, distillation, image_generation, text-to-image, one-step, kl_divergence, score_matching
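To illustrate the distribution-matching update described above: the KL gradient at a noised generator sample is approximated by the difference between a "fake" score model (trained on generator outputs) and the frozen "real" base diffusion model, with a paired regression loss anchoring large-scale structure. This is a hedged sketch; plain MSE stands in for the paper's perceptual regression loss, and all callables and timestep ranges are assumptions.

```python
import torch
import torch.nn.functional as F

def dmd_losses(generator, real_score, fake_score, z, text_emb, alphas_cumprod,
               paired_z, paired_text, paired_target):
    x0 = generator(z, text_emb)                                  # one-step sample
    t = torch.randint(20, 980, (x0.shape[0],), device=x0.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)   # re-noise the sample
    with torch.no_grad():
        grad = fake_score(xt, t, text_emb) - real_score(xt, t, text_emb)
    loss_dm = (grad * x0).mean()                                 # pushes p_fake towards p_real
    # regression on pre-computed (noise, teacher output) pairs preserves global structure
    loss_reg = F.mse_loss(generator(paired_z, paired_text), paired_target)
    return loss_dm, loss_reg
```

The fake score model is itself kept up to date with a standard denoising loss on generator samples, mirroring the LoRA-teacher update in VSD-style distillation.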
2405.05846 Could It Be Generated? Towards Practical Analysis of Memorization in Text-To-Image Diffusion Models Zhe Ma, Xuhong Zhang, Qingming Li, Tianyu Du, Wenzhi Chen, Zonghui Wang, Shouling Ji The past few years have witnessed substantial advancement in text-guided image generation powered by diffusion models. However, it was shown that text-to-image diffusion models are vulnerable to training image memorization, raising concerns about copyright infringement and privacy invasion. In this work, we perform practical analysis of memorization in text-to-image diffusion models. Targeting a set of images to protect, we conduct quantitative analysis on them without the need to collect any prompts. Specifically, we first formally define the memorization of image and identify three necessary conditions of memorization, respectively similarity, existence and probability. We then reveal the correlation between the model's prediction error and image replication. Based on the correlation, we propose to utilize inversion techniques to verify the safety of target images against memorization and measure the extent to which they are memorized. Model developers can utilize our analysis method to discover memorized images or reliably claim safety against memorization. Extensive experiments on Stable Diffusion, a popular open-source text-to-image diffusion model, demonstrate the effectiveness of our analysis method. This paper presents a practical method for analyzing memorization in text-to-image diffusion models, focusing on identifying and quantifying the extent to which specific images are memorized. This paper addresses the risk of copyright infringement and privacy violation posed by memorization in text-to-image diffusion models trained on massive datasets. It offers a practical tool for model developers to assess and mitigate these risks, contributing to responsible AI development. The authors define three conditions for memorization: similarity, existence, and probability. They propose using the model's prediction error as a measure of image replication (similarity). To find prompts that trigger memorization (existence), they develop a prompt inversion algorithm with regularization to ensure realistic token embeddings. Lastly, they measure the extent of memorization (probability) by comparing the prediction error distribution of the target image under the inverted prompt with that of a safe, unconditional diffusion model. The paper demonstrates that the model's prediction error effectively identifies image replication. The proposed prompt inversion method can successfully trigger memorization for a significant portion of known memorized images. Moreover, the analysis reveals that unconditional diffusion models are generally safe from memorization, validating their use as a baseline for measuring memorization in conditional models. The authors acknowledge two limitations. First, their hard prompt inversion algorithm, although outperforming existing methods, is not entirely foolproof, especially for images requiring multiple key tokens. Second, the analysis focuses on text-to-image models, with further research needed for other conditional diffusion models. Future work could focus on improving hard prompt inversion and expanding the analysis to different types of conditional diffusion models. diffusion_model, memorization, analysis, text-to-image, security, privacy, copyright, inversion
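The prediction-error signal used as the replication measure above is straightforward to estimate: a memorized image keeps an unusually low conditional denoising error across sampled timesteps. A hedged sketch against a diffusers-style interface (unet and scheduler are placeholders, not the authors' code):

```python
import torch

@torch.no_grad()
def prediction_error(unet, scheduler, latent, cond, n_samples=32):
    """Average denoising error of a single image latent under a given prompt embedding."""
    errs = []
    for _ in range(n_samples):
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=latent.device)
        noise = torch.randn_like(latent)
        noisy = scheduler.add_noise(latent, noise, t)
        pred = unet(noisy, t, encoder_hidden_states=cond).sample
        errs.append(torch.mean((pred - noise) ** 2))
    return torch.stack(errs).mean()   # low error under an inverted prompt -> likely memorized
```

In the paper's framing, this statistic is compared against the error distribution of a safe unconditional model to decide whether the target image crosses the memorization threshold.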
2403.06634 Stealing Part of a Production Language Model Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Eric Wallace, David Rolnick, Florian Tramèr We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under \$20 USD, our attack extracts the entire projection matrix of OpenAI's Ada and Babbage language models. We thereby confirm, for the first time, that these black-box models have a hidden dimension of 1024 and 2048, respectively. We also recover the exact hidden dimension size of the gpt-3.5-turbo model, and estimate it would cost under \$2,000 in queries to recover the entire projection matrix. We conclude with potential defenses and mitigations, and discuss the implications of possible future work that could extend our attack. This paper introduces the first model-stealing attack that extracts precise, nontrivial information from black-box production language models: given only typical API access, it recovers a transformer's embedding projection layer (up to symmetries) and, with it, the model's hidden dimension. This paper is important because it demonstrates a concrete, inexpensive security flaw in deployed LLM APIs: for under $20 the authors extract the entire projection matrix of OpenAI's Ada and Babbage models, confirming hidden dimensions of 1024 and 2048, and they recover the exact hidden dimension of gpt-3.5-turbo, estimating that the full projection matrix could be extracted for under $2,000 in queries. Understanding this attack surface is crucial for designing API-level defenses and for assessing the broader security landscape of production LLMs. The authors formalize a threat model for extracting the final embedding projection layer, justify mathematically why the attack works (the returned logits are a linear image of the final hidden state, so a collection of full logit vectors spans a subspace whose dimension equals the hidden size), describe a series of attack settings under progressively more restrictive API access, validate the attacks against models with known weights, test robustness against noise, and analyze the impact of components such as layer normalization. The attack succeeds in practice: the projection matrices of Ada and Babbage are extracted for under $20, their hidden dimensions (1024 and 2048) are confirmed for the first time, and the exact hidden dimension of gpt-3.5-turbo is recovered. White-box evaluations and noise-robustness tests support the reliability of the extraction, and the analysis of layer normalization shows it can add another dimension for the attacker to exploit. The paper concludes with proposed defenses and mitigations whose effectiveness requires further investigation, and notes that the implications of recovering only the components of the embedding matrix associated with large singular values remain unclear. Future work includes noise-injection and other mitigation strategies, using the recovered embedding layer to bootstrap other attacks, examining whether output filters can be bypassed, and extending the attack beyond a single layer. llm, security, adversarial_attack, model_extraction, vulnerability, layer_normalization, defense, robustness
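A toy numeric illustration (my own, not the paper's attack code) of why full logit vectors leak the hidden dimension: logits are H @ W_out with H of width h, so a stack of logit vectors has numerical rank h, which an SVD reveals.

```python
import numpy as np

vocab, hidden, n_queries = 5000, 1024, 2048
rng = np.random.default_rng(0)
W_out = rng.normal(size=(hidden, vocab))          # stand-in for the embedding projection layer
H = rng.normal(size=(n_queries, hidden))          # final hidden states for the queried prompts
logits = H @ W_out                                # what an API returning full logits exposes

s = np.linalg.svd(logits, compute_uv=False)
est_hidden = int((s > s[0] * 1e-8).sum())         # count the significant singular values
print(est_hidden)                                 # -> 1024, i.e. the hidden dimension
```

Production APIs expose far less than full logit vectors, which is why the paper devotes its attack settings to reconstructing this information under tighter query interfaces.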
2312.06655 Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior Fangfu Liu, Diankun Wu, Yi Wei, Yongming Rao, Yueqi Duan Recently, 3D content creation from text prompts has demonstrated remarkable progress by utilizing 2D and 3D diffusion models. While 3D diffusion models ensure great multi-view consistency, their ability to generate high-quality and diverse 3D assets is hindered by the limited 3D data. In contrast, 2D diffusion models find a distillation approach that achieves excellent generalization and rich details without any 3D data. However, 2D lifting methods suffer from inherent view-agnostic ambiguity thereby leading to serious multi-face Janus issues, where text prompts fail to provide sufficient guidance to learn coherent 3D results. Instead of retraining a costly viewpoint-aware model, we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper, we propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity, generalizability, and geometric consistency simultaneously. Specifically, we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: a structural guidance for geometric fidelity and a semantic guidance for 3D coherence. Employing the two types of guidance, the 2D diffusion model enriches the 3D content with diversified and high-quality results. Extensive experiments show the superiority of our Sherpa3D over the state-of-the-art text-to-3D methods in terms of quality and 3D consistency. This paper introduces Sherpa3D, a novel text-to-3D generation framework that leverages coarse 3D priors from 3D diffusion models to guide 2D diffusion models, achieving high-fidelity, diversified, and multi-view consistent 3D content. The paper addresses limitations in existing text-to-3D methods, which often struggle with either limited generalizability and quality (3D diffusion models) or multi-view inconsistency (2D lifting methods). Sherpa3D bridges this gap by combining the strengths of both approaches, offering a promising solution for efficient and high-quality 3D content creation. Sherpa3D employs a three-stage process: 1) It generates a coarse 3D prior using a 3D diffusion model. 2) It introduces structural and semantic guidance mechanisms derived from the 3D prior to guide the 2D lifting optimization. 3) It integrates the 3D guidance with a score distillation sampling (SDS) loss, using an annealing technique to balance the influence of 3D guidance and 2D refinement. This process enables Sherpa3D to produce detailed and consistent 3D objects from text prompts. Sherpa3D demonstrates superior performance over existing text-to-3D methods in both qualitative and quantitative evaluations. It generates high-fidelity 3D assets with compelling texture quality and multi-view consistency, outperforming baselines in terms of CLIP R-Precision and user-rated quality and consistency. The authors show that Sherpa3D is efficient, taking only 25 minutes to generate a 3D model from a text prompt. The authors acknowledge that the quality of Sherpa3D's output is inherently limited by the underlying 2D and 3D diffusion models used. Future work could explore leveraging larger, more advanced diffusion models (e.g., SDXL, DeepFloyd) to further enhance the generation quality. Additionally, the authors are interested in extending Sherpa3D's capabilities to more complex and creative tasks, such as text-to-4D generation. diffusion_model, 3d, text-to-3d, multi-view consistency, generative_model, score_distillation_sampling
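Since the row above anchors the method on a score distillation sampling (SDS) loss, here is the standard SDS gradient, restated for context (this is the well-known generic form, not Sherpa3D's full objective, which adds structural and semantic guidance terms derived from the coarse 3D prior):

```latex
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,
    \big(\epsilon_{\phi}(x_t;\, y,\, t) - \epsilon\big)\,
    \frac{\partial x}{\partial \theta} \right],
\qquad x = g(\theta),\quad x_t = \alpha_t\, x + \sigma_t\, \epsilon
```

Here g(θ) renders an image from the 3D representation, ε_φ is the frozen 2D diffusion model conditioned on the prompt y, and w(t) is a timestep weighting.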
2308.09351 RLIPv2: Fast Scaling of Relational Language-Image Pre-training Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully-finetuning, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29mAP on HICO-DET without any fine-tuning, yields 32.22mAP with just 1% data and yields 45.09mAP with 100% data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2. This paper introduces RLIPv2, an improved model for Relational Language-Image Pre-training (RLIP) that focuses on fast convergence and scalability by using large-scale pseudo-labelled scene graph data. This research is important because it addresses the limitations of RLIPv1, the previous iteration, which struggled with slow convergence and limited scene graph data. By enabling efficient scaling, RLIPv2 pushes the boundaries of relational reasoning in computer vision, achieving state-of-the-art results in tasks like HOI detection and scene graph generation. The authors introduce Asymmetric Language-Image Fusion (ALIF) for faster convergence, employing sparse language encoding and early fusion. To generate large-scale pseudo-labelled scene graph data, they combine object detection datasets with BLIP-generated captions and a Relation Tagger built on RLIPv2 itself. They conduct extensive experiments on datasets like HICO-DET, V-COCO, and Open Images v6, comparing various settings like zero-shot, few-shot, and fully-finetuned learning. RLIPv2 demonstrates superior performance across HOI detection and Scene Graph Generation benchmarks. Notably, it achieves state-of-the-art results on Open Images v6 for SGG and impressive zero-shot, few-shot, and fully-finetuned results on HICO-DET, demonstrating significant data efficiency and exceeding previous methods. For example, the largest RLIPv2 achieves 23.29mAP on HICO-DET without fine-tuning, 32.22mAP with 1% data, and 45.09mAP with full data. The authors acknowledge the reliance on external captioner quality as a limitation, where noisy captions can impact performance. Future work includes exploring advanced captioning techniques for higher-quality pseudo-labels and investigating methods to overcome challenges posed by complex scenes with multiple similar objects. vision_language_model, pre-training, scene_graph_generation, hoi_detection, pseudo_labeling, cross_modal_fusion, zero_shot, few_shot
2405.04312 Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer Zhuoyi Yang, Heyang Jiang, Wenyi Hong, Jiayan Teng, Wendi Zheng, Yuxiao Dong, Ming Ding, Jie Tang Diffusion models have shown remarkable performance in image generation in recent years. However, due to a quadratic increase in memory during generating ultra-high-resolution images (e.g. 4096*4096), the resolution of generated images is often limited to 1024*1024. In this work, we propose a unidirectional block attention mechanism that can adaptively adjust the memory overhead during the inference process and handle global dependencies. Building on this module, we adopt the DiT structure for upsampling and develop an infinite super-resolution model capable of upsampling images of various shapes and resolutions. Comprehensive experiments show that our model achieves SOTA performance in generating ultra-high-resolution images in both machine and human evaluation. Compared to commonly used UNet structures, our model can save more than 5x memory when generating 4096*4096 images. The project URL is https://github.com/THUDM/Inf-DiT. This paper introduces Inf-DiT, a memory-efficient diffusion transformer model for upsampling images to ultra-high resolutions by leveraging a novel Unidirectional Block Attention (UniBA) mechanism to process images in smaller blocks, thereby significantly reducing memory requirements. This work addresses the critical limitation of existing diffusion models in generating ultra-high-resolution images due to quadratic memory scaling. Inf-DiT offers a solution by enabling the generation of images at resolutions exceeding 4096x4096 pixels, which was previously infeasible due to memory constraints, opening possibilities for various applications requiring high-fidelity visuals. The authors propose UniBA, which divides images into blocks and processes them sequentially in batches, minimizing the number of hidden states in memory at any given time. Inf-DiT incorporates this mechanism into a diffusion transformer architecture, utilizing techniques like global CLIP image embedding for semantic consistency and nearby LR cross-attention for local detail preservation. Trained on a dataset of high-resolution images and evaluated on benchmarks like HPDv2 and DIV2K, Inf-DiT demonstrates superior performance in image upsampling and super-resolution tasks. Inf-DiT achieves state-of-the-art performance on ultra-high resolution image generation (up to 4096x4096) as measured by FID and FIDcrop metrics, outperforming baselines like SDXL, MultiDiffusion, and DemoFusion. It also excels in classic super-resolution benchmarks on the DIV2K dataset, surpassing models like BSRGAN and StableSR. Human evaluations confirm Inf-DiT's superiority in detail authenticity, global coherence, and consistency with low-resolution inputs. Notably, it maintains a low memory footprint, approximately 5 times lower than SDXL when generating 4096x4096 images. The authors acknowledge limitations in iterative upsampling, where errors from earlier stages can propagate and be difficult to correct in later stages. Future work could explore techniques for error correction and improved handling of long-range dependencies during iterative upsampling. Additionally, investigating the application of UniBA to other diffusion-based tasks beyond image generation could be a promising direction. diffusion_model, transformer, super-resolution, image_generation, ultra-high-resolution, memory_efficient, uniba, inf-dit, clip
2312.05491 Using Captum to Explain Generative Language Models Vivek Miglani, Aobo Yang, Aram H. Markosyan, Diego Garcia-Olano, Narine Kokhlikyan Captum is a comprehensive library for model explainability in PyTorch, offering a range of methods from the interpretability literature to enhance users' understanding of PyTorch models. In this paper, we introduce new features in Captum that are specifically designed to analyze the behavior of generative language models. We provide an overview of the available functionalities and example applications of their potential for understanding learned associations within generative language models. This paper introduces new features in Captum v0.7, a model explainability library for PyTorch, specifically designed to analyze the behavior of generative language models like GPT-3. This paper is important as it addresses the growing need for explainability in large language models (LLMs) by introducing new tools in Captum that enhance the understanding of these models, especially for critical applications. The authors introduce new functionalities in Captum, focusing on perturbation-based (Feature Ablation, LIME, Kernel SHAP, Shapley Value Sampling) and gradient-based (Saliency, Integrated Gradients) attribution methods. They provide APIs to define custom features, baselines, masking, and target selection for analyzing LLM behavior. The paper showcases the application of new Captum functionalities in understanding model associations, revealing potential biases by analyzing attribution scores for input features. Additionally, they demonstrate the evaluation of few-shot prompt effectiveness, highlighting an unexpected reduction in confidence for a sentiment prediction task. The authors acknowledge limitations in current attribution methods and highlight the need for automated feature and baseline selection. Future work involves incorporating other interpretability techniques, improving automation, and optimizing runtime performance for the open-source community. llm, analysis, interpretability, explainability, attribution, perturbation-based methods, gradient-based methods, open-source, captum
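As a concrete illustration of the perturbation-based LLM attribution workflow described above, here is a short example following the Captum v0.7 tutorial-style API (class names and result fields should be checked against the installed Captum version; the GPT-2 model and the prompt template are illustrative choices, not the paper's experiments).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from captum.attr import FeatureAblation, LLMAttribution, TextTemplateInput

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

fa = FeatureAblation(model)                # perturbation-based attribution method
llm_attr = LLMAttribution(fa, tokenizer)   # wraps it for generative language models

# Treat the filled-in template slots as the interpretable input features.
inp = TextTemplateInput(
    "{} lives in {}, {} and is a {}. {} personal interests include",
    values=["Dave", "Palm Coast", "FL", "lawyer", "His"],
)
result = llm_attr.attribute(inp, target="playing golf, hiking, and cooking.")
print(result.seq_attr)                     # attribution score per input feature
```

The same wrapper accepts other perturbation-based methods (e.g., Shapley Value Sampling) or gradient-based ones via the corresponding gradient attribution class, which is how the paper compares learned associations across methods.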
2401.08573 Benchmarking the Robustness of Image Watermarks Bang An, Mucong Ding, Tahseen Rabbani, Aakriti Agrawal, Yuancheng Xu, Chenghao Deng, Sicheng Zhu, Abdirisak Mohamed, Yuxin Wen, Tom Goldstein, Furong Huang This paper investigates the weaknesses of image watermarking techniques. We present WAVES (Watermark Analysis Via Enhanced Stress-testing), a novel benchmark for assessing watermark robustness, overcoming the limitations of current evaluation methods. WAVES integrates detection and identification tasks, and establishes a standardized evaluation protocol comprised of a diverse range of stress tests. The attacks in WAVES range from traditional image distortions to advanced and novel variations of diffusive and adversarial attacks. Our evaluation examines two pivotal dimensions: the degree of image quality degradation and the efficacy of watermark detection after attacks. We develop a series of Performance vs. Quality 2D plots, varying over several prominent image similarity metrics, which are then aggregated in a heuristically novel manner to paint an overall picture of watermark robustness and attack potency. Our comprehensive evaluation reveals previously undetected vulnerabilities of several modern watermarking algorithms. We envision WAVES as a toolkit for the future development of robust watermarking systems. The project is available at https://wavesbench.github.io/ The paper introduces WAVES, a novel benchmark for evaluating the robustness of image watermarking techniques, specifically focusing on their resistance to various attacks that aim to remove or obscure watermarks. This paper is important because it addresses the lack of standardized evaluation methods for image watermarking techniques, especially in the context of emerging threats like diffusion purification and adversarial attacks. It proposes a comprehensive benchmark with diverse attacks, standardized metrics, and a focus on real-world scenarios, contributing to the development of more robust watermarking systems. The authors conduct their research by developing a standardized evaluation protocol called WAVES. WAVES evaluates watermarking algorithms on three datasets (DiffusionDB, MS-COCO, and DALL·E3) using a wide range of 26 attacks categorized into distortions, regenerations, and adversarial attacks. It measures watermark detection performance using TPR@0.1%FPR and assesses image quality degradation using a normalized and aggregated metric combining 8 individual image quality metrics. The evaluation reveals varying vulnerabilities among watermarking methods. Tree-Ring is particularly vulnerable to adversarial attacks, especially grey-box embedding attacks and surrogate detector attacks, which can significantly reduce detection performance while preserving image quality. Stable Signature is susceptible to various regeneration attacks, while StegaStamp demonstrates greater robustness overall. The paper also highlights the risk of using publicly available VAEs in watermarking systems, making them susceptible to attacks. The authors acknowledge limitations in testing only three watermarking algorithms, albeit carefully chosen representatives. They also point out that the attack ranking methodology depends on selected performance thresholds and image quality metrics, suggesting further exploration with alternative metrics and thresholds as future work. Additionally, the paper encourages the development of watermark-specific defensive strategies and highlights the need for in-processing watermarks to adopt augmentation or adversarial training for enhanced robustness. diffusion_model, watermark, analysis, adversarial_attack, benchmark, image_quality, robustness
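For reference, the TPR@0.1%FPR metric used above is simple to compute: the detection threshold is chosen so that the false-positive rate on unwatermarked images is 0.1%, and the true-positive rate on watermarked images is reported at that threshold. A small helper (my own sketch, not WAVES code):

```python
import numpy as np

def tpr_at_fpr(scores_watermarked, scores_clean, fpr=1e-3):
    """True-positive rate at a fixed false-positive rate on clean (negative) images."""
    thresh = np.quantile(np.asarray(scores_clean), 1.0 - fpr)   # threshold set on negatives
    return float(np.mean(np.asarray(scores_watermarked) > thresh))
```

Reporting detection at such a strict false-positive budget is what makes the benchmark sensitive to attacks that only partially degrade the watermark signal.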
2404.02145 Iterated Learning Improves Compositionality in Large Vision-Language Models Chenhao Zheng, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, recent investigations find that most, if not all, of our state-of-the-art vision-language models struggle at compositionality. They are unable to distinguish between images of "a girl in white facing a man in black" and "a girl in black facing a man in white". Moreover, prior work suggests that compositionality doesn't arise with scale: larger model sizes or training data don't help. This paper develops a new iterated training algorithm that incentivizes compositionality. We draw on decades of cognitive science research that identifies cultural transmission-the need to teach a new generation-as a necessary inductive prior that incentivizes humans to develop compositional languages. Specifically, we reframe vision-language contrastive learning as the Lewis Signaling Game between a vision agent and a language agent, and operationalize cultural transmission by iteratively resetting one of the agent's weights during training. After every iteration, this training paradigm induces representations that become "easier to learn", a property of compositional languages: e.g. our model trained on CC3M and CC12M improves standard CLIP by 4.7% and 4.0%, respectively, on the SugarCrepe benchmark. This paper proposes a novel iterated learning algorithm for vision-language models to improve their compositionality by drawing inspiration from the cultural transmission theory in cognitive science, where languages evolve to be more compositional over generations. Despite the advancement in large vision-language models, existing models struggle with compositional understanding, limiting their ability to generalize and reason about novel situations. This paper addresses this issue with a novel training paradigm inspired by human language development, potentially paving the way for more robust and interpretable vision-language models. The authors reframe vision-language contrastive learning as a Lewis Signaling Game between a vision agent and a language agent. They introduce a shared codebook as the basis for the representation of both agents, and periodically reset the language agent's weights, mimicking cultural transmission across generations. This forces the vision agent to learn representations that are easier to learn by new language agents, thus improving compositionality. The proposed iterated learning algorithm demonstrably improves compositionality on several benchmarks, including SugarCrepe and CREPE, outperforming baseline models like standard CLIP and NegCLIP. Importantly, this improvement doesn't come at the cost of recognition capability, as shown by comparable performance on zero-shot image classification tasks. Further analysis suggests that iterated learning leads to smoother, easier-to-learn visual representations and a more interpretable codebook. The paper acknowledges the potential instability during training due to the randomness introduced by resetting agent weights. Future work could focus on stabilizing the learning process and exploring extensions to other domains beyond vision and language. vision_language_model, clip, compositionality, contrastive_learning, iterated_learning, lewis_signaling_game, analysis, interpretability
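The training paradigm above boils down to ordinary contrastive training with a periodic reset of the language agent. A hedged sketch of that loop (the reset period, optimizer, and module names are illustrative assumptions; the paper additionally routes both agents through a shared codebook, which is omitted here for brevity):

```python
import torch

def iterated_training(vision_agent, make_language_agent, loader, contrastive_loss,
                      generations=10, steps_per_generation=5000, lr=1e-4):
    language_agent = make_language_agent()                 # fresh "learner" agent
    for gen in range(generations):
        params = list(vision_agent.parameters()) + list(language_agent.parameters())
        opt = torch.optim.AdamW(params, lr=lr)
        for _, (images, texts) in zip(range(steps_per_generation), loader):
            loss = contrastive_loss(vision_agent(images), language_agent(texts))
            opt.zero_grad(); loss.backward(); opt.step()
        language_agent = make_language_agent()             # cultural transmission: reset weights
    return vision_agent, language_agent
```

Because each new language agent must relearn to decode the vision agent's messages, representations that are easy to re-learn (and hence more compositional) are implicitly favored.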
2312.02139 DiffiT: Diffusion Vision Transformers for Image Generation Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat Diffusion models with their powerful expressivity and high sample quality have achieved State-Of-The-Art (SOTA) performance in the generative domain. The pioneering Vision Transformer (ViT) has also demonstrated strong modeling capabilities and scalability, especially for recognition tasks. In this paper, we study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT). Specifically, we propose a methodology for finegrained control of the denoising process and introduce the Time-dependant Multihead Self Attention (TMSA) mechanism. DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency. We also propose latent and image space DiffiT models and show SOTA performance on a variety of class-conditional and unconditional synthesis tasks at different resolutions. The Latent DiffiT model achieves a new SOTA FID score of 1.73 on ImageNet-256 dataset while having 19.85%, 16.88% less parameters than other Transformer-based diffusion models such as MDT and DiT, respectively. Code: https://github.com/NVlabs/DiffiT This paper introduces DiffiT, a novel Vision Transformer (ViT)-based diffusion model designed for efficient and high-quality image generation in both latent and image spaces. The paper addresses limitations in existing CNN-based and ViT-based diffusion models by introducing Time-dependant Multihead Self Attention (TMSA), which significantly enhances parameter efficiency and enables fine-grained control over the denoising process for improved image fidelity and diversity. The authors propose a novel TMSA mechanism integrated into a U-shaped encoder-decoder architecture for image space generation and a purely ViT-based architecture for latent space generation. They train and evaluate DiffiT on diverse datasets, including ImageNet, FFHQ, and CIFAR10, and conduct thorough ablation studies to validate the effectiveness of TMSA and other architectural choices. DiffiT achieves state-of-the-art FID scores on ImageNet-256 with significantly fewer parameters compared to previous SOTA models like MDT and DiT. It also achieves competitive results on FFHQ-64 and CIFAR10, showcasing its ability to generate high-fidelity, diverse images across different datasets and resolutions. The paper acknowledges potential limitations in extending DiffiT to higher resolutions and exploring more complex image generation tasks. Future work could focus on optimizing the model for memory efficiency, leveraging larger datasets for training, and exploring applications in image editing, restoration, and text-to-image generation. diffusion_model, vit, image_generation, tmsa, self-attention, latent_space, image_space, fid, parameter_efficiency
2404.08636 Probing the 3D Awareness of Visual Foundation Models Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, Varun Jampani Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, their intermediate representations are useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure? In this work, we analyze the 3D awareness of visual foundation models. We posit that 3D awareness implies that representations (1) encode the 3D structure of the scene and (2) consistently represent the surface across views. We conduct a series of experiments using task-specific probes and zero-shot inference procedures on frozen features. Our experiments reveal several limitations of the current models. Our code and analysis can be found at https://github.com/mbanani/probe3d. This paper investigates the 3D awareness of visual foundation models, examining how well these models represent the 3D structure of scenes and objects from single and multiple views. This paper is important because it addresses the lack of understanding regarding how well visual foundation models, despite being trained on 2D data, represent the 3D world. This understanding is crucial as these models are increasingly used as backbones for 3D vision tasks. The authors evaluate a range of visual foundation models, including those trained with classification, language supervision, self-supervision, and dense supervision, on their ability to estimate depth, surface normals, and 3D correspondence. They probe the frozen representations of these models using task-specific probes and zero-shot inference methods to assess the inherent 3D awareness of the learned features. The analysis revealed that self-supervised models perform best in capturing surface properties like depth and normals, followed by text-conditioned generative models. However, all models struggled with multiview consistency, particularly at large viewpoint changes, indicating they might be learning view-dependent rather than truly 3D-consistent representations. Semantic correspondence performance was found to be more correlated with single-view tasks than multiview tasks, suggesting it might not be a reliable measure of 3D consistency. The paper acknowledges limitations including the use of publicly available checkpoints trained on different datasets and with varying compute resources, potentially confounding the results. They suggest future work should focus on more controlled experiments to isolate the impact of training signals and explore a broader range of 3D understanding aspects beyond surface reconstruction and multiview consistency. 3d, analysis, depth_estimation, surface_normal, correspondence, vision_transformer, diffusion_model, self_supervised_learning, vision_language_model
2405.05806 MasterWeaver: Taming Editability and Identity for Personalized Text-to-Image Generation Yuxiang Wei, Zhilong Ji, Jinfeng Bai, Hongzhi Zhang, Lei Zhang, Wangmeng Zuo Text-to-image (T2I) diffusion models have shown significant success in personalized text-to-image generation, which aims to generate novel images with human identities indicated by the reference images. Although promising identity fidelity has been achieved by several tuning-free methods, they usually suffer from overfitting issues. The learned identity tends to entangle with irrelevant information, resulting in unsatisfactory text controllability, especially on faces. In this work, we present MasterWeaver, a test-time tuning-free method designed to generate personalized images with both faithful identity fidelity and flexible editability. Specifically, MasterWeaver adopts an encoder to extract identity features and steers the image generation through additionally introduced cross attention. To improve editability while maintaining identity fidelity, we propose an editing direction loss for training, which aligns the editing directions of our MasterWeaver with those of the original T2I model. Additionally, a face-augmented dataset is constructed to facilitate disentangled identity learning and further improve editability. Extensive experiments demonstrate that our MasterWeaver can not only generate personalized images with faithful identity, but also exhibit superiority in text controllability. Our code will be publicly available at https://github.com/csyxwei/MasterWeaver. This paper introduces MasterWeaver, a novel method for personalized text-to-image generation that prioritizes both accurate identity representation and flexible image editing capabilities from a single reference image. This paper addresses the limitations of existing personalized text-to-image generation models, which often struggle to balance accurate identity preservation with flexible editing. MasterWeaver's ability to achieve both makes it a valuable tool for various applications, including personalized content creation. MasterWeaver leverages a pre-trained Stable Diffusion model and incorporates an identity mapping network to inject identity features into the image generation process. It introduces an editing direction loss to improve text controllability and utilizes a face-augmented dataset to disentangle identity features from attributes, enhancing editability. Experimental results demonstrate that MasterWeaver outperforms state-of-the-art methods in terms of identity fidelity, text alignment, and image quality. It produces high-quality personalized images with diverse attributes, clothing, backgrounds, and styles, even from a single reference image. The authors acknowledge limitations in generating images with multiple personalized identities and achieving precise control over fine-grained attributes. Future work will address these limitations and explore ethical considerations related to potential deepfake generation. diffusion_model, personalized_text-to_image_generation, identity_preservation, editability, face_editing, cross_attention
2402.10208 Recovering the Pre-Fine-Tuning Weights of Generative Models Eliahu Horwitz, Jonathan Kahana, Yedid Hoshen The dominant paradigm in generative modeling consists of two steps: i) pre-training on a large-scale but unsafe dataset, ii) aligning the pre-trained model with human values via fine-tuning. This practice is considered safe, as no current method can recover the unsafe, pre-fine-tuning model weights. In this paper, we demonstrate that this assumption is often false. Concretely, we present Spectral DeTuning, a method that can recover the weights of the pre-fine-tuning model using a few low-rank (LoRA) fine-tuned models. In contrast to previous attacks that attempt to recover pre-fine-tuning capabilities, our method aims to recover the exact pre-fine-tuning weights. Our approach exploits this new vulnerability against large-scale models such as a personalized Stable Diffusion and an aligned Mistral. This paper introduces the task of "Pre-Fine-Tuning Weight Recovery", a novel attack vector targeting fine-tuned models. It presents "Spectral DeTuning", an effective method for recovering the original weights of a pre-trained model using multiple LoRA fine-tuned versions. This paper highlights a critical vulnerability in the current paradigm of model fine-tuning, particularly relevant due to the increasing popularity of LoRA and multi-flavored foundational models. It demonstrates that widely used models like Mistral and Stable Diffusion are susceptible to this attack, potentially compromising safety and alignment efforts. The authors propose "Spectral DeTuning", an iterative, gradient-free algorithm leveraging low-rank matrix factorization to recover the pre-fine-tuning weights. They introduce a rank scheduler for enhanced optimization stability and faster convergence. They evaluate their method on a newly introduced benchmark "LoWRA Bench", comprising diverse models like ViT, Stable Diffusion, and Mistral, fine-tuned for various tasks. Spectral DeTuning successfully recovers pre-fine-tuning weights with high precision across different models and tasks. It outperforms baseline methods, achieving near-perfect semantic convergence for ViT and effectively reversing personalization in Stable Diffusion and alignment in Mistral, as demonstrated by semantic evaluation metrics. The rank scheduler significantly improves convergence speed and accuracy. The authors acknowledge limitations like the requirement of multiple LoRA models with a known, constant rank and the assumption of their public availability. Future work includes exploring attacks on models with varying LoRA ranks, extending the attack to other fine-tuning methods, and, most importantly, developing defenses against pre-fine-tuning weight recovery attacks. diffusion_model, llm, analysis, adversarial_attack, interpretability, lora, fine-tuning, model_security, weight_recovery
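A rough sketch of the kind of iterative, gradient-free low-rank factorization the Spectral DeTuning entry above describes, under my reading of the abstract: given several LoRA fine-tuned copies of the same weight matrix, alternate between fitting a rank-r residual for each copy and re-estimating the shared pre-fine-tuning matrix. The mean initialization and fixed iteration count are simplifications, and the paper's rank scheduler is omitted.

```python
import torch

def recover_pretrained_weight(finetuned_weights, rank, iters=100):
    # finetuned_weights: list of matrices W_i assumed to equal W + B_i A_i with a rank-r residual.
    W = torch.stack(finetuned_weights).mean(dim=0)          # initial guess for the shared W
    for _ in range(iters):
        centered = []
        for Wi in finetuned_weights:
            U, S, Vh = torch.linalg.svd(Wi - W, full_matrices=False)
            Mi = (U[:, :rank] * S[:rank]) @ Vh[:rank]        # best rank-r fit of the LoRA residual
            centered.append(Wi - Mi)                          # remove the estimated residual
        W = torch.stack(centered).mean(dim=0)                 # re-estimate the shared weight
    return W
```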
2403.14599 MyVLM: Personalizing VLMs for User-Specific Queries Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or Recent large-scale vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and generating textual descriptions for visual content. However, these models lack an understanding of user-specific concepts. In this work, we take a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts. For example, we explore whether these models can learn to recognize you in an image and communicate what you are doing, tailoring the model to reflect your personal experiences and relationships. To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model, enabling the VLM to identify the presence of specific target concepts in a given image. Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM. This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response. We apply our technique to BLIP-2 and LLaVA for personalized image captioning and further show its applicability for personalized visual question-answering. Our experiments demonstrate our ability to generalize to unseen images of learned concepts while preserving the model behavior on unrelated inputs. This paper introduces the concept of personalizing vision-language models (VLMs), enabling them to understand and reason about user-specific concepts, such as unique objects and individuals, with a focus on personalized image captioning and visual question answering. This paper is important because it addresses the limitation of current VLMs in understanding user-specific concepts and proposes a method for personalization, opening up new opportunities for more meaningful and personalized human-computer interaction. The authors propose MyVLM, a method that augments frozen VLMs (BLIP-2 and LLaVA) with concept heads to recognize user-specific concepts in images. It then learns a concept embedding in the VLM's feature space to guide the language model in incorporating the concept into generated responses, requiring only a few training images. MyVLM successfully generates personalized captions and answers questions about user-specific objects and individuals in new images, generalizing to unseen contexts. It outperforms several handcrafted baselines, showing improved recall and text similarity, even with few training samples. Limitations include biases inherent in VLMs, reliance on concept head quality, and potential context leakage during training. Future work includes mitigating these limitations, exploring additional regularization and augmentation techniques, and expanding to new personalized applications. diffusion_model, llm, analysis, personalization, image_captioning, visual_question_answering, referring_expression_comprehension
2405.07987 The Platonic Representation Hypothesis Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis. This paper proposes the Platonic Representation Hypothesis, which posits that neural networks, trained across various architectures and modalities, are converging towards a shared statistical representation of reality. This hypothesis is significant because it suggests that scaling data and model size could be sufficient for achieving highly generalizable AI systems capable of performing well across a wide range of tasks. It also offers insights into the potential for cross-modal synergy and a deeper understanding of how AI models represent the world. The authors provide evidence for their hypothesis by analyzing existing literature on representational similarity and conducting experiments measuring the alignment of vision and language models. They use techniques like model stitching, nearest-neighbor analysis, and compare representations across models trained on different datasets and with different objectives. Key findings include: (1) Models with higher performance on a variety of tasks exhibit greater representational alignment, suggesting convergence towards a common solution as competence increases. (2) Alignment is observed even across modalities, with larger language models exhibiting greater alignment with vision models. (3) Alignment with vision representations is correlated with better performance on language-based reasoning tasks, indicating the practical benefits of such convergence. The authors acknowledge limitations such as difficulty in measuring alignment and the possibility of modality-specific information hindering complete convergence. They suggest further research is needed to understand the precise representation being converged to, the role of non-bijective modalities, and the implications for special-purpose AI. representation, convergence, multimodality, vision, language, scaling, analysis, platonic_representation
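One common way to quantify the cross-model alignment discussed in the entry above is a mutual nearest-neighbor overlap score computed on the same datapoints under two models; the sketch below is a generic version of that idea and is not claimed to reproduce the paper's exact metric or hyperparameters.

```python
import torch
import torch.nn.functional as F

def mutual_knn_alignment(feats_a, feats_b, k=10):
    # feats_a, feats_b: (N, D_a) and (N, D_b) features of the same N inputs
    # from two different models; returns the average k-NN overlap in [0, 1].
    def knn(feats):
        feats = F.normalize(feats, dim=-1)
        sim = feats @ feats.T
        sim.fill_diagonal_(-float("inf"))            # exclude self-matches
        return sim.topk(k, dim=-1).indices

    nn_a, nn_b = knn(feats_a), knn(feats_b)
    overlap = sum(len(set(a.tolist()) & set(b.tolist())) for a, b in zip(nn_a, nn_b))
    return overlap / (k * feats_a.shape[0])
```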
2404.13040 Analysis of Classifier-Free Guidance Weight Schedulers Xi Wang, Nicolas Dufour, Nefeli Andreou, Marie-Paule Cani, Victoria Fernandez Abrevaya, David Picard, Vicky Kalogeiton Classifier-Free Guidance (CFG) enhances the quality and condition adherence of text-to-image diffusion models. It operates by combining the conditional and unconditional predictions using a fixed weight. However, recent works vary the weights throughout the diffusion process, reporting superior results but without providing any rationale or analysis. By conducting comprehensive experiments, this paper provides insights into CFG weight schedulers. Our findings suggest that simple, monotonically increasing weight schedulers consistently lead to improved performances, requiring merely a single line of code. In addition, more complex parametrized schedulers can be optimized for further improvement, but do not generalize across different models and tasks. This paper investigates the use of dynamic weight schedulers in Classifier-Free Guidance (CFG) for diffusion models, showing that these schedulers can improve image fidelity, diversity, and textual adherence compared to static CFG. This paper is important because it provides a comprehensive analysis of dynamic guidance weight schedulers in CFG, which is a widely used technique for conditional diffusion models. The findings provide practical guidance for practitioners to improve the performance of their diffusion models with simple modifications. The authors conducted experiments on various tasks, including class-conditioned image generation and text-to-image generation, using datasets like CIFAR-10, ImageNet, and LAION. They evaluated different heuristic and parameterized dynamic schedulers, comparing their performance against static CFG using metrics like FID, Inception Score, CLIP-Score, and diversity measures. They also performed a user study to assess the perceptual quality of generated images. Key findings include: (1) monotonically increasing weight schedulers (e.g., linear and cosine) consistently improve performance over static CFG; (2) a simple linear scheduler significantly enhances results without additional computational cost or parameter tuning; (3) parameterized schedulers can further improve performance but require tuning for each model and task. The authors acknowledge that the optimal parameters for parameterized schedulers do not generalize across different models and tasks. Future work could focus on developing more adaptable and robust parameterized schedulers. Another direction is to investigate the theoretical underpinnings of why dynamic schedulers work better than static CFG, leading to more principled design of these schedulers. diffusion_model, cfg, analysis, image_generation, text-to-image, fid, inception_score, clip-score, diversity
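A minimal sketch of the scheduling idea above, assuming a standard CFG sampling loop: instead of a fixed guidance weight, the weight is recomputed at every denoising step from a monotonically increasing schedule. The w_max value and the exact linear/cosine shapes are illustrative placeholders rather than the paper's tuned settings.

```python
import math

def cfg_weight(step, num_steps, w_max=10.0, schedule="linear"):
    # t runs from 0 at the start of sampling (pure noise) to 1 at the end, so an
    # increasing schedule applies weak guidance early and strong guidance late.
    t = step / max(num_steps - 1, 1)
    if schedule == "linear":
        return w_max * t
    if schedule == "cosine":
        return w_max * (1.0 - math.cos(math.pi * t)) / 2.0
    return w_max  # static CFG fallback

def guided_eps(eps_uncond, eps_cond, w):
    # Standard classifier-free guidance combination of the two noise predictions.
    return eps_uncond + w * (eps_cond - eps_uncond)
```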
2405.05538 A Survey on Personalized Content Synthesis with Diffusion Models Xulu Zhang, Xiao-Yong Wei, Wengyu Zhang, Jinlin Wu, Zhaoxiang Zhang, Zhen Lei, Qing Li Recent advancements in generative models have significantly impacted content creation, leading to the emergence of Personalized Content Synthesis (PCS). With a small set of user-provided examples, PCS aims to customize the subject of interest to specific user-defined prompts. Over the past two years, more than 150 methods have been proposed. However, existing surveys mainly focus on text-to-image generation, with few providing up-to-date summaries on PCS. This paper offers a comprehensive survey of PCS, with a particular focus on the diffusion models. Specifically, we introduce the generic frameworks of PCS research, which can be broadly classified into optimization-based and learning-based approaches. We further categorize and analyze these methodologies, discussing their strengths, limitations, and key techniques. Additionally, we delve into specialized tasks within the field, such as personalized object generation, face synthesis, and style personalization, highlighting their unique challenges and innovations. Despite encouraging progress, we also present an analysis of the challenges such as overfitting and the trade-off between subject fidelity and text alignment. Through this detailed overview and analysis, we propose future directions to advance the development of PCS. This paper presents a comprehensive survey of Personalized Content Synthesis (PCS) with diffusion models, focusing on techniques that enable the generation of customized images based on user-provided references and text prompts. This survey is important due to the rapid growth and significance of PCS in various applications, including content creation, digital marketing, and virtual reality. It provides a timely and comprehensive overview of this evolving field, analyzing different frameworks, specialized tasks, and future challenges. The paper categorizes PCS approaches into optimization-based and learning-based methods, analyzing their strengths and limitations. It reviews specialized tasks like object, style, and face personalization, highlighting key techniques like attention manipulation and mask-guided generation. The survey reveals significant progress in PCS, with methods achieving impressive results in generating personalized content. It identifies key techniques like attention-based operations, mask-guided generation, data augmentation, and regularization as crucial for improving PCS performance. The paper also provides a comparative analysis of different PCS methods and their performance on benchmark datasets. The paper identifies key challenges in PCS, including overfitting to limited references, balancing subject fidelity with text alignment, and the lack of standardized evaluation metrics and datasets. It suggests future research directions, such as exploring new architectures, training methodologies, and robust evaluation techniques to address these limitations. diffusion_model, personalized_content_synthesis, image_generation, optimization, learning_based, attention_mechanism, mask-guided, data_augmentation, regularization, object_generation, face_synthesis, style_personalization, video, 3d
2308.08428 ALIP: Adaptive Language-Image Pre-training with Synthetic Caption Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up the dataset with image-text pairs collected from the web. However, the presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning. To address this issue, we first utilize the OFA model to generate synthetic captions that focus on the image content. The generated captions contain complementary information that is beneficial for pre-training. Then, we propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions. As the core components of ALIP, the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) dynamically adjust the weights of samples and image-text/caption pairs during the training process. Meanwhile, the adaptive contrastive loss effectively reduces the impact of noisy data and enhances the efficiency of the pre-training data. We validate ALIP with experiments on different scales of models and pre-training datasets. Experimental results show that ALIP achieves state-of-the-art performance on multiple downstream tasks including zero-shot image-text retrieval and linear probe. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/ALIP. This paper presents ALIP (Adaptive Language-Image Pre-training), a novel method for improving vision-language pre-training by addressing the issue of noisy and mismatched image-text pairs in large web-crawled datasets. This paper is important because it tackles the critical challenge of data noise in large-scale vision-language pre-training, which can negatively impact model performance. ALIP offers a computationally efficient alternative to existing filtering or momentum-based methods by leveraging synthetic captions and a novel adaptive contrastive loss. The authors propose a bi-path model that leverages both raw text and synthetic captions generated by the OFA model. They introduce two key components: the Language Consistency Gate (LCG), which weighs samples based on the consistency between raw and synthetic captions, and the Description Consistency Gate (DCG), which weighs image-text pairs based on their alignment. These weights are then integrated into an adaptive contrastive loss function to guide training. ALIP achieves state-of-the-art performance on zero-shot image-text retrieval tasks, demonstrating significant improvements over previous methods. It also shows competitive results on linear probe evaluations, indicating its strong representation learning capabilities. However, it lags behind state-of-the-art in zero-shot classification tasks, suggesting that the coarse granularity of the synthetic captions might limit performance in fine-grained tasks. The authors acknowledge limitations in the granularity of synthetic captions, which might hinder performance on tasks requiring fine-grained understanding. Future work includes exploring higher-quality caption generation models and investigating techniques to incorporate hierarchical semantic information into ALIP. diffusion_model, llm, analysis, image-text retrieval, contrastive_learning, pre-training, noise_alleviation
2311.05020 First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models Naomi Saphra, Eve Fleisig, Kyunghyun Cho, Adam Lopez Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to our understanding of the field, what is left to do? Taking a historical lens, we look for guidance from the first era of LLMs, which began in 2005 with large $n$-gram models for machine translation (MT). We identify durable lessons from the first era, and more importantly, we identify evergreen problems where NLP researchers can continue to make meaningful contributions in areas where LLMs are ascendant. We argue that disparities in scale are transient and researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many applications; that meaningful realistic evaluation is still an open problem; and that there is still room for speculative approaches. This paper examines the "scale crisis" in NLP research, where the dominance of large language models (LLMs) trained on massive datasets challenges the relevance of research from smaller groups. By reflecting on the history of statistical machine translation (SMT) and its own era of LLMs, the authors argue that the current crisis is transient and propose research directions for meaningful contributions even in the age of massive models. The paper addresses the widespread anxiety among NLP researchers about the impact of LLMs on the field. It provides historical context and practical guidance for navigating the challenges and opportunities presented by the current research landscape. The authors analyze the trajectory of SMT, particularly the rise and fall of large n-gram models, drawing parallels to the current era of LLMs. They use this historical analysis to identify durable lessons and evergreen research problems relevant to the present situation. The paper highlights that scale disparities are often temporary, as demonstrated by the eventual accessibility of large-scale SMT systems in the past. It argues that data remains a significant bottleneck, especially for low-resource languages, and emphasizes the crucial need for improved evaluation metrics that accurately capture model performance beyond simple benchmarks. The paper acknowledges its limitations in predicting the future of NLP research. It suggests future work should focus on improving evaluation metrics, developing algorithms for future hardware, exploring new paradigms that might supersede current LLMs, and addressing ethical considerations related to data bias and human evaluation. llm, analysis, literature_review, machine_translation, evaluation, data_scarcity, hardware, future_work
2312.02142 Object Recognition as Next Token Prediction Kaiyu Yue, Bor-Chun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, Ser-Nam Lim We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder, incorporating two key features: modeling tokens from different labels to be independent, and treating image tokens as a prefix. This masking mechanism inspires an efficient method - one-shot sampling - to simultaneously sample tokens of multiple labels in parallel and rank generated labels by their probabilities during inference. To further enhance the efficiency, we propose a simple strategy to construct a compact decoder by simply discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp This paper presents a novel approach to object recognition by framing it as a next token prediction problem, utilizing a language decoder to auto-regressively predict object labels from image embeddings. The paper is significant because it offers an open-vocabulary object recognition method that eliminates the need for predefined object labels or descriptions, unlike traditional linear classifiers and contrastive frameworks. It proposes an efficient and innovative one-shot sampling method for parallel label generation and introduces a compact decoder for enhanced efficiency. The authors employ a pretrained CLIP image encoder to generate image embeddings and a truncated language decoder (derived from LLaMA) to predict labels auto-regressively. They introduce a non-causal attention mask to decouple tokens from different labels and treat image tokens as a prefix. The one-shot sampling method enables parallel label token generation, while the compact decoder enhances efficiency. The method is trained on large-scale image-caption pairs and evaluated using a semantic similarity-based metric. Key findings include the effectiveness of one-shot sampling for generating diverse labels in parallel, outperforming traditional greedy and beam search methods. The truncated language decoder achieves comparable performance to the full model while being significantly faster. The method surpasses existing open-vocabulary recognition approaches in recall and achieves competitive performance in precision, demonstrating its ability to generate highly relevant labels. The authors acknowledge limitations in training data quality and evaluation metrics. They suggest future work exploring methods to train models with fewer labels, refining the label definition, developing better evaluation metrics, and adapting the approach for fine-grained recognition tasks. object_recognition, next_token_prediction, language_decoder, auto-regressive, open_vocabulary, one-shot_sampling, truncated_language_model, llama, clip, semantic_similarity, efficiency
2404.19752 Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; 3) captioning, where an LLM generates the final caption by summarizing caption proposals and the fact check verification results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original and the reconstructed image generated by a text-to-image model using the caption; 3) human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-sourced captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size. This paper introduces VisualFactChecker (VFC), a training-free pipeline that leverages large language models (LLMs) and existing computer vision models to generate detailed and factually grounded captions for both 2D images and 3D objects. This work addresses limitations in current open-source image captioning models, which often produce captions that are too concise or contain hallucinations (i.e., descriptions of elements not present in the image). VFC utilizes a unique approach of fact-checking generated captions using object detection and visual question answering (VQA), resulting in higher fidelity and accuracy compared to existing open-sourced models. VFC operates through a three-step process: 1) **Proposal:** Multiple image captioning models generate initial captions. 2) **Verification:** An LLM uses object detection or VQA models to verify elements described in the captions. 3) **Captioning:** The LLM summarizes the initial captions and verification results to produce a final, factually grounded caption. The authors evaluate VFC on COCO (2D images) and Objaverse (3D objects) datasets using CLIP-Score, a novel CLIP-Image-Score, human evaluation via AMT, and GPT-4V for fine-grained analysis. VFC outperforms state-of-the-art open-source captioning methods in both 2D and 3D captioning tasks. Notably, it achieves performance comparable to proprietary models like GPT-4V despite being significantly smaller. The novel CLIP-Image-Score, introduced in this work, demonstrates effectiveness in detecting hallucinations by comparing original images with those reconstructed from generated captions. The authors acknowledge that the current implementation of VFC could be more automated in deciding which components to utilize for specific scenarios. Future work aims to address this limitation and explore the inclusion of additional components for fact-checking to further improve caption accuracy and detail. diffusion_model, llm, captioning, 2d, 3d, hallucination, vqa, object_detection, analysis
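A schematic sketch of the propose / verify / summarize flow described in the VFC entry above. Every callable here (captioner_models, detector, vqa_model, llm) is a hypothetical interface standing in for the open-source components the pipeline composes; this is not the released VFC implementation.

```python
def fact_checked_caption(image, captioner_models, detector, vqa_model, llm):
    # 1) Proposal: several captioning models each propose an initial caption.
    proposals = [captioner(image) for captioner in captioner_models]

    # 2) Verification: the LLM extracts claimed objects and checks them with
    #    detection and VQA tools.
    checks = []
    for caption in proposals:
        objects = llm(f"List the objects mentioned in this caption: {caption}").split(",")
        for obj in (o.strip() for o in objects):
            checks.append({
                "object": obj,
                "detected": detector(image, obj),
                "vqa": vqa_model(image, f"Is there {obj} in the image?"),
            })

    # 3) Captioning: the LLM summarizes proposals plus verification results.
    prompt = (f"Proposed captions: {proposals}\n"
              f"Fact-check results: {checks}\n"
              "Write one detailed caption that keeps only verified content.")
    return llm(prompt)
```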
2308.16512 MVDream: Multi-view Diffusion for 3D Generation Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, Xiao Yang We introduce MVDream, a diffusion model that is able to generate consistent multi-view images from a given text prompt. Learning from both 2D and 3D data, a multi-view diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings. We demonstrate that such a multi-view diffusion model is implicitly a generalizable 3D prior agnostic to 3D representations. It can be applied to 3D generation via Score Distillation Sampling, significantly enhancing the consistency and stability of existing 2D-lifting methods. It can also learn new concepts from a few 2D examples, akin to DreamBooth, but for 3D generation. This paper introduces MVDream, a multi-view diffusion model that addresses the multi-view consistency issues in text-to-3D generation by leveraging large-scale 2D image datasets and 3D data for improved generalizability and consistency in generated 3D models. This work is important because it presents a novel approach to address the long-standing challenge of multi-view consistency in text-to-3D generation, which is crucial for creating high-quality and realistic 3D content. MVDream's ability to leverage pre-trained 2D diffusion models and adapt them for multi-view consistency opens new avenues for efficient and robust 3D content creation. The authors propose a multi-view diffusion model that incorporates 3D self-attention and camera embeddings into a pre-trained 2D diffusion model. They train this model on a combination of 3D rendered data and a large-scale text-to-image dataset. For 3D generation, they employ score distillation sampling (SDS), utilizing their multi-view diffusion model as a prior. They further introduce a multi-view DreamBooth technique for personalized 3D generation. MVDream demonstrates superior multi-view consistency and overall quality in generated 3D models compared to existing state-of-the-art methods. Notably, it mitigates the Janus problem (multi-face issue) commonly observed in other approaches. User studies confirm the improved robustness and quality of MVDream's generated 3D assets. Furthermore, the model exhibits good generalization ability, effectively generating 3D content from unseen prompts and in diverse styles. The authors acknowledge limitations such as the current model's lower resolution compared to some existing models and the potential for bias inherited from the base Stable Diffusion model. They suggest addressing these limitations by increasing the dataset size, incorporating larger base diffusion models (e.g., SDXL), and utilizing more diverse and realistic 3D rendering datasets. Future work may explore extensions for handling a larger number of non-orthogonal camera views, improving the generalizability further. diffusion_model, 3d, text-to-3d, multi-view, consistency, dreambooth, score distillation sampling, nerf, generative_model
2404.01231 Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, Nicholas Carlini It is commonplace to produce application-specific models by fine-tuning large pre-trained models using a small bespoke dataset. The widespread availability of foundation model checkpoints on the web poses considerable risks, including the vulnerability to backdoor attacks. In this paper, we unveil a new vulnerability: the privacy backdoor attack. This black-box privacy attack aims to amplify the privacy leakage that arises when fine-tuning a model: when a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model. We conduct extensive experiments on various datasets and models, including both vision-language models (CLIP) and large language models, demonstrating the broad applicability and effectiveness of such an attack. Additionally, we carry out multiple ablation studies with different fine-tuning methods and inference strategies to thoroughly analyze this new threat. Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models. This paper introduces a new privacy backdoor attack that amplifies membership inference attacks by poisoning pre-trained models, making it easier to extract information about the training data used for fine-tuning. This paper is important as it highlights a significant privacy vulnerability in the current machine learning paradigm of using open-source pre-trained models. It demonstrates that an adversary can poison these models to leak private information from the fine-tuning datasets, raising serious concerns about data security and the trustworthiness of pre-trained models. The authors poison the model weights to either maximize or minimize the loss on target data points during pre-training. This creates an anomaly in the loss, making it easier to distinguish data used in fine-tuning. They test their attack on various models, including CLIP for vision tasks and GPT-Neo and ClinicalBERT for language tasks, using different datasets and evaluating the effectiveness under different fine-tuning methods and inference strategies. The attack significantly improves the success rate of membership inference attacks, increasing the true positive rate while maintaining a low false positive rate. The attack is effective across different models, fine-tuning methods, and inference strategies, highlighting its robustness and broad applicability. Interestingly, the attack also amplifies privacy leakage for non-target data points from the same distribution. The paper also finds that larger models are more vulnerable to this attack. The paper acknowledges limitations regarding the attack's sensitivity to the number of fine-tuning steps and the trade-off between model stealthiness and attack performance. Future work includes exploring more advanced poisoning techniques and defenses against this attack, such as robust fine-tuning methods and more rigorous validation of pre-trained models. privacy, backdoor_attack, membership_inference, poisoning, pre-trained_model, fine-tuning, clip, llm, analysis
2311.12229 NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation Shachar Rosenman, Vasudev Lal, Phillip Howard Despite impressive recent advances in text-to-image diffusion models, obtaining high-quality images often requires prompt engineering by humans who have developed expertise in using them. In this work, we present NeuroPrompts, an adaptive framework that automatically enhances a user's prompt to improve the quality of generations produced by text-to-image models. Our framework utilizes constrained text decoding with a pre-trained language model that has been adapted to generate prompts similar to those produced by human prompt engineers. This approach enables higher-quality text-to-image generations and provides user control over stylistic features via constraint set specification. We demonstrate the utility of our framework by creating an interactive application for prompt enhancement and image generation using Stable Diffusion. Additionally, we conduct experiments utilizing a large dataset of human-engineered prompts for text-to-image generation and show that our approach automatically produces enhanced prompts that result in superior image quality. We make our code and a screencast video demo of NeuroPrompts publicly available. This paper introduces NeuroPrompts, a novel framework designed to automatically enhance user-provided prompts for text-to-image generation models, leading to higher-quality and more aesthetically pleasing image outputs. This paper is significant because it addresses the challenge of prompt engineering in text-to-image generation, making these powerful models more accessible to users without specialized expertise by automating the process of crafting effective prompts. The authors developed NeuroPrompts, which uses a two-stage approach: 1) Adapting a pre-trained language model (LM) to generate text similar to human prompt engineers through supervised fine-tuning and reinforcement learning with a reward model based on predicted human preferences (PickScore). 2) Employing NeuroLogic Decoding, a constrained text decoding algorithm, to generate enhanced prompts that satisfy user-specified constraints for style, artist, format, etc., while adhering to the learned prompting style. The authors demonstrated that NeuroPrompts consistently generates higher-quality images than un-optimized prompts and even surpasses human-authored prompts in terms of aesthetic scores. They also found that both PPO training and constrained decoding with NeuroLogic contribute to the improved performance of the framework. The authors acknowledge limitations in evaluating NeuroPrompts solely with Stable Diffusion and recognize the potential for societal biases inherited from the base model. Future work could focus on extending NeuroPrompts to video generation models and other domains requiring automated prompt engineering. diffusion_model, prompt_engineering, text-to-image, image_generation, aesthetic_quality, constrained_decoding, reinforcement_learning, ppo, neurologic, stable_diffusion, pickscore
2311.18608 Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing Hyelin Nam, Gihyun Kwon, Geon Yeong Park, Jong Chul Ye With the remarkable advent of text-to-image diffusion models, image editing methods have become more diverse and continue to evolve. A promising recent approach in this realm is Delta Denoising Score (DDS), an image editing technique based on the Score Distillation Sampling (SDS) framework that leverages the rich generative prior of text-to-image diffusion models. However, relying solely on the difference between scoring functions is insufficient for preserving specific structural elements from the original image, a crucial aspect of image editing. To address this, here we present an embarrassingly simple yet very powerful modification of DDS, called Contrastive Denoising Score (CDS), for latent diffusion models (LDM). Inspired by the similarities and differences between DDS and contrastive learning for unpaired image-to-image translation (CUT), we introduce a straightforward approach using CUT loss within the DDS framework. Rather than employing auxiliary networks as in the original CUT approach, we leverage the intermediate features of the LDM, specifically those from the self-attention layers, which possess rich spatial information. Our approach enables zero-shot image-to-image translation and neural radiance field (NeRF) editing, achieving structural correspondence between the input and output while maintaining content controllability. Qualitative results and comparisons demonstrate the effectiveness of our proposed method. Project page: https://hyelinnam.github.io/CDS/ This paper introduces Contrastive Denoising Score (CDS), a novel text-guided image editing technique for latent diffusion models that improves upon Delta Denoising Score (DDS) by incorporating a contrastive loss inspired by Contrastive Unpaired Translation (CUT). This paper addresses the limitation of DDS in preserving structural details during text-guided image editing. By integrating CUT loss into the DDS framework, CDS enables more effective preservation of source image structure while aligning with target text prompts, leading to improved image editing quality. The authors propose to extract intermediate features from the self-attention layers of the latent diffusion model and use them to calculate the CUT loss. This loss is then incorporated into the DDS framework to guide the image generation process towards better structural consistency. The authors demonstrate the effectiveness of their approach through qualitative and quantitative experiments on various text-driven image editing tasks, including comparisons with state-of-the-art methods. They also show the extensibility of CDS to other domains like Neural Radiance Fields (NeRF). CDS outperforms existing state-of-the-art methods in text-guided image editing by effectively regulating structural consistency while aligning with target text prompts. It achieves a better balance between preserving structural details and transforming content compared to DDS and other baselines. Furthermore, CDS demonstrates successful application in Neural Radiance Fields editing, highlighting its extensibility. The authors acknowledge limitations in cases of unfavorable random patch selections or unconventional object poses. Future work may explore strategies to address these limitations. Additionally, the ethical implications of image manipulation techniques like CDS are acknowledged, emphasizing the need for responsible use and regulation to prevent misuse. diffusion_model, image_editing, text-guided_synthesis, contrastive_learning, structure_preservation, latent_diffusion_model, nerf, zero-shot, unsupervised_learning
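A hedged sketch of the core modification in the CDS entry above: a CUT-style patch contrastive (InfoNCE) loss computed directly on intermediate self-attention features of the source and edited denoising branches, with no auxiliary encoder. The patch sampling, temperature, and tensor layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def patchwise_contrastive_loss(feat_src, feat_edit, num_patches=256, tau=0.07):
    # feat_src, feat_edit: (B, N, C) self-attention features from the source and
    # edited branches. Same spatial location = positive pair; other sampled
    # locations = negatives.
    B, N, C = feat_src.shape
    num_patches = min(num_patches, N)
    idx = torch.randperm(N)[:num_patches]
    q = F.normalize(feat_edit[:, idx], dim=-1)     # queries from the edited branch
    k = F.normalize(feat_src[:, idx], dim=-1)      # keys from the source branch
    logits = torch.einsum("bnc,bmc->bnm", q, k) / tau
    labels = torch.arange(num_patches, device=logits.device).expand(B, -1)
    return F.cross_entropy(logits.reshape(-1, num_patches), labels.reshape(-1))
```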
2402.17113 Transparent Image Layer Diffusion using Latent Transparency Lvmin Zhang, Maneesh Agrawala We present LayerDiffuse, an approach enabling large-scale pretrained latent diffusion models to generate transparent images. The method allows generation of single transparent images or of multiple transparent layers. The method learns a "latent transparency" that encodes alpha channel transparency into the latent manifold of a pretrained latent diffusion model. It preserves the production-ready quality of the large diffusion model by regulating the added transparency as a latent offset with minimal changes to the original latent distribution of the pretrained model. In this way, any latent diffusion model can be converted into a transparent image generator by finetuning it with the adjusted latent space. We train the model with 1M transparent image layer pairs collected using a human-in-the-loop collection scheme. We show that latent transparency can be applied to different open source image generators, or be adapted to various conditional control systems to achieve applications like foreground/background-conditioned layer generation, joint layer generation, structural control of layer contents, etc. A user study finds that in most cases (97%) users prefer our natively generated transparent content over previous ad-hoc solutions such as generating and then matting. Users also report the quality of our generated transparent images is comparable to real commercial transparent assets like Adobe Stock. This paper introduces LayerDiffuse, a novel approach that enables large-scale pretrained latent diffusion models to generate transparent images, either as single entities or multiple transparent layers, by encoding transparency as a latent offset in the model's latent space. This paper is significant because it addresses the lack of research in generating transparent images and layered content despite its high demand in visual content editing. It achieves this by tackling the challenges of limited training data and the sensitivity of pretrained diffusion models to alterations in their latent space representation. The authors develop 'latent transparency,' a method that encodes alpha channel transparency into the latent space of a pretrained diffusion model (Stable Diffusion) without disrupting its latent distribution. They train their model using a human-in-the-loop scheme to collect a dataset of 1 million transparent image layer pairs, using GPT models to generate diverse and semantically related prompts for foreground and background layers. LayerDiffuse successfully generates high-quality transparent images and layers, as demonstrated through qualitative results and a user study. Users significantly preferred LayerDiffuse's native transparency over conventional generation-then-matting methods, with its quality being comparable to commercial transparent image assets. The authors acknowledge a limitation in balancing the generation of 'clean transparent elements' and their 'harmonious blending,' particularly when dealing with reusable elements devoid of specific illumination effects. They suggest exploring improved methods for harmonious blending as future work. diffusion_model, transparent_image_generation, layered_content_generation, latent_space, human-in-the-loop, image_synthesis
2310.05654 No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling Xuwei Xu, Changlin Li, Yudong Chen, Xiaojun Chang, Jiajun Liu, Sen Wang Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks, yet their high computational complexity prevents their deployment in computing resource-constrained environments. Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs by dynamically dropping image tokens. However, some undesirable pruning at early stages may result in permanent loss of image information in subsequent layers, consequently hindering model performance. To address this problem, we propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency. Specifically, in each layer, IdleViT selects a subset of the image tokens to participate in computations while keeping the rest of the tokens idle and directly passing them to this layer's output. By allowing the idle tokens to be re-selected in the following layers, IdleViT mitigates the negative impact of improper pruning in the early stages. Furthermore, inspired by the normalized graph cut, we devise a token cut loss on the attention map as regularization to improve IdleViT's token selection ability. Our method is simple yet effective and can be extended to pyramid ViTs since no token is completely dropped. Extensive experimental results on various ViT architectures have shown that IdleViT can diminish the complexity of pretrained ViTs by up to 33% with no more than 0.2% accuracy decrease on ImageNet, after finetuning for only 30 epochs. Notably, when the keep ratio is 0.5, IdleViT outperforms the state-of-the-art EViT on DeiT-S by 0.5% higher accuracy and even faster inference speed. The source code is available in the supplementary material. This paper introduces IdleViT, a novel approach for enhancing the efficiency of Vision Transformers (ViTs) by dynamically idling tokens during inference. This paper is important because it addresses the computational cost of ViTs, especially for resource-constrained applications, by dynamically selecting informative tokens and idling others, leading to improved inference speed without significant accuracy degradation. The authors propose IdleViT, which leverages a lightweight prediction head to identify and idle less informative tokens at each layer. This is done by training the model with a keep ratio, controlling the number of active tokens. They evaluate IdleViT on ImageNet using DeiT and LV-ViT architectures and compare it to other efficient ViT models. IdleViT achieves significant speed improvements (up to 52%) compared to the full models with minimal accuracy loss (less than 0.3%). It outperforms other efficient ViT and convolutional models on the trade-off between accuracy and computational complexity. Limitations are not explicitly mentioned in the provided text. However, possible future work could involve exploring different prediction head architectures or investigating the generalization of IdleViT to other downstream tasks beyond image classification. vit, token_pruning, token_idling, efficiency, image_classification, analysis
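A simplified sketch of the token-idling idea in the IdleViT entry above: only the highest-scoring tokens pass through a layer, while the rest bypass it unchanged and remain available for re-selection later. How the per-token scores are produced (the paper's selection mechanism and token cut loss) is omitted, so scores and block are assumed inputs here.

```python
import torch

def idle_layer(tokens, scores, block, keep_ratio=0.5):
    # tokens: (B, N, C); scores: (B, N) importance scores; block: callable
    # mapping (B, K, C) -> (B, K, C), e.g. a transformer layer.
    B, N, C = tokens.shape
    k = max(1, int(N * keep_ratio))
    keep_idx = scores.topk(k, dim=1).indices                         # selected token indices
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, C)
    selected = torch.gather(tokens, 1, gather_idx)                   # compute only on these
    updated = block(selected)
    out = tokens.clone()                                             # idle tokens pass through
    out.scatter_(1, gather_idx, updated)
    return out
```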
2404.03631 Robust Concept Erasure Using Task Vectors Minh Pham, Kelly O. Marshall, Chinmay Hegde, Niv Cohen With the rapid growth of text-to-image models, a variety of techniques have been suggested to prevent undesirable image generations. Yet, these methods often only protect against specific user prompts and have been shown to allow unsafe generations with other inputs. Here we focus on unconditionally erasing a concept from a text-to-image model rather than conditioning the erasure on the user's prompt. We first show that compared to input-dependent erasure methods, concept erasure that uses Task Vectors (TV) is more robust to unexpected user inputs, not seen during training. However, TV-based erasure can also affect the core performance of the edited model, particularly when the required edit strength is unknown. To this end, we propose a method called Diverse Inversion, which we use to estimate the required strength of the TV edit. Diverse Inversion finds within the model input space a large set of word embeddings, each of which induces the generation of the target concept. We find that encouraging diversity in the set makes our estimation more robust to unexpected prompts. Finally, we show that Diverse Inversion enables us to apply a TV edit only to a subset of the model weights, enhancing the erasure capabilities while better maintaining the core functionality of the model. This paper proposes a novel method for removing unsafe concepts from text-to-image models using Task Vectors (TV) in a way that is independent of specific user prompts, making it more robust than existing input-dependent concept erasure methods. The paper addresses the critical challenge of preventing the generation of undesirable content from text-to-image models, a growing concern as these models become increasingly powerful. It highlights the limitations of existing concept erasure techniques that primarily focus on specific user prompts and demonstrates the vulnerability of such approaches to adversarial attacks. The proposed method offers a more robust solution by aiming for unconditional concept erasure. The authors propose a three-part method: (1) Diverse Inversion: This technique finds a diverse set of token embeddings that can generate the unsafe concept, enabling a more comprehensive evaluation of the model's safety. (2) TV Edit Strength Tuning: Using the diverse set of adversarial prompts, the authors determine an optimal edit strength for the TV that effectively suppresses unsafe generation while preserving the model's utility on unrelated tasks. (3) TV Weight Sub-selection: The authors explore pruning specific layers of the TV weights to further enhance the trade-off between concept erasure and model performance. The paper demonstrates that TV-based concept erasure is more resistant to adversarial attacks compared to existing methods, showing robustness against techniques like Concept Inversion and Ring-A-Bell. The proposed Diverse Inversion method proves effective in finding a wide range of adversarial prompts, allowing for better estimation of the TV edit strength. Additionally, the authors show that sub-selecting TV weights can lead to a better balance between concept erasure and preserving the model's functionality on unrelated tasks. The paper acknowledges limitations such as the lack of provable guarantees for erasure against unknown future adversarial methods and the dependence on the Diverse Inversion set for hyperparameter tuning. Future work could focus on exploring the application of TV-based erasure for more fine-grained concept removal and extending the approach to other modalities like language models. diffusion_model, gan, adversarial_attack, interpretability, text-to-image, concept_erasure, safety
2405.11473 FIFO-Diffusion: Generating Infinite Videos from Text without Training Jihwan Kim, Junoh Kang, Jinyoung Choi, Bohyung Han We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue; our method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail. However, diagonal denoising is a double-edged sword as the frames near the tail can take advantage of cleaner ones by forward reference but such a strategy induces the discrepancy between training and inference. Hence, we introduce latent partitioning to reduce the training-inference gap and lookahead denoising to leverage the benefit of forward referencing. We have demonstrated the promising results and effectiveness of the proposed methods on existing text-to-video generation baselines. This paper presents FIFO-Diffusion, a novel inference technique based on pretrained diffusion models for generating arbitrarily long text-conditional videos without additional training. This paper is significant because it addresses the limitations of existing long video generation methods that suffer from temporal inconsistency or high computational cost, enabling the generation of high-quality, coherent videos of any length using only a pretrained model. The authors introduce diagonal denoising, a process that concurrently handles multiple frames with increasing noise levels within a queue. To mitigate the training-inference discrepancy introduced by diagonal denoising, they further propose latent partitioning and lookahead denoising, which refine the noise level differences and improve denoising accuracy, respectively. FIFO-Diffusion demonstrates impressive results in generating extremely long videos (over 10,000 frames) with consistent quality and smooth motion, outperforming existing methods like FreeNoise and Gen-L-Video. It also showcases the ability to seamlessly transition between multiple prompts, enabling the creation of diverse and engaging video content. The authors acknowledge the remaining training-inference gap due to the alteration of input distribution caused by diagonal denoising. Future work includes integrating the diagonal denoising paradigm into the training process to further improve performance and reduce this gap. diffusion_model, video, text-to-video, long_video_generation, analysis
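A conceptual sketch of the diagonal denoising loop in the FIFO-Diffusion entry above, assuming a hypothetical denoiser(latents, timesteps, prompt) that advances every queued frame by one step at its own noise level; latent partitioning and lookahead denoising from the paper are omitted for brevity.

```python
import torch
from collections import deque

def fifo_generate(denoiser, prompt, num_frames, timesteps, frame_shape):
    # The queue holds len(timesteps) latent frames whose noise levels increase
    # from head (almost clean) to tail (pure noise). Each iteration denoises all
    # frames by one step, pops the now-clean head, and enqueues fresh noise.
    queue = deque(torch.randn(frame_shape) for _ in timesteps)
    video = []
    while len(video) < num_frames:
        latents = torch.stack(list(queue))                # (len(timesteps), ...)
        latents = denoiser(latents, timesteps, prompt)    # one denoising step per frame
        queue = deque(latents.unbind(0))
        video.append(queue.popleft())                     # head frame is fully denoised
        queue.append(torch.randn(frame_shape))            # new noise at the tail
    return torch.stack(video)
```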
2404.07449 Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework. This paper introduces a novel framework called Locate-Anything to enhance the spatial awareness of Visual-LLMs (V-LLMs) by incorporating textual image-space coordinates into both the input prompts and the LLM-generated outputs. This research is important because it addresses a critical limitation of current V-LLMs: their weak spatial reasoning and localization abilities. By improving the spatial awareness of V-LLMs, this work enables more comprehensive visual understanding and opens up new possibilities for vision-language tasks. The authors propose three novel instruction fine-tuning objectives that leverage textual coordinate representations: Location Prediction, Negative Prediction, and Reverse-Location Prediction. They explore different coordinate representation schemes and introduce pseudo-data generation strategies to enhance data efficiency and extend the framework to video domains. The proposed Locate-Anything model demonstrates significant improvements in spatial reasoning, outperforming existing V-LLMs in tasks like distinguishing object positions. It achieves state-of-the-art results on Image VQA, Video VQA, and Region Description benchmarks while effectively reducing object hallucination. The paper identifies limitations in understanding temporal locations for video-based tasks, suggesting future work on incorporating time coordinates. Additionally, potential biases within training datasets are acknowledged, highlighting the need for careful consideration during model deployment. diffusion_model, llm, analysis, video, vqa, interpretability
2405.01536 Customizing Text-to-Image Models with a Single Image Pair Maxwell Jones, Sheng-Yu Wang, Nupur Kumari, David Bau, Jun-Yan Zhu Art reinterpretation is the practice of creating a variation of a reference work, making a paired artwork that exhibits a distinct artistic style. We ask if such an image pair can be used to customize a generative model to capture the demonstrated stylistic difference. We propose Pair Customization, a new customization method that learns stylistic difference from a single image pair and then applies the acquired style to the generation process. Unlike existing methods that learn to mimic a single concept from a collection of images, our method captures the stylistic difference between paired images. This allows us to apply a stylistic change without overfitting to the specific image content in the examples. To address this new task, we employ a joint optimization method that explicitly separates the style and content into distinct LoRA weight spaces. We optimize these style and content weights to reproduce the style and content images while encouraging their orthogonality. During inference, we modify the diffusion process via a new style guidance based on our learned weights. Both qualitative and quantitative experiments show that our method can effectively learn style while avoiding overfitting to image content, highlighting the potential of modeling such stylistic differences from a single image pair. This paper introduces a novel method, Paired Customization, for customizing text-to-image models using a single image pair to learn stylistic differences. This paper is significant because it addresses the limitations of existing model customization techniques that often overfit to content when learning styles from single or few-shot image examples. By using an image pair, the method can better disentangle style from content, enabling more effective and generalizable style transfer. The authors propose a joint optimization method using separate LoRA weights for style and content. Content LoRA reconstructs the content image, while style LoRA learns the stylistic difference between the pair. They further enforce orthogonality between style and content LoRA parameters for better disentanglement. At inference, they introduce 'style guidance', integrating style LoRA predictions into the denoising process for improved style control and content preservation. The proposed method demonstrates superior performance in capturing and applying stylistic differences compared to existing baselines. It effectively preserves the structure of the input content while applying the learned style, as demonstrated through quantitative metrics like perceptual distance and a human preference study. The paper acknowledges limitations in handling significantly different categories from the training pair and computational demands of test-time optimization. Future work could explore encoder-based approaches for faster customization and improving style transfer across broader categories. diffusion_model, gan, customization, style_transfer, image_generation, lora, orthogonal, disentanglement, style_guidance
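The core of the training objective can be sketched as two denoising reconstruction terms (content LoRA alone on the content image, content plus style LoRA on the stylized image) plus a penalty that pushes the style and content LoRA factors toward orthogonality. The snippet below is a rough sketch under the assumption that corresponding LoRA factors share shapes; the exact losses and the style-guidance inference step in the paper differ in detail.

```python
import torch
import torch.nn.functional as F

def orthogonality_penalty(style_lora_params, content_lora_params):
    """Penalize inner products between corresponding style/content LoRA factors so the
    two updates occupy (approximately) orthogonal directions. Assumes matching shapes."""
    return sum((s.flatten() @ c.flatten()).abs()
               for s, c in zip(style_lora_params, content_lora_params))

def pair_customization_loss(pred_content, target_content,   # denoising pred/target, content LoRA only
                            pred_styled, target_styled,     # denoising pred/target, content + style LoRA
                            style_lora_params, content_lora_params, lam=0.01):
    recon = F.mse_loss(pred_content, target_content) + F.mse_loss(pred_styled, target_styled)
    return recon + lam * orthogonality_penalty(style_lora_params, content_lora_params)
```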
2402.15179 Advancing Parameter Efficiency in Fine-tuning via Representation Editing Muling Wu, Wenhao Liu, Xiaohua Wang, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang Parameter Efficient Fine-Tuning (PEFT) has gained significant attention for its ability to achieve competitive results while updating only a small subset of trainable parameters. Despite the promising performance of current PEFT methods, they present challenges in hyperparameter selection, such as determining the rank of LoRA or Adapter, or specifying the length of soft prompts. In addressing these challenges, we propose a novel approach to fine-tuning neural models, termed Representation EDiting (RED), which scales and biases the representation produced at each layer. RED substantially reduces the number of trainable parameters by a factor of $25,700$ compared to full parameter fine-tuning, and by a factor of $32$ compared to LoRA. Remarkably, RED achieves comparable or superior results to full parameter fine-tuning and other PEFT methods. Extensive experiments were conducted across models of varying architectures and scales, including RoBERTa, GPT-2, T5, and Llama-2, and the results demonstrate the efficiency and efficacy of RED, positioning it as a promising PEFT approach for large neural models. This paper introduces Representation EDiting (RED), a novel parameter-efficient fine-tuning (PEFT) method that scales and biases representations at each layer of a pre-trained language model to adapt it to downstream tasks. The paper addresses the limitations of existing PEFT methods in terms of hyperparameter selection and parameter efficiency. It proposes RED as a more efficient and effective alternative to fine-tune large language models, reducing the number of trainable parameters significantly while achieving comparable or superior performance. The authors evaluate RED on a variety of language models (RoBERTa, GPT-2, T5, Llama-2) and NLP tasks (GLUE benchmark, E2E NLG Challenge, UltraFeedback, Open LLM Leaderboard, AlpacaEval, MT-Bench). They compare RED against several baselines, including full fine-tuning, Adapter, LoRA, BitFit, and Prompt Tuning. Ablation studies were conducted to analyze the impact of different components of RED, such as the type and position of 'edit vectors'. RED consistently achieves comparable or better performance than other PEFT methods while using significantly fewer trainable parameters. For instance, RED requires 25,700 times fewer parameters than full fine-tuning and 32 times fewer than LoRA on Llama-2 7B while achieving comparable or even better results across different benchmarks. Ablation studies show that both scaling and bias vectors contribute to RED's performance, and editing representations after the FFN sub-layer is the most effective strategy. The authors acknowledge that RED's application in other modalities like computer vision and speech recognition needs further investigation. They plan to explore RED in few-shot learning scenarios to enhance its data efficiency. peft, fine-tuning, representation_learning, language_model, efficiency, llama-2, gpt-2, roberta, t5, nlu, nlg
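Because RED only learns a per-layer scale and bias on frozen hidden states, the whole method fits in a tiny module. The sketch below assumes the edit is applied after each FFN sub-layer, as the paper's ablations favor; integration points and initialization details are otherwise illustrative.

```python
import torch
import torch.nn as nn

class RepresentationEdit(nn.Module):
    """RED-style edit: a learned scale-and-bias on a frozen layer's output."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(hidden_size))   # identity at initialization
        self.bias = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states * self.scale + self.bias

# Usage sketch: h = ffn(h); h = edit_vectors[layer_idx](h)
# Only the edit vectors (2 * hidden_size parameters per edited layer) are trained;
# all pre-trained weights stay frozen.
```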
2405.03150 Video Diffusion Models: A Survey Andrew Melnik, Michal Ljubljanac, Cong Lu, Qi Yan, Weiming Ren, Helge Ritter Diffusion generative models have recently become a robust technique for producing and modifying coherent, high-quality video. This survey offers a systematic overview of critical elements of diffusion models for video generation, covering applications, architectural choices, and the modeling of temporal dynamics. Recent advancements in the field are summarized and grouped into development trends. The survey concludes with an overview of remaining challenges and an outlook on the future of the field. Website: https://github.com/ndrwmlnk/Awesome-Video-Diffusion-Models This paper presents a comprehensive survey of diffusion models for video generation, focusing on their applications, architectures, methods for modeling temporal dynamics, and training procedures. This survey is important due to the rapid progress and transformative potential of diffusion models in video generation. It provides a valuable resource for researchers and practitioners by summarizing key advancements, identifying trends, and highlighting remaining challenges in the field. The authors conduct a systematic literature review, analyzing and categorizing existing research on video diffusion models based on various criteria. They provide a taxonomy of applications, discuss architectural choices, and delve into methods for modeling temporal dynamics. The authors also review training strategies and evaluation metrics commonly employed in this domain. Key findings include the increasing utilization of latent diffusion models for efficient, high-resolution video generation, the dominance of UNet architectures with modifications for temporal consistency, and the prevalence of pre-trained text-to-image models as backbones for video generation and editing. The survey also highlights the challenges posed by limited labeled video data and the need for better representation of temporal dependencies in videos. The authors identify several limitations and avenues for future work, including the need for larger, accurately labeled video datasets, improved methods for representing complex temporal relationships in videos, and exploration of alternative architectures capable of handling long-term temporal dependencies more effectively. Furthermore, the authors suggest exploring real-time video-to-video translation and more sophisticated video description methods beyond simple text labels. diffusion_model, video, generation, editing, survey, temporal_dynamics, latent_diffusion_model, unet, attention_mechanism, transformer
2402.11131 Speculative Streaming: Fast LLM Inference without Auxiliary Models Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8 - 3.1X in a diverse set of tasks, such as Summarization, Structured Queries, and Meaning Representation, without sacrificing generation quality. Additionally, Speculative Streaming is parameter-efficient. It achieves on-par/higher speed-ups than Medusa-style architectures while using ~10000X fewer extra parameters, making it well-suited for resource-constrained devices. This paper introduces Speculative Streaming, a single-model speculative decoding approach that accelerates large language model inference by fusing drafting into the target model, changing the objective from next-token to future n-gram prediction. This work is important because it addresses the limitations of traditional speculative decoding methods that rely on separate, resource-intensive draft models, thereby simplifying deployment and improving efficiency for large language model inference, especially on resource-constrained devices. The authors introduce multi-stream attention into the target model for n-gram prediction, enabling parallel speculation and verification of candidate tokens within a single forward pass. They utilize tree-structured drafting for efficient exploration of candidate sequences and employ a pruning strategy based on transition probabilities to manage computational cost. Speculative Streaming achieves 1.8-3.1X speedup across tasks like summarization, structured queries, and meaning representation without sacrificing generation quality. It also demonstrates comparable or superior performance to Medusa, a recent block-wise decoding model, while using significantly fewer parameters, making it ideal for resource-constrained devices. The authors acknowledge that the current implementation uses a "hard" matching criterion for draft verification and suggest exploring "soft" matching for potential speedup gains. Future work may involve investigating alternative stream initialization techniques beyond the explored value rotation and dedicated embeddings. llm, diffusion_model, analysis, speculative_decoding, inference, resource_constrained
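Speculative Streaming's multi-stream drafting is model-specific, but the verification step it shares with other speculative schemes is simple to sketch: the target model scores all drafted positions in one forward pass and the longest agreeing prefix is accepted. The snippet below shows only that generic greedy acceptance check, with hypothetical tensor shapes, not the paper's n-gram prediction heads or tree-structured drafting.

```python
import torch

@torch.no_grad()
def verify_draft(target_logits, draft_tokens):
    """Greedy acceptance: `target_logits` holds the target model's logits for the k drafted
    positions plus the next one (k+1 rows from a single forward pass); accept the longest
    prefix of `draft_tokens` that matches the target's argmax, plus one bonus token."""
    target_choice = target_logits.argmax(dim=-1)                    # (k+1,)
    agree = (target_choice[:-1] == draft_tokens).long().cumprod(0)  # 1 while the prefix agrees
    n_accept = int(agree.sum())
    return torch.cat([draft_tokens[:n_accept],
                      target_choice[n_accept : n_accept + 1]])      # bonus token from the target
```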
2311.12832 Toward effective protection against diffusion based mimicry through score distillation Haotian Xue, Chumeng Liang, Xiaoyu Wu, Yongxin Chen While generative diffusion models excel in producing high-quality images, they can also be misused to mimic authorized images, posing a significant threat to AI systems. Efforts have been made to add calibrated perturbations to protect images from diffusion-based mimicry pipelines. However, most of the existing methods are too ineffective and even impractical to be used by individual users due to their high computation and memory requirements. In this work, we present novel findings on attacking latent diffusion models (LDM) and propose new plug-and-play strategies for more effective protection. In particular, we explore the bottleneck in attacking an LDM, discovering that the encoder module rather than the denoiser module is the vulnerable point. Based on this insight, we present our strategy using Score Distillation Sampling (SDS) to double the speed of protection and reduce memory occupation by half without compromising its strength. Additionally, we provide a robust protection strategy by counterintuitively minimizing the semantic loss, which can assist in generating more natural perturbations. Finally, we conduct extensive experiments to substantiate our findings and comprehensively evaluate our newly proposed strategies. We hope our insights and protective measures can contribute to better defense against malicious diffusion-based mimicry, advancing the development of secure AI systems. The code is available in https://github.com/xavihart/Diff-Protect This paper investigates the vulnerability of Latent Diffusion Models (LDMs) to adversarial attacks, particularly in the context of protecting images from unauthorized mimicry. The paper is important because it addresses the growing concern of malicious use of LDMs for creating unauthorized digital replicas, and it proposes more efficient and effective methods for protecting images from such misuse. The authors analyze the bottleneck in attacking LDMs, revealing the encoder as the vulnerable component. They introduce Score Distillation Sampling (SDS) to accelerate protection, explore the effectiveness of minimizing semantic loss, and conduct extensive experiments on various mimicry scenarios (SDEdit, inpainting, textual inversion) to evaluate their proposed strategies. Key findings include: (1) The encoder of an LDM is significantly more vulnerable to attacks than the denoiser module. (2) Minimizing semantic loss can be an effective protection strategy, producing more natural perturbations compared to maximizing it. (3) SDS accelerates protection by 50% without sacrificing effectiveness. (4) The proposed strategies outperform existing methods in terms of protection strength, perturbation naturalness, and computational efficiency. The paper mainly focuses on LDMs and future work could explore attacks on pixel-based diffusion models. Additionally, investigating the robustness of the proposed protections against various defense methods is crucial for real-world deployment. diffusion_model, ldm, adversarial_attack, image_protection, mimicry, sds, semantic_loss
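The SDS speed-up comes from never backpropagating through the U-Net: the noise-prediction residual is used directly as the gradient on the latent, so only the small VAE encoder sits in the autograd graph. The sketch below illustrates that mechanic with standard diffusers-style calls; the latent scaling factor, the L_inf projection of the perturbation, and the paper's minimize-semantic-loss variant are all omitted, and the function names are illustrative.

```python
import torch

def sds_protection_step(unet, vae, text_emb, x_adv, scheduler, t, step_size=2 / 255):
    """One perturbation step using the SDS shortcut: query the U-Net without autograd and
    use its noise-prediction residual directly as dL/dz, so only the VAE encoder is
    backpropagated through."""
    x_adv = x_adv.detach().clone().requires_grad_(True)
    z = vae.encode(x_adv).latent_dist.mean               # encoder stays in the autograd graph
    with torch.no_grad():
        eps = torch.randn_like(z)
        z_t = scheduler.add_noise(z, eps, t)             # forward-diffuse to timestep t
        eps_pred = unet(z_t, t, encoder_hidden_states=text_emb).sample
    z.backward(gradient=(eps_pred - eps))                # SDS: treat the residual as dL/dz
    return (x_adv + step_size * x_adv.grad.sign()).detach()   # signed step on the attack objective
```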
2405.01356 Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance Kelvin C. K. Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, Huisheng Wang In subject-driven text-to-image synthesis, the synthesis process tends to be heavily influenced by the reference images provided by users, often overlooking crucial attributes detailed in the text prompt. In this work, we propose Subject-Agnostic Guidance (SAG), a simple yet effective solution to remedy the problem. We show that through constructing a subject-agnostic condition and applying our proposed dual classifier-free guidance, one could obtain outputs consistent with both the given subject and input text prompts. We validate the efficacy of our approach through both optimization-based and encoder-based methods. Additionally, we demonstrate its applicability in second-order customization methods, where an encoder-based model is fine-tuned with DreamBooth. Our approach is conceptually simple and requires only minimal code modifications, but leads to substantial quality improvements, as evidenced by our evaluations and user studies. This paper presents Subject-Agnostic Guidance (SAG), a method for subject-driven text-to-image synthesis that addresses the issue of models overlooking text prompts in favor of matching subject images by balancing subject fidelity with adherence to text descriptions. This paper is important because it tackles the problem of "content ignorance" in subject-driven text-to-image synthesis, where models often prioritize mimicking the subject image over following the text prompt. The proposed SAG method offers a simple yet effective solution to improve text alignment without sacrificing subject fidelity, thereby enhancing the quality and diversity of generated images. The authors propose Subject-Agnostic Guidance (SAG) which constructs a subject-agnostic embedding from the user input and utilizes a dual classifier-free guidance (DCFG) technique. DCFG leverages both the subject-aware and subject-agnostic embeddings to guide the generation process towards a more balanced output. The method is validated by applying it to various existing synthesis approaches including optimization-based and encoder-based methods, as well as in second-order customization using DreamBooth. The paper demonstrates that SAG effectively improves text alignment in generated images while maintaining high subject fidelity. Evaluations using CLIP and DINO scores show improvements in both text and subject similarity. User studies also confirm the effectiveness of SAG, with a majority of users preferring the generated results over existing methods like DreamBooth, Textual Inversion, and ELITE. The authors acknowledge that the quality of outputs still relies on the underlying generative model and may be suboptimal for complex or uncommon content. Future work could explore incorporating more robust synthesis networks. Additionally, they emphasize the ethical implications of such technology, particularly its potential for misuse. Future research should address these concerns by developing detection mechanisms to prevent the spread of misinformation. diffusion_model, text-to-image, image_synthesis, subject-driven, classifier-free_guidance, dreambooth, textual_inversion
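Dual classifier-free guidance can be written as a nested extrapolation over two conditions. The snippet below shows one plausible formulation (unconditional → subject-agnostic → subject-aware), with `unet_eps` as a hypothetical noise-prediction callable and illustrative weights; the exact weighting used by SAG may differ.

```python
import torch

def dual_cfg_noise(unet_eps, z_t, t, e_uncond, e_agnostic, e_aware,
                   w_text=7.5, w_subject=3.0):
    """Nested classifier-free guidance over two conditions: push from unconditional toward
    the subject-agnostic prompt, then from subject-agnostic toward the subject-aware one."""
    eps_u = unet_eps(z_t, t, e_uncond)     # unconditional prediction
    eps_g = unet_eps(z_t, t, e_agnostic)   # prompt with the subject made generic
    eps_s = unet_eps(z_t, t, e_aware)      # full subject-aware conditioning
    return eps_u + w_text * (eps_g - eps_u) + w_subject * (eps_s - eps_g)
```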
2404.03913 Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models Gihyun Kwon, Simon Jenni, Dingzeyu Li, Joon-Young Lee, Jong Chul Ye, Fabian Caba Heilbron While there has been significant progress in customizing text-to-image generation models, generating images that combine multiple personalized concepts remains challenging. In this work, we introduce Concept Weaver, a method for composing customized text-to-image diffusion models at inference time. Specifically, the method breaks the process into two steps: creating a template image aligned with the semantics of input prompts, and then personalizing the template using a concept fusion strategy. The fusion strategy incorporates the appearance of the target concepts into the template image while retaining its structural details. The results indicate that our method can generate multiple custom concepts with higher identity fidelity compared to alternative approaches. Furthermore, the method is shown to seamlessly handle more than two concepts and closely follow the semantic meaning of the input prompt without blending appearances across different subjects. This paper introduces Concept Weaver, a novel method for generating images with multiple customized concepts by combining personalized text-to-image diffusion models at inference time using a template image and a concept fusion strategy. This paper addresses the challenge of generating images with multiple personalized concepts, which is important for enabling more creative and diverse content creation using text-to-image generation models. Concept Weaver offers advantages over previous approaches by improving concept fidelity, handling more concepts, and closely following the semantics of input prompts. Concept Weaver involves five steps: (1) fine-tuning a pre-trained text-to-image model for each target concept, (2) generating a non-personalized template image, (3) extracting latent representations from the template image, (4) identifying regions corresponding to target concepts in the template image, and (5) fusing the latent representations, targeted regions, and personalized models to reconstruct the template image with the desired concepts. Concept Weaver demonstrates superior performance in generating multiple custom concepts with higher fidelity than baseline methods. It effectively handles more than two concepts, preserves the appearance of semantically related concepts without blending, and achieves high CLIP scores, indicating better text-image alignment. Furthermore, it's flexible enough to be used with both full fine-tuning and Low-Rank adaptation strategies. The paper mentions limitations in generating images from extremely complex or unrealistic text prompts due to limitations in the pre-trained Stable Diffusion model. Future work could focus on addressing this by using improved diffusion model backbones. Additionally, ethical concerns regarding the potential misuse of the technology for generating privacy-sensitive content are acknowledged, suggesting a need for appropriate content filtering systems. diffusion_model, text-to-image, image_generation, multi-concept, personalization, concept fusion, lora, clip
2404.05717 SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing Jing Gu, Yilin Wang, Nanxuan Zhao, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Xin Eric Wang Effective editing of personal content holds a pivotal role in enabling individuals to express their creativity, weaving captivating narratives within their visual stories, and elevate the overall quality and impact of their visual content. Therefore, in this work, we introduce SwapAnything, a novel framework that can swap any objects in an image with personalized concepts given by the reference, while keeping the context unchanged. Compared with existing methods for personalized subject swapping, SwapAnything has three unique advantages: (1) precise control of arbitrary objects and parts rather than the main subject, (2) more faithful preservation of context pixels, (3) better adaptation of the personalized concept to the image. First, we propose targeted variable swapping to apply region control over latent feature maps and swap masked variables for faithful context preservation and initial semantic concept swapping. Then, we introduce appearance adaptation, to seamlessly adapt the semantic concept into the original image in terms of target location, shape, style, and content during the image generation process. Extensive results on both human and automatic evaluation demonstrate significant improvements of our approach over baseline methods on personalized swapping. Furthermore, SwapAnything shows its precise and faithful swapping abilities across single object, multiple objects, partial object, and cross-domain swapping tasks. SwapAnything also achieves great performance on text-based swapping and tasks beyond swapping such as object insertion. This paper introduces SwapAnything, a novel framework for personalized object swapping in images using pre-trained diffusion models, enabling precise replacement of arbitrary objects with personalized concepts while preserving the background context. This paper is important as it addresses limitations in existing personalized image editing techniques, enabling precise and localized swapping of arbitrary objects while maintaining stylistic consistency and preserving background context, with potential applications in e-commerce, entertainment, and professional editing. The authors propose the SwapAnything framework, which leverages pre-trained diffusion models. They introduce 'targeted variable swapping' for precise object replacement and 'appearance adaptation' to seamlessly integrate the new object into the source image's style, scale, and content, ensuring a cohesive visual result. SwapAnything demonstrates superior performance in personalized object swapping tasks, including single-object, multi-object, partial-object, and cross-domain swapping, as evidenced by human and automatic evaluations. It outperforms baselines in preserving background context, accurately swapping object identities, and maintaining overall image quality. Furthermore, SwapAnything exhibits promising results in text-based swapping and object insertion tasks. The authors acknowledge limitations in reconstructing intricate details within the masked area and handling objects with high degrees of freedom. Future work will focus on addressing these limitations by incorporating explicit alignment mechanisms and extending the framework to 3D/video object swapping. diffusion_model, image_editing, object_swapping, personalized_editing, appearance_adaptation, context_preservation, text-based_editing
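At its core, targeted variable swapping is a masked blend between the inverted source latents and the personalized model's latents at each denoising step. The sketch below shows only that blend with hypothetical names; the paper additionally swaps selected attention/feature variables and applies appearance adaptation, which are not shown.

```python
import torch

def targeted_variable_swap(z_source_t, z_personal_t, object_mask):
    """Masked latent blend at one denoising step: keep source latents outside the target
    region (faithful context) and the personalized model's latents inside it (concept swap).
    `object_mask` is 1 inside the object region, resized to the latent resolution."""
    return object_mask * z_personal_t + (1.0 - object_mask) * z_source_t
```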
2310.13267 On the Language Encoder of Contrastive Cross-modal Models Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding training affect language encoder quality and cross-modal task performance. In VL pretraining, we found that sentence embedding training enhances language encoder quality and aids in cross-modal tasks, improving contrastive VL models such as CyCLIP. In contrast, AL pretraining benefits less from sentence embedding training, which may result from the limited amount of pretraining data. We analyze the representation spaces to understand the strengths of sentence embedding training, and find that it improves text-space uniformity, at the cost of decreased cross-modal alignment. This paper investigates the impact of incorporating sentence embedding training, both unsupervised and supervised, during the pretraining of contrastive cross-modal models like CLIP and CLAP for vision-language (VL) and audio-language (AL) tasks. This paper addresses the crucial need to improve the language understanding capabilities of cross-modal models, especially as these models are increasingly being pretrained on massive datasets. By focusing on enhancing the language encoder through sentence embedding training, the authors aim to boost the performance of these models on a variety of tasks. The authors pretrain VL and AL models with different combinations of training objectives, including cross-modal contrastive loss, cyclic losses for cross-modal and in-modal consistency, and unsupervised/supervised sentence embedding losses. They evaluate the pretrained models on tasks like zero-shot image/audio classification, image-text/audio-text retrieval, and SentEval benchmark. Additionally, they analyze the representation spaces of the trained models in terms of alignment and uniformity. The results show that unsupervised sentence embedding training generally improves both the language encoder quality and the performance on VL tasks, leading to a better CyCLIP model. However, the benefits are less pronounced and noisier in AL pretraining, possibly due to the limited size of AL datasets and the use of pretrained encoders. The analysis of representation spaces reveals that sentence embedding training enhances the uniformity of the text representation space, but at the cost of slightly decreased cross-modal alignment. The authors acknowledge limitations in terms of modality scope (excluding music), the use of pretrained encoders for AL pretraining, and the lack of extensive prompt engineering for audio. Future work could address these limitations by incorporating the music modality, exploring pretraining strategies that adapt language encoders to the audio domain, and investigating prompt engineering techniques specifically for audio-language tasks. diffusion_model, analysis, llm, audio, video
2401.17879 AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error Jonas Ricker, Denis Lukovnikov, Asja Fischer With recent text-to-image models, anyone can generate deceptively realistic images with arbitrary contents, fueling the growing threat of visual disinformation. A key enabler for generating high-resolution images with low computational cost has been the development of latent diffusion models (LDMs). In contrast to conventional diffusion models, LDMs perform the denoising process in the low-dimensional latent space of a pre-trained autoencoder (AE) instead of the high-dimensional image space. Despite their relevance, the forensic analysis of LDMs is still in its infancy. In this work we propose AEROBLADE, a novel detection method which exploits an inherent component of LDMs: the AE used to transform images between image and latent space. We find that generated images can be more accurately reconstructed by the AE than real images, allowing for a simple detection approach based on the reconstruction error. Most importantly, our method is easy to implement and does not require any training, yet nearly matches the performance of detectors that rely on extensive training. We empirically demonstrate that AEROBLADE is effective against state-of-the-art LDMs, including Stable Diffusion and Midjourney. Beyond detection, our approach allows for the qualitative analysis of images, which can be leveraged for identifying inpainted regions. We release our code and data at https://github.com/jonasricker/aeroblade . This paper introduces AEROBLADE, a novel method for detecting images generated by Latent Diffusion Models (LDMs) by exploiting the reconstruction error of the autoencoder (AE) used in the LDM pipeline. The paper addresses the growing threat of visual disinformation fueled by the increasing realism and accessibility of AI-generated images. AEROBLADE provides a simple, training-free, and effective method for detecting these images, which is crucial for combating misinformation. The authors leverage the observation that LDMs' AEs reconstruct generated images more accurately than real images. They calculate the reconstruction error between an input image and its reconstruction after passing through the LDM's AE. By comparing the error against a threshold, AEROBLADE can determine if an image is real or generated. The authors evaluate AEROBLADE on a dataset of images generated by various LDMs and compare its performance against existing detection methods. AEROBLADE achieves high detection accuracy (average precision of 0.992) on a dataset of images generated by state-of-the-art LDMs, including Stable Diffusion and Midjourney, even without access to the generator's specific AE. The method's performance is comparable to deep learning-based detectors that require extensive training. Additionally, the authors demonstrate that AEROBLADE can be used for qualitative image analysis, such as identifying inpainted regions in real images. The authors acknowledge that AEROBLADE's performance is best when the specific AE of the LDM used for generation is known. Future work includes exploring the use of more robust distance metrics and training a classifier on top of the reconstruction errors to enhance robustness against image perturbations. Additionally, they aim to investigate the potential of using reconstruction errors for precise localization of inpainted regions. 
diffusion_model, ldm, analysis, adversarial_attack, interpretability, detection, disinformation, autoencoder
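Because detection only needs the autoencoder, an AEROBLADE-style check is a few lines. The sketch below uses a publicly available SD VAE and the `lpips` package as the perceptual distance; the model id, threshold, and single-autoencoder setup are illustrative (the paper takes the minimum error over several candidate autoencoders).

```python
import torch, lpips
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()  # example AE
dist = lpips.LPIPS(net="vgg").eval()                                     # perceptual distance

@torch.no_grad()
def reconstruction_error(x):              # x: (B, 3, H, W) in [-1, 1]
    z = vae.encode(x).latent_dist.mode()  # project into the LDM's latent space
    x_rec = vae.decode(z).sample          # decode straight back (no diffusion involved)
    return dist(x, x_rec).flatten()

def looks_generated(x, threshold=0.02):   # threshold illustrative, calibrated on real images
    # LDM outputs reconstruct almost perfectly through their own AE; real photos do not.
    return reconstruction_error(x) < threshold
```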
2404.07993 Connecting NeRFs, Images, and Text Francesco Ballerini, Pierluigi Zama Ramirez, Roberto Mirabella, Samuele Salti, Luigi Di Stefano Neural Radiance Fields (NeRFs) have emerged as a standard framework for representing 3D scenes and objects, introducing a novel data type for information exchange and storage. Concurrently, significant progress has been made in multimodal representation learning for text and image data. This paper explores a novel research direction that aims to connect the NeRF modality with other modalities, similar to established methodologies for images and text. To this end, we propose a simple framework that exploits pre-trained models for NeRF representations alongside multimodal models for text and image processing. Our framework learns a bidirectional mapping between NeRF embeddings and those obtained from corresponding images and text. This mapping unlocks several novel and useful applications, including NeRF zero-shot classification and NeRF retrieval from images or text. This paper introduces a novel framework connecting Neural Radiance Fields (NeRFs) with other modalities like text and images, enabling applications such as zero-shot NeRF classification and NeRF retrieval from images or text. This research is significant as it explores NeRFs as a data format and bridges the gap between NeRFs and existing multimodal representation learning techniques for images and text, opening up new possibilities for 3D scene understanding and interaction. The authors propose a framework that leverages pre-trained models like CLIP for multimodal embeddings and NF2Vec for NeRF embeddings. They train two MLPs to learn bidirectional mappings between these embedding spaces, enabling the connection between NeRFs, images, and text. The framework achieves promising results on tasks like zero-shot NeRF classification, outperforming baselines relying on rendered images. It also demonstrates strong performance in NeRF retrieval from both images and text, highlighting the effectiveness of the learned mappings. Notably, the authors propose an adaptation technique using ControlNet to improve performance on real images when trained solely on synthetic data. The paper acknowledges limitations regarding the current focus on synthetic objects due to the NF2Vec encoder's training data and the generation capabilities being restricted by the NF2Vec decoder. Future work aims to extend the framework to real-world scenes and objects, explore larger datasets, and investigate joint training of encoders for a shared latent space. nerf, diffusion_model, gan, analysis, 3d, multimodal, retrieval, zero-shot, representation_learning
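The bidirectional mapping is just a pair of small MLPs trained on paired NeRF/CLIP embeddings. The sketch below uses illustrative dimensions and a cosine objective; the actual architecture and loss in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_mapper(d_in, d_out, d_hidden=1024):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_out))

nerf_dim, clip_dim = 1024, 768               # illustrative embedding sizes
nerf2clip = make_mapper(nerf_dim, clip_dim)  # NeRF embedding -> CLIP space
clip2nerf = make_mapper(clip_dim, nerf_dim)  # CLIP embedding -> NeRF space

def mapping_loss(nerf_emb, clip_emb):        # paired embeddings of the same object
    l_fwd = 1 - F.cosine_similarity(nerf2clip(nerf_emb), clip_emb).mean()
    l_bwd = 1 - F.cosine_similarity(clip2nerf(clip_emb), nerf_emb).mean()
    return l_fwd + l_bwd

# Zero-shot NeRF classification then reduces to mapping a NeRF embedding into CLIP space
# and comparing it against CLIP text embeddings of the candidate class names.
```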
2404.05384 Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance Dazhong Shen, Guanglu Song, Zeyue Xue, Fu-Yun Wang, Yu Liu Classifier-Free Guidance (CFG) has been widely used in text-to-image diffusion models, where the CFG scale is introduced to control the strength of text guidance on the whole image space. However, we argue that a global CFG scale results in spatial inconsistency on varying semantic strengths and suboptimal image quality. To address this problem, we present a novel approach, Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance degrees for different semantic units in text-to-image diffusion models. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. In particular, the cross-attention map in the denoising U-net backbone is renormalized for assigning each patch to the corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic units, we adaptively adjust the CFG scales across different semantic regions to rescale the text guidance degrees into a uniform level. Finally, extensive experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models, without requiring any extra training cost. our codes are available at https://github.com/SmilesDZgk/S-CFG. This paper introduces Semantic-aware Classifier-Free Guidance (S-CFG), a novel approach for enhancing text-to-image diffusion models by dynamically adjusting guidance degrees for different semantic regions within an image during the denoising process. The paper addresses the limitations of the conventional global CFG scale, which often leads to spatial inconsistencies in image quality and varying semantic strengths. By customizing guidance for different semantic units, S-CFG aims to improve the overall image quality and better align the generation with text prompts. The authors propose a two-step method: 1) Segmenting the latent image into semantic regions using a training-free method based on cross-attention and self-attention maps from the U-net backbone. 2) Adaptively adjusting CFG scales for each region to unify the classifier score norm, thereby balancing the amplification of various semantic units. Experiments on different diffusion models (Stable Diffusion v1.5/v2.1, DeepFloyd IF) demonstrate that S-CFG consistently outperforms the original CFG method in terms of FID-30K and CLIP Score. Qualitative results showcase notable improvements in semantic expressiveness, entity portrayal, and fine-grained details. Ablation studies highlight the effectiveness of key components like self-attention-based segmentation completion and foreground region benchmarking. The paper acknowledges that the assumption of independence among semantic units might not always hold true. Future work could explore more sophisticated methods for modeling interdependencies between regions. Further investigation into the impact of different benchmark regions and the generalizability of S-CFG to other diffusion models and downstream tasks is also suggested. diffusion_model, text-to-image, cfg, semantic_segmentation, attention_map, image_quality
2310.00426 PixArt-$α$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-$\alpha$'s training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-$\alpha$ only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control. We hope PIXART-$\alpha$ will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch. This paper introduces a novel Transformer-based text-to-image diffusion model called PixArt-$\alpha$, which achieves high-quality image generation comparable to state-of-the-art models while significantly reducing training costs and CO2 emissions. This work is important because it addresses the high training costs and environmental impact associated with advanced T2I models, hindering innovation and accessibility in the AIGC community. The authors propose three core designs: (1) decomposing the training strategy into pixel dependency learning, text-image alignment learning, and high-resolution aesthetic image generation; (2) developing an efficient T2I Transformer based on DiT with cross-attention and a streamlined class-condition branch; and (3) utilizing high-informative data from SAM with dense pseudo-captions generated by LLaVA. Key findings include achieving a COCO FID score of 7.32 with only 12% of Stable Diffusion v1.5's training time, outperforming other models in user studies for image quality and alignment, and demonstrating superior performance in compositionality on T2I-CompBench. Limitations include challenges in accurately controlling the number of generated targets, handling specific details like human hands, and limited text generation capabilities. Future work involves addressing these limitations and exploring personalized extensions. diffusion_model, t2i, transformer, image_generation, efficient_training, llava, sam, controlnet, dreambooth
2311.13833 Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models Saman Motamed, Danda Pani Paudel, Luc Van Gool Diffusion models have revolutionized generative content creation and text-to-image (T2I) diffusion models in particular have increased the creative freedom of users by allowing scene synthesis using natural language. T2I models excel at synthesizing concepts such as nouns, appearances, and styles. To enable customized content creation based on a few example images of a concept, methods such as Textual Inversion and DreamBooth invert the desired concept and enable synthesizing it in new scenes. However, inverting more general concepts that go beyond object appearance and style (adjectives and verbs) through natural language, remains a challenge. Two key characteristics of these concepts contribute to the limitations of current inversion methods. 1) Adjectives and verbs are entangled with nouns (subject) and can hinder appearance-based inversion methods, where the subject appearance leaks into the concept embedding and 2) describing such concepts often extends beyond single word embeddings (being frozen in ice, walking on a tightrope, etc.) that current methods do not handle. In this study, we introduce Lego, a textual inversion method designed to invert subject entangled concepts from a few example images. Lego disentangles concepts from their associated subjects using a simple yet effective Subject Separation step and employs a Context Loss that guides the inversion of single/multi-embedding concepts. In a thorough user study, Lego-generated concepts were preferred over 70% of the time when compared to the baseline. Additionally, visual question answering using a large language model suggested Lego-generated concepts are better aligned with the text description of the concept. This paper introduces Lego, a novel textual inversion method designed to disentangle and invert general concepts (adjectives and verbs) from a few example images in text-to-image diffusion models. This work addresses the limitations of existing text-to-image models in synthesizing complex concepts that go beyond object appearance. It is significant because it enables greater user control over image generation by allowing the inversion of subject-entangled concepts, such as melting or walking, which were previously challenging for traditional inversion methods. Lego builds upon Textual Inversion (TI) by adding two key components: 1) Subject Separation, which uses a dedicated embedding to isolate the subject's appearance from the concept, preventing feature leakage. 2) Contrastive Context Guidance, which utilizes an InfoNCE-based loss to guide the learning of multiple embeddings representing the concept by steering them towards synonyms and away from antonyms of descriptive words. Lego demonstrates superior performance compared to existing methods, including DreamBooth, Custom Diffusion, and natural language prompts, in accurately representing and synthesizing complex concepts. Human evaluation and Visual Question Answering using a large language model confirm that Lego-generated images better capture and convey the intended concepts. The authors acknowledge limitations in inverting concepts that exceed the capabilities of the base diffusion model, such as facial expressions in earlier Stable Diffusion versions. 
Future work includes exploring the inversion of dynamic concepts from example videos and ensuring ethical application of personalized visual media generation. diffusion_model, textual_inversion, concept_learning, image_generation, disentanglement, contrastive_learning
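The contrastive context guidance can be pictured as an InfoNCE-style term over the learned concept embeddings, with synonym embeddings as positives and antonym embeddings as negatives. The sketch below is one plausible form of that loss, added to the usual textual-inversion denoising objective; the exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def context_loss(concept_emb, synonym_embs, antonym_embs, tau=0.07):
    """InfoNCE-style context guidance for a learned concept token: synonyms of the
    descriptive word act as positives, antonyms as negatives."""
    q = F.normalize(concept_emb, dim=-1)             # (d,)
    pos = F.normalize(synonym_embs, dim=-1)          # (P, d)
    neg = F.normalize(antonym_embs, dim=-1)          # (N, d)
    logits = torch.cat([pos @ q, neg @ q]) / tau     # similarity to every candidate word
    log_probs = F.log_softmax(logits, dim=0)
    return -log_probs[: pos.shape[0]].mean()         # place probability mass on the synonyms

# Added to the usual textual-inversion denoising loss, alongside a separate "subject"
# embedding that absorbs the example subject's appearance (Subject Separation).
```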
2309.12314 TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi, Chen, Xinggang Wang, Hongyang Chao, Han Hu In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. The method introduces two core techniques: affinity mimicking and weight inheritance. Affinity mimicking explores the interaction between modalities during distillation, enabling student models to mimic teachers' behavior of learning cross-modal feature alignment in a visual-linguistic affinity space. Weight inheritance transmits the pre-trained weights from the teacher models to their student counterparts to improve distillation efficiency. Moreover, we extend the method into a multi-stage progressive distillation to mitigate the loss of informative weights during extreme compression. Comprehensive experiments demonstrate the efficacy of TinyCLIP, showing that it can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance. While aiming for comparable performance, distillation with weight inheritance can speed up the training by 1.4 - 7.8 $\times$ compared to training from scratch. Moreover, our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while utilizing only 8.9% parameters. Finally, we demonstrate the good transferability of TinyCLIP in various downstream tasks. Code and models will be open-sourced at https://aka.ms/tinyclip. This paper introduces TinyCLIP, a novel cross-modal distillation method designed to compress large-scale language-image pre-trained models like CLIP while preserving their zero-shot performance. The paper addresses the limitations of large language-image models, such as CLIP, which require significant storage, memory, and computational resources. TinyCLIP offers a solution by compressing these models, making them more practical for real-world applications without compromising performance. TinyCLIP utilizes two key techniques: affinity mimicking and weight inheritance. Affinity mimicking enables student models to learn cross-modal feature alignment by mimicking the teacher model's behavior in a visual-linguistic affinity space. Weight inheritance accelerates distillation by transferring pre-trained weights from teacher to student models, either manually or automatically using learnable masks. TinyCLIP employs a multi-stage progressive distillation process for high compression rates, gradually reducing model size while retaining important weights and knowledge. TinyCLIP achieves impressive compression rates while maintaining competitive performance on various benchmarks. For example, TinyCLIP ViT-8M/16 surpasses the original CLIP ViT-B/16 on ImageNet zero-shot top-1 accuracy despite having significantly fewer parameters. Additionally, TinyCLIP demonstrates faster training times compared to training from scratch and shows strong transfer learning capabilities in zero-shot and linear-probe classification tasks. The paper acknowledges that further research is needed to enhance cross-modal distillation efficiency for even smaller models. Future work could explore alternative compression techniques or investigate methods to optimize the trade-off between model size, speed, and accuracy. diffusion_model, llm, analysis, 3d, video, interpretability
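Affinity mimicking distills the teacher's image-text similarity structure rather than individual features. The sketch below matches student and teacher batch-level affinity matrices with a KL term in both retrieval directions; the temperature and exact loss weighting are illustrative.

```python
import torch
import torch.nn.functional as F

def affinity_mimicking_loss(img_s, txt_s, img_t, txt_t, tau=0.01):
    """Match the student's batch-level image-text affinity to the teacher's in both
    retrieval directions (image-to-text and text-to-image)."""
    def affinity(img, txt):
        img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
        return img @ txt.t() / tau                    # (B, B) similarity logits

    a_s, a_t = affinity(img_s, txt_s), affinity(img_t, txt_t)
    kl = lambda s, t: F.kl_div(F.log_softmax(s, -1), F.softmax(t, -1), reduction="batchmean")
    return kl(a_s, a_t) + kl(a_s.t(), a_t.t())
```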
2403.17804 Improving Text-to-Image Consistency via Automatic Prompt Optimization Oscar Mañas, Pietro Astolfi, Melissa Hall, Candace Ross, Jack Urbanek, Adina Williams, Aishwarya Agrawal, Adriana Romero-Soriano, Michal Drozdzal Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs. This paper introduces OPT2I, an optimization-by-prompting framework for text-to-image (T2I) models that improves prompt-image consistency without requiring parameter updates or training data. Despite advancements in image quality, T2I models often struggle with accurately representing all elements from the input text prompt in the generated image. This paper is important because it addresses this challenge by leveraging large language models (LLMs) to iteratively refine user prompts and enhance the consistency between the text input and the visual output. OPT2I employs an LLM in conjunction with a pre-trained T2I model and a prompt-image consistency metric (either decomposed CLIPScore or Davidsonian Scene Graph score). The LLM receives an initial user prompt and iteratively generates revised prompts, aiming to maximize the consistency score. The framework then uses the best-performing prompts as in-context examples for subsequent iterations, gradually improving the alignment between the generated images and the user's intent. Experimental results demonstrate that OPT2I effectively improves prompt-image consistency across various LLMs, T2I models, and datasets (MSCOCO and PartiPrompts). Notably, OPT2I achieves up to 24.9% improvement in consistency while preserving image quality (FID) and enhancing image diversity (recall). Qualitative analysis suggests that the optimized prompts tend to emphasize initially overlooked visual elements by either providing more detailed descriptions or repositioning them within the prompt. The paper acknowledges limitations in existing prompt-image consistency metrics, which might not always accurately capture complex relationships or could be susceptible to adversarial examples. The authors suggest further research on more robust consistency metrics as a direction for future work. Another limitation is the computational cost associated with the iterative optimization process. 
diffusion_model, llm, analysis, text-to-image, interpretability, prompt_engineering, consistency
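The optimization loop itself is model-agnostic and easy to sketch: keep the best-scoring prompts as in-context exemplars, ask the LLM for revisions, and always score consistency against the original user prompt. In the sketch below, `llm_propose`, `t2i_generate`, and `consistency_score` are hypothetical placeholders for the LLM call, the T2I model, and a metric such as DSG or decomposed CLIPScore.

```python
def opt2i(user_prompt, llm_propose, t2i_generate, consistency_score,
          iterations=10, proposals_per_iter=5, k_best=4):
    """Optimization-by-prompting: keep the best prompts found so far as in-context
    exemplars, ask the LLM for revisions, and keep whatever scores higher."""
    history = [(consistency_score(t2i_generate(user_prompt), user_prompt), user_prompt)]
    for _ in range(iterations):
        exemplars = sorted(history, reverse=True)[:k_best]
        for candidate in llm_propose(user_prompt, exemplars, n=proposals_per_iter):
            images = t2i_generate(candidate)
            # consistency is always measured against the ORIGINAL user intent
            history.append((consistency_score(images, user_prompt), candidate))
    return max(history)[1]   # best revised prompt found
```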
2403.07691 ORPO: Monolithic Preference Optimization without Reference Model Jiwoo Hong, Noah Lee, James Thorne While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on $\text{AlpacaEval}_{2.0}$ (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-$\alpha$ (7B) and Mistral-ORPO-$\beta$ (7B). This paper investigates the crucial role of supervised fine-tuning (SFT) in preference alignment for language models and introduces ORPO, a novel monolithic odds ratio preference optimization algorithm that eliminates the need for a separate preference alignment phase. This work is significant as it simplifies preference alignment, improves efficiency, and enhances performance compared to existing multi-stage methods like RLHF and DPO. It sheds light on the understudied role of SFT in preference alignment and offers a more streamlined approach. The authors conduct experiments fine-tuning various language models (OPT, Phi-2, Llama-2, Mistral) using ORPO on preference datasets like HH-RLHF and UltraFeedback. They compare ORPO's performance with SFT, RLHF, and DPO across various model sizes and evaluate instruction-following abilities using AlpacaEval and MT-Bench. Key findings include that a minor penalty for disfavored generation styles during SFT is sufficient for preference alignment. ORPO outperforms SFT, RLHF, and DPO in reward model win rates and achieves state-of-the-art results on AlpacaEval and MT-Bench, exceeding even larger language models. Limitations include the need for comparison with a wider range of preference alignment algorithms and scaling beyond 7B models. Future work involves exploring diverse datasets, analyzing ORPO's impact on pre-trained models, and expanding to other NLP tasks. diffusion_model, llm, analysis, preference_alignment, sft, instruction_following, rlhf, dpo
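The ORPO objective is the standard SFT loss on the preferred response plus an odds-ratio penalty contrasting preferred and rejected responses, with no reference model. The sketch below takes sequence probability as the exponentiated average token log-likelihood and ignores padding/masking; `lam` is illustrative.

```python
import torch
import torch.nn.functional as F

def orpo_loss(logits_w, labels_w, logits_l, labels_l, lam=0.1):
    """SFT loss on the preferred response plus an odds-ratio penalty contrasting the
    preferred (w) and rejected (l) responses; no reference model is involved."""
    def avg_logprob(logits, labels):              # mean token log-likelihood of a response
        logp = F.log_softmax(logits, dim=-1)
        tok = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
        return tok.mean(dim=-1)

    lp_w, lp_l = avg_logprob(logits_w, labels_w), avg_logprob(logits_l, labels_l)
    log_odds = lambda lp: lp - torch.log1p(-torch.exp(lp))      # log( p / (1 - p) )
    l_or = -F.logsigmoid(log_odds(lp_w) - log_odds(lp_l)).mean()
    l_sft = -lp_w.mean()                                        # NLL of the chosen response
    return l_sft + lam * l_or
```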
2405.06535 Controllable Image Generation With Composed Parallel Token Prediction Jamie Stirling, Noura Al-Moubayed Compositional image generation requires models to generalise well in situations where two or more input concepts do not necessarily appear together in training (compositional generalisation). Despite recent progress in compositional image generation via composing continuous sampling processes such as diffusion and energy-based models, composing discrete generative processes has remained an open challenge, with the promise of providing improvements in efficiency, interpretability and simplicity. To this end, we propose a formulation for controllable conditional generation of images via composing the log-probability outputs of discrete generative models of the latent space. Our approach, when applied alongside VQ-VAE and VQ-GAN, achieves state-of-the-art generation accuracy in three distinct settings (FFHQ, Positional CLEVR and Relational CLEVR) while attaining competitive Fr\'echet Inception Distance (FID) scores. Our method attains an average generation accuracy of $80.71\%$ across the studied settings. Our method also outperforms the next-best approach (ranked by accuracy) in terms of FID in seven out of nine experiments, with an average FID of $24.23$ (an average improvement of $-9.58$). Furthermore, our method offers a $2.3\times$ to $12\times$ speedup over comparable continuous compositional methods on our hardware. We find that our method can generalise to combinations of input conditions that lie outside the training data (e.g. more objects per image) in addition to offering an interpretable dimension of controllability via concept weighting. We further demonstrate that our approach can be readily applied to an open pre-trained discrete text-to-image model without any fine-tuning, allowing for fine-grained control of text-to-image generation. This paper presents a novel method for controllable compositional image generation using discrete generative models, achieving state-of-the-art accuracy by composing log-probability outputs from models like VQ-VAE and VQ-GAN. This paper is important as it enables the composition of discrete generative processes for image generation, unlike previous methods focused on continuous models. This allows for benefits such as improved efficiency, interpretability, and controllability, which are demonstrated through state-of-the-art results on multiple datasets. The authors derive a formulation for composing discrete generation processes by leveraging the product of conditional probabilities of individual concepts, assuming their independence. They apply this to parallel token prediction, generating images by iteratively unmasking discrete image representations conditioned on multiple input attributes using VQ-VAE/VQ-GAN. They further introduce concept weighting to control the relative importance of different conditions. The proposed method achieves state-of-the-art generation accuracy on FFHQ, Positional CLEVR, and Relational CLEVR datasets, surpassing previous methods while maintaining competitive FID scores. It also demonstrates strong generalization ability, including out-of-distribution generation and concept negation, while being significantly faster than comparable continuous compositional methods. The authors acknowledge the limitations of assuming independence between input conditions and the increased computational cost compared to non-compositional approaches. 
Future work could explore methods for handling condition dependencies and optimizing concept weighting. diffusion_model, gan, vq-vae, vq-gan, analysis, image_generation, compositionality, discrete_models, parallel_token_prediction, controllable_generation
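The core idea, composing the log-probability outputs of discrete latent-space generators under an independence assumption, can be sketched as follows. `model` and `mask_id` are hypothetical stand-ins for a parallel token predictor over VQ-VAE/VQ-GAN codes and its mask token; the random, uniform unmasking schedule is a simplification of whatever scheduler the paper actually uses.

```python
import torch

@torch.no_grad()
def composed_parallel_decode(model, conditions, weights, seq_len, mask_id,
                             steps=8, temperature=1.0):
    """Iteratively unmask a grid of VQ tokens while composing several conditions.

    model(tokens, cond) -> per-position logits of shape (seq_len, vocab); cond=None
    means unconditional. The composed log-probabilities follow
        log p(x | c_1..c_n) ~ log p(x) + sum_i w_i * (log p(x | c_i) - log p(x)),
    i.e. a weighted product of conditionals divided by the prior (independence assumed).
    """
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    # Unmask roughly equal-sized chunks of positions over the given number of steps.
    chunks = torch.randperm(seq_len).chunk(steps)

    for positions in chunks:
        base = torch.log_softmax(model(tokens, None), dim=-1)
        composed = base.clone()
        for cond, w in zip(conditions, weights):
            cond_logp = torch.log_softmax(model(tokens, cond), dim=-1)
            composed += w * (cond_logp - base)
        probs = torch.softmax(composed / temperature, dim=-1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        tokens[positions] = sampled[positions]   # commit only the chosen positions
    return tokens
```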
2404.03620 LCM-Lookahead for Encoder-based Text-to-Image Personalization Rinon Gal, Or Lichter, Elad Richardson, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or Recent advancements in diffusion models have introduced fast sampling methods that can effectively produce high-quality images in just one or a few denoising steps. Interestingly, when these are distilled from existing diffusion models, they often maintain alignment with the original model, retaining similar outputs for similar prompts and seeds. These properties present opportunities to leverage fast sampling methods as a shortcut-mechanism, using them to create a preview of denoised outputs through which we can backpropagate image-space losses. In this work, we explore the potential of using such shortcut-mechanisms to guide the personalization of text-to-image models to specific facial identities. We focus on encoder-based personalization approaches, and demonstrate that by tuning them with a lookahead identity loss, we can achieve higher identity fidelity, without sacrificing layout diversity or prompt alignment. We further explore the use of attention sharing mechanisms and consistent data generation for the task of personalization, and find that encoder training can benefit from both. This paper introduces a novel approach called LCM-Lookahead for enhancing encoder-based text-to-image personalization, specifically focusing on improving identity preservation and prompt alignment in generated facial images. This paper addresses the limitations of existing encoder-based personalization methods that often struggle to maintain identity fidelity and struggle with prompt alignment, particularly in stylized images, by proposing a novel training scheme and a shortcut mechanism to incorporate image-space losses during training. The authors leverage a fast-sampling Latent Consistency Model (LCM) as a 'shortcut' to preview the final denoised image during training. This preview is used to calculate an identity loss, providing a better training signal for identity preservation. They also introduce an attention-sharing mechanism to transfer visual features from the conditioning image and generate a consistent synthetic dataset using SDXL-Turbo to improve prompt alignment. The proposed method demonstrates superior performance in preserving facial identity and aligning with textual prompts, even in stylized images, compared to existing state-of-the-art encoder-based methods. Both quantitative and qualitative evaluations, including a user study, confirm the effectiveness of their approach. The authors acknowledge limitations in handling out-of-domain images and potential biases inherited from the backbone model. Future work involves exploring optimization-based methods on top of their approach to further enhance quality and address potential ethical concerns related to facial editing technology. diffusion_model, personalization, face_generation, lcm, analysis, attention_mechanism, image_generation
2310.04687 Improving Adversarial Attacks on Latent Diffusion Model Boyang Zheng, Chumeng Liang, Xiaoyu Wu, Yan Liu Adversarial attacks on Latent Diffusion Model (LDM), the state-of-the-art image generative model, have been adopted as effective protection against malicious finetuning of LDM on unauthorized images. We show that these attacks add an extra error to the score function of adversarial examples predicted by LDM. LDM finetuned on these adversarial examples learns to lower the error by a bias, from which the model is attacked and predicts the score function with biases. Based on the dynamics, we propose to improve the adversarial attack on LDM by Attacking with Consistent score-function Errors (ACE). ACE unifies the pattern of the extra error added to the predicted score function. This induces the finetuned LDM to learn the same pattern as a bias in predicting the score function. We then introduce a well-crafted pattern to improve the attack. Our method outperforms state-of-the-art methods in adversarial attacks on LDM. This paper investigates adversarial attacks on Latent Diffusion Models (LDMs) and proposes a new method, Attacking with Consistent Errors (ACE), to improve their effectiveness in disrupting LDM finetuning for few-shot generation. This paper is important because it reveals a novel dynamic of adversarial attacks on LDMs, explaining how these attacks disrupt finetuning, and proposes a more effective attack method (ACE) to protect images from unauthorized copying or malicious use in LDM-based few-shot generation. The authors analyze the score-function errors of adversarial examples and identify a "reverse bias" in LDMs finetuned on such examples. They then propose ACE, which manipulates adversarial examples to induce a consistent error pattern, leading to predictable and optimizable sampling biases in the finetuned LDM. Experiments on SDEdit and LoRA pipelines, using CelebA-HQ and WikiArt datasets, demonstrate ACE's superior performance over existing methods. The proposed ACE method outperforms existing adversarial attacks on LDMs in disrupting both SDEdit and LoRA, two leading few-shot generation pipelines. ACE achieves this by inducing a consistent, optimizable pattern of errors in the finetuned LDM, leading to significant degradation in the quality of generated images. The paper also provides insights into the dynamics of adversarial attacks on LDMs, particularly the role of "reverse bias" in amplifying the impact of adversarial examples during finetuning. The authors acknowledge that the optimal target for maximizing the impact of ACE is still an open question and suggest exploring different target options in future work. Additionally, they plan to investigate the generalization of ACE to other LDM-based generative models and explore its robustness against potential defense mechanisms. diffusion_model, adversarial_attack, interpretability, ldm, few-shot generation, image_generation
2404.05961 LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3 popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data. Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data. This paper introduces LLM2Vec, an unsupervised approach for converting large decoder-only language models (LLMs) into effective text encoders by enabling bidirectional attention, incorporating masked next token prediction training, and applying unsupervised contrastive learning. This paper is important because it addresses the limitations of causal attention in decoder-only LLMs for text embedding tasks, offering a simple and efficient method to enhance their performance and compete with or surpass encoder-only models. The authors develop LLM2Vec, a three-step approach consisting of: 1) enabling bidirectional attention in decoder-only LLMs, 2) adapting the models using masked next token prediction (MNTP) training, and 3) enhancing sequence representation learning through unsupervised contrastive learning with SimCSE. They apply LLM2Vec to three LLMs (Sheared-LLaMA-1.3B, Llama-2-7B-chat, and Mistral-7B-Instruct-v0.2) and evaluate their performance on word- and sequence-level tasks using benchmarks such as CoNLL-2003 and MTEB. LLM2Vec-transformed models demonstrate substantial improvements on both word- and sequence-level tasks. Notably, they outperform strong encoder-only baselines on word-level tasks and achieve state-of-the-art results among unsupervised models on the MTEB benchmark. The authors also find that Mistral models inherently possess a degree of bidirectional attention, contributing to their strong performance. The authors acknowledge limitations regarding the computational demands of large LLMs and potential data contamination from pre-training. Future work could focus on mitigating these limitations by exploring techniques for efficient training and inference of large models, and evaluating on novel benchmarks to address data contamination concerns. Additionally, extending LLM2Vec to other languages beyond English presents a promising research direction. diffusion_model, llm, analysis, text_embedding, contrastive_learning
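Steps 2-3 of the recipe reduce to standard pieces: mean pooling over (now bidirectional) hidden states and an unsupervised SimCSE objective with dropout-based positives. A minimal sketch under those assumptions; `encoder` is a hypothetical wrapper that returns final hidden states from a decoder-only LLM whose causal mask has been disabled and which was first adapted with masked next token prediction.

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden_states, attention_mask):
    """Sequence embedding as the mask-weighted mean of token states."""
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

def simcse_loss(emb_a, emb_b, temperature=0.05):
    """Unsupervised SimCSE: two dropout-perturbed encodings of the same sentences are
    positives; all other in-batch pairs serve as negatives."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature          # (batch, batch) cosine similarities
    labels = torch.arange(emb_a.size(0), device=emb_a.device)
    return F.cross_entropy(logits, labels)

def llm2vec_step(encoder, input_ids, attention_mask):
    """One contrastive training step: encode the same batch twice (dropout gives two
    views), pool, and minimize the SimCSE loss."""
    h1 = encoder(input_ids, attention_mask)  # assumed to return final hidden states
    h2 = encoder(input_ids, attention_mask)
    return simcse_loss(mean_pool(h1, attention_mask), mean_pool(h2, attention_mask))
```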
2310.07204 State of the Art on Diffusion Models for Visual Computing Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T. Barron, Amit H. Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, C. Karen Liu, Lingjie Liu, Ben Mildenhall, Matthias Nießner, Björn Ommer, Christian Theobalt, Peter Wonka, Gordon Wetzstein The field of visual computing is rapidly advancing due to the emergence of generative artificial intelligence (AI), which unlocks unprecedented capabilities for the generation, editing, and reconstruction of images, videos, and 3D scenes. In these domains, diffusion models are the generative AI architecture of choice. Within the last year alone, the literature on diffusion-based tools and applications has seen exponential growth and relevant papers are published across the computer graphics, computer vision, and AI communities with new works appearing daily on arXiv. This rapid growth of the field makes it difficult to keep up with all recent developments. The goal of this state-of-the-art report (STAR) is to introduce the basic mathematical concepts of diffusion models, implementation details and design choices of the popular Stable Diffusion model, as well as overview important aspects of these generative AI tools, including personalization, conditioning, inversion, among others. Moreover, we give a comprehensive overview of the rapidly growing literature on diffusion-based generation and editing, categorized by the type of generated medium, including 2D images, videos, 3D objects, locomotion, and 4D scenes. Finally, we discuss available datasets, metrics, open challenges, and social implications. This STAR provides an intuitive starting point to explore this exciting topic for researchers, artists, and practitioners alike. This state-of-the-art report provides a comprehensive overview of diffusion models for visual computing, focusing on their applications in generating and editing images, videos, 3D objects, and 4D scenes. Diffusion models have revolutionized visual computing by enabling unprecedented capabilities for content creation and editing. This report is crucial for researchers, artists, and practitioners to understand the fundamentals, advancements, and open challenges in this rapidly evolving field. The report presents the mathematical foundations of diffusion models, discusses practical implementations using the Stable Diffusion model, and explores conditioning, guidance, inversion, editing, and customization techniques. It then categorizes and summarizes recent advancements in diffusion models for video, 3D, and 4D content generation, highlighting key methodologies and applications. The report highlights the significant advancements in diffusion models, showcasing their ability to generate realistic and creative content across various modalities. Key findings include the effectiveness of latent diffusion models, score distillation sampling for 3D generation, and the emergence of 4D spatio-temporal diffusion for dynamic scenes. The report outlines open challenges including: the need for better evaluation metrics, the scarcity of high-quality training data for 3D, video, and 4D content, the computational inefficiency of diffusion models, and the need for improved controllability and user interfaces. Future work may focus on addressing these challenges, exploring new applications, and improving robustness, reproducibility, and ethical considerations. 
diffusion_model, gan, analysis, literature_review, 2d, 3d, motion, video, 4d, text-to-image, text-to-video
2401.06209 Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify "CLIP-blind pairs" - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and found a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems. This paper explores the limitations of visual capabilities in Multimodal Large Language Models (MLLMs) that stem from the visual encoder, particularly CLIP, and proposes a Mixture of Features (MoF) approach to enhance visual grounding by integrating features from CLIP and vision-only self-supervised learning models. This paper is important because it sheds light on a crucial weakness in current state-of-the-art MLLMs, despite their impressive language capabilities, and proposes a potential solution to improve their visual grounding for more robust and reliable performance. The authors first identify "CLIP-blind pairs" - images perceived as similar by CLIP despite visual differences - and construct the Multimodal Visual Patterns (MMVP) benchmark to evaluate MLLMs' visual grounding. Then, they analyze systematic visual patterns in CLIP-blind pairs and propose MoF, experimenting with Additive MoF (linearly mixing features) and Interleaved MoF (spatially mixing visual tokens) to enhance visual grounding in MLLMs. Key findings include: (1) MLLMs, even the most advanced ones, struggle with seemingly simple visual questions in the MMVP benchmark. (2) Scaling up CLIP's training data and model size alone doesn't resolve challenges related to certain visual patterns. (3) A strong correlation exists between CLIP's failure patterns and MLLMs' visual incapability. (4) Integrating vision-only SSL features using MoF, particularly Interleaved MoF, significantly improves MLLMs' visual grounding without compromising instruction-following abilities. The authors acknowledge that MoF is an initial step and more sophisticated approaches are needed to fully address the visual limitations. 
Future work includes exploring advanced fusion techniques beyond linear and spatial mixing, designing more comprehensive benchmarks to evaluate diverse visual patterns and grounding abilities, and investigating new visual representation learning algorithms that better capture fine-grained visual details and relationships. diffusion_model, llm, analysis, benchmark, visual grounding, clip, self-supervised learning, multimodal, vision-and-language, representation learning
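The two MoF variants evaluated in the paper can be pictured with a small module: additive MoF linearly mixes projected CLIP and SSL patch features, while interleaved MoF keeps both token streams and alternates them spatially. The sketch below is a toy rendering under those assumptions; the projection layers, dimensions, and the assumption of spatially aligned patch grids are illustrative, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MixtureOfFeatures(nn.Module):
    """Toy sketch of Additive and Interleaved MoF for an MLLM visual front-end.

    clip_tokens: (batch, n_patches, d_clip) patch features from a CLIP encoder.
    ssl_tokens:  (batch, n_patches, d_ssl) patch features from a vision-only SSL
    encoder (e.g. DINOv2), assumed spatially aligned with the CLIP grid.
    """
    def __init__(self, d_clip, d_ssl, d_model, alpha=0.5, mode="interleaved"):
        super().__init__()
        self.proj_clip = nn.Linear(d_clip, d_model)  # per-encoder adapters into the LLM space
        self.proj_ssl = nn.Linear(d_ssl, d_model)
        self.alpha, self.mode = alpha, mode

    def forward(self, clip_tokens, ssl_tokens):
        c = self.proj_clip(clip_tokens)
        s = self.proj_ssl(ssl_tokens)
        if self.mode == "additive":
            return self.alpha * c + (1.0 - self.alpha) * s   # linear feature mixing
        # Interleaved: keep both token sets, alternating spatially, so the LLM sees a
        # CLIP token and an SSL token for every patch location.
        b, n, d = c.shape
        return torch.stack([c, s], dim=2).reshape(b, 2 * n, d)
```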
2403.13187 Evolutionary Optimization of Model Merging Recipes Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, David Ha We present a novel application of evolutionary algorithms to automate the creation of powerful foundation models. While model merging has emerged as a promising approach for LLM development due to its cost-effectiveness, it currently relies on human intuition and domain knowledge, limiting its potential. Here, we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring extensive additional training data or compute. Our approach operates in both parameter space and data flow space, allowing for optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models like a Japanese LLM with Math reasoning capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese VLMs. This work not only contributes new state-of-the-art models back to the open-source community, but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development. This paper introduces a novel approach, Evolutionary Model Merge, which utilizes evolutionary algorithms to automate the merging of open-source foundation models, enabling the creation of new models with combined capabilities without the need for extensive training. This paper is important because it presents a more efficient and accessible method for developing foundation models, particularly for specialized domains and non-English languages, by leveraging the collective intelligence of existing open-source models. The authors employ evolutionary algorithms to optimize model merging in two spaces: parameter space (PS) for combining model weights and data flow space (DFS) for optimizing token inference paths through model layers. They demonstrate their method by evolving a Japanese LLM with Math reasoning capabilities and a culturally-aware Japanese VLM. The evolved Japanese LLM achieves state-of-the-art performance on Japanese LLM benchmarks, surpassing some 70B parameter models despite having only 7B parameters. Similarly, the evolved Japanese VLM excels in handling culturally-specific content, outperforming existing Japanese VLMs on a newly created benchmark. Limitations include potential for illogical outputs and lack of instruction fine-tuning. Future work involves applying the method to image generation, evolving source model selection, and developing a self-improving swarm of models. diffusion_model, llm, analysis, evolutionary_algorithm, model_merging, japanese, multi-modal, vlm
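In the parameter-space branch, the search amounts to evolving merge coefficients and scoring the merged model on a task metric (the paper uses a more capable optimizer such as CMA-ES, and the data-flow-space branch additionally evolves token routing through layers). Below is a toy (1+λ) evolution-strategy sketch over per-tensor interpolation weights; `fitness` is a hypothetical evaluation callback, e.g. accuracy of the merged model on a small dev set.

```python
import random

def merge_state_dicts(sd_a, sd_b, coeffs):
    """Parameter-space merge: per-tensor linear interpolation between two models."""
    return {k: coeffs[k] * sd_a[k] + (1.0 - coeffs[k]) * sd_b[k] for k in sd_a}

def evolve_merge(sd_a, sd_b, fitness, generations=20, population=8, sigma=0.1):
    """Toy (1+lambda) evolution strategy over per-tensor merge coefficients."""
    best = {k: 0.5 for k in sd_a}                     # start from an even merge
    best_fit = fitness(merge_state_dicts(sd_a, sd_b, best))
    for _ in range(generations):
        for _ in range(population):
            # Mutate each coefficient with Gaussian noise, clipped to [0, 1].
            child = {k: min(1.0, max(0.0, v + random.gauss(0.0, sigma)))
                     for k, v in best.items()}
            fit = fitness(merge_state_dicts(sd_a, sd_b, child))
            if fit > best_fit:
                best, best_fit = child, fit
    return merge_state_dicts(sd_a, sd_b, best), best_fit
```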
2311.17082 DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling Linqi Zhou, Andy Shih, Chenlin Meng, Stefano Ermon Recent methods such as Score Distillation Sampling (SDS) and Variational Score Distillation (VSD) using 2D diffusion models for text-to-3D generation have demonstrated impressive generation quality. However, the long generation time of such algorithms significantly degrades the user experience. To tackle this problem, we propose DreamPropeller, a drop-in acceleration algorithm that can be wrapped around any existing text-to-3D generation pipeline based on score distillation. Our framework generalizes Picard iterations, a classical algorithm for parallel sampling an ODE path, and can account for non-ODE paths such as momentum-based gradient updates and changes in dimensions during the optimization process as in many cases of 3D generation. We show that our algorithm trades parallel compute for wallclock time and empirically achieves up to 4.7x speedup with a negligible drop in generation quality for all tested frameworks. This paper introduces DreamPropeller, a method for accelerating text-to-3D generation using score distillation by generalizing Picard iterations to handle complex computation graphs and leveraging parallel compute. The paper addresses the slow generation time of existing text-to-3D methods that utilize score distillation, which hinders their practical use despite high generation quality. The authors generalize Picard iterations, a parallel ODE solving technique, to handle the intricacies of 3D generation, such as momentum-based updates and varying dimensionality, and apply this generalized framework to accelerate existing score distillation methods. DreamPropeller achieves up to 4.7x speedup across various 3D representations and score distillation techniques, including NeRF, DMTet, SDF, and 3D Gaussian Splatting, with negligible drop in generation quality measured by CLIP R-Precision and FID. The paper acknowledges limitations in perfectly matching baseline quality due to the fixed-point error and suggests exploring alternative distance metrics or adaptive optimization strategies for further improvement. Future work may also involve investigating the application of DreamPropeller to other domains beyond 3D generation. diffusion_model, 3d, acceleration, score_distillation, nerf, gaussian_splatting, sds, vsd
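The classical Picard iteration that DreamPropeller generalizes can be shown on a toy sequential update: guess the whole trajectory, then repeatedly recompute every step from the previous sweep's states, which makes all function evaluations within a sweep independent and parallelizable. A NumPy sketch of that idea follows; the paper's version further handles momentum-based optimizers and changing parameter dimensions, which are omitted here.

```python
import numpy as np

def picard_parallel_rollout(f, x0, num_steps, sweeps=10, tol=1e-6):
    """Parallel-in-time evaluation of the sequential update x_{t+1} = x_t + f(x_t, t).

    Picard sweep: given a guess for the whole trajectory, recompute
        x_t <- x_0 + sum_{s<t} f(x_s, s)
    using last sweep's states; all f evaluations within a sweep are independent,
    so they can run on parallel workers. The fixed point is the sequential result.
    """
    x0 = np.asarray(x0, dtype=float)
    traj = np.tile(x0, (num_steps + 1, 1))
    for _ in range(sweeps):
        # These evaluations are embarrassingly parallel across t.
        deltas = np.stack([f(traj[t], t) for t in range(num_steps)])
        new_traj = traj.copy()
        new_traj[1:] = x0 + np.cumsum(deltas, axis=0)
        converged = np.max(np.abs(new_traj - traj)) < tol
        traj = new_traj
        if converged:
            break
    return traj

if __name__ == "__main__":
    # Example: a damped linear flow; 40 sequential steps recovered with a few sweeps.
    f = lambda x, t: -0.05 * x
    print(picard_parallel_rollout(f, np.ones(3), num_steps=40, sweeps=50)[-1])
```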
2405.01008 On Mechanistic Knowledge Localization in Text-to-Image Generative Models Samyadeep Basu, Keivan Rezaei, Priyatham Kattakinda, Ryan Rossi, Cherry Zhao, Vlad Morariu, Varun Manjunatha, Soheil Feizi Identifying layers within text-to-image models which control visual attributes can facilitate efficient model editing through closed-form updates. Recent work, leveraging causal tracing show that early Stable-Diffusion variants confine knowledge primarily to the first layer of the CLIP text-encoder, while it diffuses throughout the UNet. Extending this framework, we observe that for recent models (e.g., SD-XL, DeepFloyd), causal tracing fails in pinpointing localized knowledge, highlighting challenges in model editing. To address this issue, we introduce the concept of Mechanistic Localization in text-to-image models, where knowledge about various visual attributes (e.g., "style", "objects", "facts") can be mechanistically localized to a small fraction of layers in the UNet, thus facilitating efficient model editing. We localize knowledge using our method LocoGen which measures the direct effect of intermediate layers to output generation by performing interventions in the cross-attention layers of the UNet. We then employ LocoEdit, a fast closed-form editing method across popular open-source text-to-image models (including the latest SD-XL) and explore the possibilities of neuron-level model editing. Using Mechanistic Localization, our work offers a better view of successes and failures in localization-based text-to-image model editing. Code will be available at https://github.com/samyadeepbasu/LocoGen. This paper investigates the localization of knowledge within text-to-image generative models, particularly focusing on identifying specific layers responsible for controlling visual attributes like "style", "objects", and "facts". This work is crucial as it offers a deeper understanding of how knowledge is represented within these complex models, facilitating efficient model editing techniques for tasks like removing specific styles, modifying objects, or updating factual information. The authors first analyze the effectiveness of causal tracing in localizing knowledge across various text-to-image models, including SD-XL and DeepFloyd. They then introduce LocoGen, a novel method to pinpoint control regions for visual attributes by intervening in the cross-attention layers of the UNet. Subsequently, they employ LocoEdit, a closed-form editing method, to manipulate the identified locations and evaluate its effectiveness. The research demonstrates that LocoGen successfully identifies unique locations controlling visual attributes across different text-to-image models. Moreover, LocoEdit effectively implements edits at these locations for most models, except DeepFloyd, which exhibits limitations due to its bi-directional attention mechanism in the T5 text encoder. Notably, the study reveals that knowledge about specific styles can be localized to even a small subset of neurons, highlighting the potential for neuron-level model editing. The authors acknowledge limitations in applying closed-form edits to DeepFloyd and suggest exploring fast editing methods for models utilizing bi-directional attention as future work. Further research directions include investigating the generalizability of neuron-level editing beyond "style" to other attributes like "objects" and "facts". diffusion_model, analysis, interpretability, text-to-image, model_editing
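The LocoGen-style probe amounts to a sweep: for each small window of cross-attention layers, swap in a "neutral" text embedding at only those layers and measure how much the output changes. A minimal sketch under stated assumptions; `denoise`, `score`, and the per-layer embedding interface are hypothetical stand-ins for a real Stable Diffusion pipeline.

```python
def locate_knowledge(num_layers, denoise, original_emb, neutral_emb, score, window=2):
    """Toy LocoGen-style sweep over the UNet's cross-attention layers.

    denoise(per_layer_embs) -> image, where per_layer_embs[i] is the text embedding
    fed to cross-attention layer i; score(image) -> float, e.g. CLIP similarity to
    the probed attribute ("style", "object", "fact"). Both callables and the
    embeddings are hypothetical stand-ins.
    """
    baseline = score(denoise([original_emb] * num_layers))
    effects = []
    for start in range(num_layers - window + 1):
        embs = [original_emb] * num_layers
        for i in range(start, start + window):
            embs[i] = neutral_emb            # intervene only on this window of layers
        effects.append((start, baseline - score(denoise(embs))))
    # Windows whose intervention changes the score most are where the attribute lives.
    return sorted(effects, key=lambda e: e[1], reverse=True)
```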
2402.10491 Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models for higher resolution demands substantial computational and optimization resources, yet achieving a generation capability comparable to low-resolution models remains elusive. This paper proposes a novel self-cascade diffusion model that leverages the rich knowledge gained from a well-trained low-resolution model for rapid adaptation to higher-resolution image and video generation, employing either tuning-free or cheap upsampler tuning paradigms. Integrating a sequence of multi-scale upsampler modules, the self-cascade diffusion model can efficiently adapt to a higher resolution, preserving the original composition and generation capabilities. We further propose a pivot-guided noise re-schedule strategy to speed up the inference process and improve local structural details. Compared to full fine-tuning, our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time. This paper introduces a novel self-cascade diffusion model that leverages a pre-trained low-resolution model to efficiently adapt to higher-resolution image and video generation tasks. This paper addresses the challenge of computationally expensive fine-tuning required to adapt pre-trained diffusion models for higher-resolution generation. It proposes an efficient method that achieves significant training speed-up while maintaining generation quality, enabling wider application of diffusion models in high-resolution settings. The authors propose two versions of their self-cascade diffusion model: a tuning-free version that utilizes a pivot-guided noise re-scheduling strategy to leverage the low-resolution model's knowledge, and a tuning version that incorporates learnable time-aware feature upsampler modules for improved detail with minimal fine-tuning on a small high-resolution dataset. They evaluate their method on both image and video generation tasks, comparing it to full fine-tuning and other adaptation techniques. The self-cascade diffusion model demonstrates significant training speed-up (5x) compared to full fine-tuning, requiring minimal additional trainable parameters (0.002M) and negligible extra inference time. Experiments on image and video generation tasks show that it achieves state-of-the-art performance in both tuning-free and tuning settings, effectively adapting to higher resolutions while preserving the original model's generation capabilities and outperforming competing methods in terms of quality and efficiency. The authors acknowledge that the limited capacity of the lightweight upsampler modules may pose limitations, especially for very large scale gaps. 
Future work may involve exploring the trade-off between adaptation efficiency and generalization ability, potentially by incorporating more sophisticated upsampling mechanisms or investigating alternative methods for knowledge transfer from the low-resolution model. diffusion_model, image_generation, video_generation, high_resolution, adaptation, efficiency, self-cascade
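One reading of the tuning-free, pivot-guided re-noising strategy is: finish a low-resolution sample, upsample it, forward-diffuse it back to an intermediate timestep, and let the same pretrained model complete the reverse process at the higher resolution. The sketch below follows that reading; `sample_lowres`, `denoise_from`, and the noise-schedule tensor are hypothetical stand-ins, and the learned time-aware upsampler modules of the tuning variant are omitted.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def tuning_free_self_cascade(denoise_from, sample_lowres, alphas_cumprod,
                             scale=2, pivot_ratio=0.5):
    """Sketch of pivot-guided re-noising for higher-resolution adaptation.

    sample_lowres() -> fully denoised latent of shape (b, c, h, w) at the base size.
    denoise_from(z_t, t_start) -> latent, continues the reverse process from t_start.
    alphas_cumprod: 1-D tensor of cumulative alphas, indexed by timestep.
    """
    z_low = sample_lowres()
    z_up = F.interpolate(z_low, scale_factor=scale, mode="bilinear", align_corners=False)

    t_pivot = int(pivot_ratio * (len(alphas_cumprod) - 1))
    a_bar = alphas_cumprod[t_pivot]
    # Forward-diffuse the upsampled pivot: z_t = sqrt(a_bar) * z + sqrt(1 - a_bar) * eps
    z_t = a_bar.sqrt() * z_up + (1 - a_bar).sqrt() * torch.randn_like(z_up)

    # The base model now only has to repair local detail at the new resolution.
    return denoise_from(z_t, t_pivot)
```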
2312.02133 Style Aligned Image Generation via Shared Attention Amir Hertz, Andrey Voynov, Shlomi Fruchter, Daniel Cohen-Or Large-scale Text-to-Image (T2I) models have rapidly gained prominence across creative fields, generating visually compelling outputs from textual prompts. However, controlling these models to ensure consistent style remains challenging, with existing methods necessitating fine-tuning and manual intervention to disentangle content and style. In this paper, we introduce StyleAligned, a novel technique designed to establish style alignment among a series of generated images. By employing minimal `attention sharing' during the diffusion process, our method maintains style consistency across images within T2I models. This approach allows for the creation of style-consistent images using a reference style through a straightforward inversion operation. Our method's evaluation across diverse styles and text prompts demonstrates high-quality synthesis and fidelity, underscoring its efficacy in achieving consistent style across various inputs. This paper introduces StyleAligned, a method for generating sets of images with consistent styles from text prompts using pre-trained text-to-image diffusion models. This paper is important because it offers a new approach to controlling the style of generated images in text-to-image synthesis, which has been a challenging problem. Existing methods often require expensive fine-tuning or struggle to maintain consistency across different prompts, while StyleAligned achieves this without any training or optimization. StyleAligned works by introducing a shared attention mechanism into the diffusion process. When generating a set of images, each image attends to the features of a reference image, typically the first in the batch, during specific layers in the diffusion process. This attention sharing is further enhanced by using Adaptive Instance Normalization (AdaIN) to balance attention flow and improve style alignment. The paper shows that StyleAligned outperforms existing T2I personalization methods, such as StyleDrop and DreamBooth, in terms of style consistency while maintaining good alignment with text prompts. Notably, it generates more coherent sets of images with shared stylistic elements, as evidenced by both qualitative examples and quantitative metrics using CLIP and DINO embeddings. Furthermore, the method is flexible and can be integrated with other diffusion-based techniques like ControlNet and MultiDiffusion, demonstrating its potential for various applications. The paper acknowledges limitations in controlling the degree of shape and appearance similarity between generated images and highlights the need for improved diffusion inversion techniques. Future work could focus on these aspects and explore the use of StyleAligned for creating large, style-aligned datasets to train novel text-to-image models. diffusion_model, style_transfer, image_generation, attention_mechanism, text-to-image, zero-shot, consistency, adain, controlnet, multidiffusion
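The attention-sharing mechanism can be sketched directly: the target image's queries attend to its own keys/values concatenated with the reference image's, after AdaIN-normalizing the target's queries and keys toward the reference statistics. A minimal PyTorch sketch under those assumptions; head counts, shapes, and the exact placement within the diffusion layers are illustrative.

```python
import torch
import torch.nn.functional as F

def adain(x, ref, eps=1e-5):
    """Match mean/std of x to the reference along the token dimension."""
    mu_x, std_x = x.mean(dim=1, keepdim=True), x.std(dim=1, keepdim=True) + eps
    mu_r, std_r = ref.mean(dim=1, keepdim=True), ref.std(dim=1, keepdim=True) + eps
    return (x - mu_x) / std_x * std_r + mu_r

def shared_self_attention(q, k, v, q_ref, k_ref, v_ref, num_heads=8):
    """StyleAligned-style attention sharing for one self-attention layer.

    q, k, v:             (batch, tokens, dim) projections for the image being generated.
    q_ref, k_ref, v_ref: (batch, tokens, dim) projections from the reference image.
    """
    q = adain(q, q_ref)
    k = adain(k, k_ref)
    k_all = torch.cat([k, k_ref], dim=1)   # target attends to its own and the
    v_all = torch.cat([v, v_ref], dim=1)   # reference image's keys/values

    b, n, d = q.shape
    def split(x):  # (b, tokens, d) -> (b, heads, tokens, d // heads)
        return x.view(b, x.shape[1], num_heads, d // num_heads).transpose(1, 2)

    out = F.scaled_dot_product_attention(split(q), split(k_all), split(v_all))
    return out.transpose(1, 2).reshape(b, n, d)
```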
2310.05916 Interpreting CLIP's Image Representation via Text-Based Decomposition Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models. This paper investigates the internal structure of CLIP's image encoder, particularly the ViT-based variant, to understand how individual components like layers, attention heads, and image patches contribute to the final image representation. This work is important because it provides a deeper understanding of how CLIP encodes information, which can be used to improve its performance on downstream tasks. By decomposing CLIP's representations and linking them to specific components and image regions, the authors offer insights into the model's decision-making process and pave the way for more interpretable and robust vision-language models. The authors decompose CLIP's image representation into contributions from individual layers, attention heads, and image tokens. They leverage the residual structure of ViT to analyze direct contributions and develop an algorithm called TextSpan to associate text descriptions with the latent directions of each attention head. By analyzing these text descriptions and visualizing the contributions of different image regions, they uncover specific roles for many attention heads and reveal an emergent spatial localization within CLIP. The paper demonstrates that the last few attention layers in CLIP-ViT have the most significant direct effect on the image representation. The authors also find that many attention heads specialize in capturing specific image properties like shape, color, or location. They leverage this finding to reduce spurious correlations in downstream classification tasks and achieve state-of-the-art performance on zero-shot semantic image segmentation. The authors acknowledge limitations in addressing indirect effects between layers and the lack of clear roles for all attention heads. Future work could explore these indirect effects, analyze the collaborative roles of multiple heads, and extend the analysis to other CLIP architectures like ResNet. diffusion_model, llm, analysis, interpretability, attention, clip, vit, zero-shot learning, segmentation, spurious correlations
2310.12103 Quality Diversity through Human Feedback Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, Joel Lehman Reinforcement Learning from Human Feedback (RLHF) has shown potential in qualitative tasks where clear objectives are lacking. However, its effectiveness is not fully realized when it is conceptualized merely as a tool to optimize average human preferences, especially in generative tasks that demand diverse model responses. Meanwhile, Quality Diversity (QD) algorithms excel at identifying diverse and high-quality solutions but often rely on manually crafted diversity metrics. This paper introduces Quality Diversity through Human Feedback (QDHF), a novel approach integrating human feedback into the QD framework. QDHF infers diversity metrics from human judgments of similarity among solutions, thereby enhancing the applicability and effectiveness of QD algorithms. Our empirical studies show that QDHF significantly outperforms state-of-the-art methods in automatic diversity discovery and matches the efficacy of using manually crafted metrics for QD on standard benchmarks in robotics and reinforcement learning. Notably, in a latent space illumination task, QDHF substantially enhances the diversity in images generated by a diffusion model and was more favorably received in user studies. We conclude by analyzing QDHF's scalability and the quality of its derived diversity metrics, emphasizing its potential to improve exploration and diversity in complex, open-ended optimization tasks. Source code is available on GitHub: https://github.com/ld-ing/qdhf. This paper introduces Quality Diversity through Human Feedback (QDHF), a novel approach that integrates human feedback into Quality Diversity (QD) algorithms to automatically learn diversity metrics for optimizing the generation of diverse and high-quality solutions. This paper is important because it addresses the limitations of existing QD algorithms that rely on manually crafted diversity metrics, which restricts their applicability in complex and open-ended tasks where defining such metrics is challenging. QDHF offers a more flexible and adaptable approach by leveraging human feedback to learn diversity metrics, potentially leading to improved exploration and diversity in various domains. The authors propose an implementation of QDHF using latent space projection and contrastive learning. They first train a latent projection model to map solutions into a latent space, where each dimension represents a learned diversity metric. Then, they use human judgments on the similarity of solutions to fine-tune the latent projection model via contrastive learning, ensuring the learned diversity metrics align with human perception. They evaluate QDHF on three benchmark tasks: robotic arm control, maze navigation, and latent space illumination for image generation, comparing it against existing QD algorithms with unsupervised diversity discovery and ground truth metrics. Experimental results demonstrate that QDHF significantly outperforms unsupervised diversity discovery methods in QD, achieving both higher quality and diversity in the generated solutions. Notably, in the latent space illumination task, QDHF successfully generates more diverse images while maintaining high quality compared to baseline methods. User studies further confirm that QDHF-generated images are perceived as more diverse and preferred by humans. 
The authors acknowledge that the performance of QDHF relies on the accuracy of the learned latent projection model and the quality of human feedback. They suggest future work focusing on improving the generalization of the preference model used to collect human feedback, exploring strategies for efficient and diverse data collection, and applying QDHF to more complex and open-ended tasks in robotics, reinforcement learning, and generative modeling. diffusion_model, analysis, 3d, motion, interpretability, quality_diversity, human_feedback, contrastive_learning, latent_space, image_generation
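The implementation described above, learning diversity metrics via a latent projection trained on human similarity judgments, reduces to a small projection head and a triplet-style contrastive loss. A minimal sketch under those assumptions; the feature extractor, the QD archive, and the feedback format are hypothetical stand-ins.

```python
import torch.nn as nn
import torch.nn.functional as F

class DiversityProjection(nn.Module):
    """Linear projection whose output dimensions act as learned diversity metrics
    (behavior descriptors) for a quality-diversity archive."""
    def __init__(self, feature_dim, num_metrics=2):
        super().__init__()
        self.proj = nn.Linear(feature_dim, num_metrics)

    def forward(self, features):
        return self.proj(features)

def triplet_judgment_loss(model, anchor, positive, negative, margin=1.0):
    """Contrastive loss from a human judgment of the form 'anchor is more similar to
    positive than to negative', applied in the learned latent space so its axes align
    with human-perceived variation."""
    za, zp, zn = model(anchor), model(positive), model(negative)
    d_pos = F.pairwise_distance(za, zp)
    d_neg = F.pairwise_distance(za, zn)
    return F.relu(d_pos - d_neg + margin).mean()

# Usage sketch: `anchor` etc. could be CLIP embeddings of generated images, with human
# feedback supplying the triplets. After training, model(features) gives the coordinates
# used to bin solutions into the MAP-Elites archive instead of hand-crafted descriptors.
```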
2312.07409 DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing Kaiwen Zhang, Yifan Zhou, Xudong Xu, Xingang Pan, Bo Dai Diffusion models have achieved remarkable image generation quality surpassing previous generative models. However, a notable limitation of diffusion models, in comparison to GANs, is their difficulty in smoothly interpolating between two image samples, due to their highly unstructured latent space. Such a smooth interpolation is intriguing as it naturally serves as a solution for the image morphing task with many applications. In this work, we present DiffMorpher, the first approach enabling smooth and natural image interpolation using diffusion models. Our key idea is to capture the semantics of the two images by fitting two LoRAs to them respectively, and interpolate between both the LoRA parameters and the latent noises to ensure a smooth semantic transition, where correspondence automatically emerges without the need for annotation. In addition, we propose an attention interpolation and injection technique and a new sampling schedule to further enhance the smoothness between consecutive images. Extensive experiments demonstrate that DiffMorpher achieves starkly better image morphing effects than previous methods across a variety of object categories, bridging a critical functional gap that distinguished diffusion models from GANs. This paper introduces DiffMorpher, a novel approach leveraging pre-trained diffusion models like Stable Diffusion to generate smooth and natural image morphing sequences. This paper is significant as it addresses a key limitation of diffusion models compared to GANs: their difficulty in smooth image interpolation, essential for realistic image morphing with various applications in animation, entertainment, and data augmentation. DiffMorpher works by first fine-tuning two LoRAs to capture the semantics of two input images. Then, it interpolates between both the LoRA parameters and the latent noises obtained by DDIM inversion, ensuring smooth semantic and spatial transitions. It further incorporates attention interpolation and replacement for texture consistency, AdaIN adjustment for color coherence, and a new sampling schedule for uniform transition speed. DiffMorpher demonstrates superior performance over existing image morphing methods, evidenced by lower FID, PPL, and a newly proposed PDV metric on their MorphBench dataset. The approach produces high-quality, semantically consistent, and smooth image morphing sequences for diverse objects and styles, confirmed by both qualitative and quantitative evaluations, including a user study. Limitations include the need for LoRA training time for each image pair and reliance on text prompts. Future work could explore faster adaptation methods and incorporate correspondence information for challenging cases with unclear object alignment. diffusion_model, image_morphing, lora, attention_mechanism, smooth_interpolation, stable diffusion, ddim, adain
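The morphing recipe itself is compact: fit one LoRA per endpoint image, then, for each intermediate frame, linearly interpolate the LoRA weights and spherically interpolate the DDIM-inverted latents before sampling. A minimal sketch under those assumptions; `generate` is a hypothetical wrapper around a Stable Diffusion pipeline with the blended LoRA applied, and the attention interpolation/injection, AdaIN adjustment, and resampling schedule are omitted.

```python
import torch

def slerp(z0, z1, t, eps=1e-7):
    """Spherical interpolation between two Gaussian latents (preserves noise statistics)."""
    z0_f, z1_f = z0.flatten(), z1.flatten()
    omega = torch.acos(torch.clamp(
        torch.dot(z0_f / z0_f.norm(), z1_f / z1_f.norm()), -1 + eps, 1 - eps))
    return (torch.sin((1 - t) * omega) * z0 + torch.sin(t * omega) * z1) / torch.sin(omega)

def interpolate_loras(lora_a, lora_b, t):
    """Linear interpolation of the two LoRA state dicts fitted to the endpoint images."""
    return {k: (1 - t) * lora_a[k] + t * lora_b[k] for k in lora_a}

def morph_frames(generate, lora_a, lora_b, z_a, z_b, num_frames=16):
    """For each frame, blend both the LoRA weights and the DDIM-inverted latents,
    then sample with the blended model."""
    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        frames.append(generate(interpolate_loras(lora_a, lora_b, t), slerp(z_a, z_b, t)))
    return frames
```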
2404.05331 Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt Zhiqi Huang, Huixin Xiong, Haoyu Wang, Longguang Wang, Zhiheng Li Text-to-image generation has witnessed great progress, especially with the recent advancements in diffusion models. Since texts cannot provide detailed conditions like object appearance, reference images are usually leveraged for the control of objects in the generated images. However, existing methods still suffer limited accuracy when the relationship between the foreground and background is complicated. To address this issue, we develop a framework termed Mask-ControlNet by introducing an additional mask prompt. Specifically, we first employ large vision models to obtain masks to segment the objects of interest in the reference image. Then, the object images are employed as additional prompts to facilitate the diffusion model to better understand the relationship between foreground and background regions during image generation. Experiments show that the mask prompts enhance the controllability of the diffusion model to maintain higher fidelity to the reference image while achieving better image quality. Comparison with previous text-to-image generation methods demonstrates our method's superior quantitative and qualitative performance on the benchmark datasets. This paper introduces Mask-ControlNet, a novel framework for enhancing text-to-image generation quality using an additional mask prompt, aiming to improve object fidelity and foreground-background harmony. This research is important because it addresses the limitations of existing text-to-image generation models in accurately replicating objects from reference images, particularly in complex compositions, and proposes a solution to enhance image quality and controllability. The authors propose a two-stage framework: 1) Training phase: They train a diffusion model with a combination of text prompts, reference images, and object masks extracted using SAM. The model learns to generate images conditioned on these inputs. 2) Inference phase: Given a reference image and a text prompt, SAM segments the object, and the model generates an image adhering to the text prompt while maintaining fidelity to the segmented object. The paper shows that using mask prompts leads to: - Improved object fidelity, preserving details and reducing distortions. - Better handling of complex foreground-background relationships, resulting in more harmonious compositions. - Quantitatively, Mask-ControlNet outperforms existing methods in FID, PSNR, SSIM, LPIPS, CLIP, and DINO scores. - Qualitatively, generated images exhibit higher visual quality and realism, as confirmed by user studies. The paper does not explicitly mention limitations or future work. However, potential areas for improvement include: - Exploring different mask generation techniques beyond SAM to handle more complex scenes and object boundaries. - Investigating the generalization ability of the model to unseen object categories and diverse datasets. - Extending the framework to allow for more fine-grained control over object placement and relationships within the generated image. diffusion_model, image_generation, object_reconstruction, mask, controllability, foreground-background, fidelity
2402.12004 Direct Consistency Optimization for Compositional Text-to-Image Personalization Kyungmin Lee, Sangkyung Kwak, Kihyuk Sohn, Jinwoo Shin Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency. However, they still lack in synthesizing images of different scenarios or styles that are possible in the original pretrained models. To address this, we propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model. We devise a novel training objective for T2I diffusion models that minimally fine-tunes the pretrained model to achieve consistency. Our method, dubbed \emph{Direct Consistency Optimization}, is as simple as regular diffusion loss, while significantly enhancing the compositionality of personalized T2I models. Also, our approach induces a new sampling method that controls the tradeoff between image fidelity and prompt fidelity. Lastly, we emphasize the necessity of using a comprehensive caption for reference images to further enhance the image-text alignment. We show the efficacy of the proposed method on the T2I personalization for subject, style, or both. In particular, our method results in a superior Pareto frontier to the baselines. Generated examples and codes are in our project page( https://dco-t2i.github.io/). This paper introduces Direct Consistency Optimization (DCO), a novel fine-tuning objective for Text-to-Image (T2I) diffusion models that improves personalized image generation by maximizing consistency to reference images while minimizing deviation from the pretrained model. This paper is important because it addresses the limitations of current personalized T2I models, which often struggle to balance subject consistency with the ability to generate diverse images in different scenarios or styles. DCO offers a more principled approach to fine-tuning, resulting in more compositional and controllable image generation. The authors formulate fine-tuning as a constrained policy optimization problem, encouraging the model to learn minimal information from reference images while retaining knowledge from the pretrained model. They derive an upper bound to this objective, leading to the DCO loss, which is as easy to implement as the standard diffusion loss. They also introduce a 'reward guidance' sampling method to control the trade-off between subject fidelity and text prompt fidelity and emphasize the importance of using comprehensive captions for reference images. DCO outperforms baselines like DreamBooth and its variants in subject and style personalization tasks. Notably, DCO generates images with higher fidelity to both subjects and input text prompts, as evidenced by quantitative metrics and qualitative examples. It also enables the seamless composition of independently fine-tuned subject and style models without requiring additional post-processing steps like ZipLoRA. The authors acknowledge the increased computational burden of DCO during both training and inference due to the additional forward passes through the pretrained model. They suggest exploring efficient fine-tuning methods to enhance scalability. Additionally, while cosine similarity was used to assess LoRA compatibility, the authors acknowledge the need for further investigation into metrics that accurately capture interference between LoRA models. 
diffusion_model, t2i, personalization, fine-tuning, compositionality, image_generation, dreambooth, lora, consistency, reward_guidance
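One plausible reading of the DCO objective (hedged, since the summary only states that the derived bound is as easy to implement as the standard diffusion loss) is a logistic penalty on the gap between the fine-tuned and frozen pretrained models' denoising errors on the reference images, so consistency is gained without drifting far from the prior. A sketch under that assumption; the exact per-timestep weighting may differ from the paper's.

```python
import torch.nn.functional as F

def dco_loss(eps_pred_finetuned, eps_pred_pretrained, eps_true, beta_t=1.0):
    """Hedged sketch of a Direct Consistency Optimization style objective.

    eps_pred_finetuned / eps_pred_pretrained: noise predictions from the fine-tuned
    and frozen pretrained models on the same noisy reference latents.
    eps_true: the noise actually added. beta_t scales the deviation penalty.
    """
    err_ft = (eps_pred_finetuned - eps_true).pow(2).flatten(1).mean(dim=1)
    err_pre = (eps_pred_pretrained - eps_true).pow(2).flatten(1).mean(dim=1)
    # Logistic penalty on how much the fine-tuned model must beat the pretrained one,
    # instead of driving its denoising error to zero as plain fine-tuning would.
    return -F.logsigmoid(-beta_t * (err_ft - err_pre)).mean()
```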
2403.14602 ReNoise: Real Image Inversion Through Iterative Noising Daniel Garibi, Or Patashnik, Andrey Voynov, Hadar Averbuch-Elor, Daniel Cohen-Or Recent advancements in text-guided diffusion models have unlocked powerful image manipulation capabilities. However, applying these methods to real images necessitates the inversion of the images into the domain of the pretrained diffusion model. Achieving faithful inversion remains a challenge, particularly for more recent models trained to generate images with a small number of denoising steps. In this work, we introduce an inversion method with a high quality-to-operation ratio, enhancing reconstruction accuracy without increasing the number of operations. Building on reversing the diffusion sampling process, our method employs an iterative renoising mechanism at each inversion sampling step. This mechanism refines the approximation of a predicted point along the forward diffusion trajectory, by iteratively applying the pretrained diffusion model, and averaging these predictions. We evaluate the performance of our ReNoise technique using various sampling algorithms and models, including recent accelerated diffusion models. Through comprehensive evaluations and comparisons, we show its effectiveness in terms of both accuracy and speed. Furthermore, we confirm that our method preserves editability by demonstrating text-driven image editing on real images. This paper proposes ReNoise, a new diffusion model inversion method that enhances reconstruction accuracy and editability, especially for recent few-step models, without increasing computational cost. This research is important because it addresses the limitations of existing inversion methods for real image editing with diffusion models, particularly in the context of few-step models which are essential for interactive editing workflows. The authors developed ReNoise, a technique based on fixed-point iteration that refines the approximation of points along the forward diffusion trajectory during the inversion process. This is achieved by iteratively renoising the latent representation using the pre-trained diffusion model and averaging the resulting predictions. They also introduce techniques to enhance editability and correct noise in non-deterministic samplers. ReNoise demonstrates superior reconstruction quality compared to existing sampler reversing methods, including DDIM inversion, for a fixed number of UNet operations. It also shows improved editability, enabling successful text-driven manipulations on real images, even with few-step models like SDXL Turbo and LCM LoRA. ReNoise is numerically stable, converges consistently, and outperforms other null-prompt inversion methods in terms of speed and accuracy. The authors acknowledge the limitation of model-specific hyperparameter tuning for edit enhancement and noise correction in ReNoise. Future work includes more extensive testing with advanced editing methods and adapting ReNoise to video diffusion models. diffusion_model, image_editing, inversion, few-step_models, analysis, ddim, sdxl turbo, lcm
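One inversion step of the ReNoise mechanism can be sketched as a small fixed-point loop: re-predict the noise at the current estimate of the noisier latent, re-apply the reversed sampler step, and average the later estimates. `eps_model` and `inverse_step` below are hypothetical wrappers around a real diffusion model and its reversed (e.g. DDIM) update; the edit-enhancement and noise-correction terms for non-deterministic samplers are omitted.

```python
import torch

@torch.no_grad()
def renoise_inversion_step(eps_model, inverse_step, z_t, t, t_next,
                           renoise_iters=4, avg_from=1):
    """One ReNoise step from timestep t to the noisier t_next.

    eps_model(z, t) -> predicted noise.
    inverse_step(z_t, eps, t, t_next) -> z_{t_next}, the reversed sampler update.
    """
    z_next = inverse_step(z_t, eps_model(z_t, t), t, t_next)   # plain inversion guess
    estimates = []
    for k in range(renoise_iters):
        eps = eps_model(z_next, t_next)          # "renoise": re-predict at the estimate
        z_next = inverse_step(z_t, eps, t, t_next)
        if k >= avg_from:
            estimates.append(z_next)
    if estimates:
        z_next = torch.stack(estimates).mean(dim=0)   # average the later estimates
    return z_next
```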
2312.13558 The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction Pratyusha Sharma, Jordan T. Ash, Dipendra Misra Transformer-based Large Language Models (LLMs) have become a fixture in modern machine learning. Correspondingly, significant resources are allocated towards research that aims to further advance this technology, typically resulting in models of increasing size that are trained on increasing amounts of data. This work, however, demonstrates the surprising result that it is often possible to significantly improve the performance of LLMs by selectively removing higher-order components of their weight matrices. This simple intervention, which we call LAyer-SElective Rank reduction (LASER), can be done on a model after training has completed, and requires no additional parameters or data. We show extensive experiments demonstrating the generality of this finding across language models and datasets, and provide in-depth analyses offering insights into both when LASER is effective and the mechanism by which it operates. This paper introduces LAyer-SElective Rank reduction (LASER), a technique for improving the performance of Large Language Models (LLMs) by selectively removing higher-order components from weight matrices in specific layers. The paper is important because it challenges the conventional belief that larger models always perform better. It demonstrates a simple yet effective method to enhance LLM accuracy on various NLP and even reinforcement learning tasks without requiring additional training data or parameters. The authors apply LASER by using Singular Value Decomposition (SVD) to identify and remove higher-order components from specific weight matrices of pre-trained LLMs. They experiment with different layers and reduction percentages, evaluating the impact on accuracy and other metrics across various datasets and LLM architectures. LASER significantly improves accuracy on several NLP tasks, especially those involving less frequent information in the training data. For instance, GPT-J's accuracy on the CounterFact dataset increased from 13.3% to 24.1%. The technique also enhances robustness to paraphrases. Notably, LASER even benefits a Decision Transformer agent in a Sokoban environment, hinting at broader applicability beyond NLP. The authors acknowledge limitations and propose future work on: (1) understanding why higher-order components accumulate noisy answers during training, (2) investigating the effect of model architecture on LASER's effectiveness, and (3) explaining the specific benefit of pruning later MLP layers. Further research is needed to explore alternative pruning methods and analyze the impact of LASER on language modeling and fluency in detail. llm, analysis, svd, pruning, rank_reduction, question_answering, factuality, decision_transformer, reinforcement_learning
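LASER itself is a one-matrix operation: take the SVD of a chosen weight matrix (the paper finds later MLP layers most useful), keep only the top singular components, and write the low-rank reconstruction back, with no retraining. A minimal PyTorch sketch; the module path in the usage comment is hypothetical.

```python
import torch

def laser_reduce(weight, keep_fraction=0.1):
    """Rank-reduce a single 2-D weight matrix by keeping only its top singular components."""
    u, s, vh = torch.linalg.svd(weight.float(), full_matrices=False)
    k = max(1, int(keep_fraction * s.numel()))
    approx = (u[:, :k] * s[:k]) @ vh[:k, :]   # low-rank reconstruction
    return approx.to(weight.dtype)

# Usage sketch: pick a layer/matrix, replace its weights in place, and re-evaluate.
# with torch.no_grad():
#     layer = model.transformer.h[27].mlp.fc_in        # hypothetical module path
#     layer.weight.copy_(laser_reduce(layer.weight, keep_fraction=0.05))
```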
2402.06196 Large Language Models: A Survey Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data, as predicted by scaling laws \cite{kaplan2020scaling,hoffmann2022training}. The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build, and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions. This paper presents a survey of Large Language Models (LLMs), covering their evolution from early neural language models, prominent LLM families (GPT, LLaMA, PaLM), techniques for building and augmenting LLMs, popular datasets and benchmarks, and an overview of performance comparisons. This paper is important due to the rapid evolution and increasing influence of LLMs in various domains. It provides a comprehensive overview of LLM advancements, techniques, and challenges, serving as a valuable resource for researchers and practitioners seeking to understand and utilize LLMs effectively. The paper conducts a literature review, summarizing key findings and advancements in the field of LLMs. It analyzes prominent LLM architectures, pre-training methods, fine-tuning and alignment techniques, and prompt engineering strategies. Additionally, it reviews popular datasets and benchmarks used for LLM evaluation, comparing the performance of notable models. The survey highlights the impressive performance and capabilities of LLMs across various NLP tasks, including commonsense reasoning, code generation, and question answering. It showcases the benefits of prompt engineering techniques like Chain of Thought (CoT), Retrieval Augmented Generation (RAG), and the use of external tools to augment LLM functionality. The paper also emphasizes the importance of addressing challenges like hallucination, ethical concerns, and the need for smaller and more efficient LLM models. The paper identifies several challenges and future research directions for LLMs, including the development of smaller and more efficient models, exploring new post-attention architectural paradigms, enhancing multi-modal capabilities, improving LLM usage and augmentation techniques, and addressing security and ethical concerns. It emphasizes the need for continued research in these areas to unlock the full potential of LLMs while mitigating their limitations. llm, survey, gpt, llama, palm, transformer, pre-training, fine-tuning, alignment, prompt_engineering, rag, hallucination, ethical_ai, multi-modal, analysis, literature_review, code_generation, reasoning
2311.15657 Enhancing Diffusion Models with Text-Encoder Reinforcement Learning Chaofeng Chen, Annan Wang, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, Weisi Lin Text-to-image diffusion models are typically trained to optimize the log-likelihood objective, which presents challenges in meeting specific requirements for downstream tasks, such as image aesthetics and image-text alignment. Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation. However, many of them overlook the importance of the text encoder, which is typically pretrained and fixed during training. In this paper, we demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results, thereby improving the visual quality. Our primary motivation comes from the observation that the current text encoder is suboptimal, often requiring careful prompt adjustment. While fine-tuning the U-Net can partially improve performance, it remains suffering from the suboptimal text encoder. Therefore, we propose to use reinforcement learning with low-rank adaptation to finetune the text encoder based on task-specific rewards, referred as \textbf{TexForce}. We first show that finetuning the text encoder can improve the performance of diffusion models. Then, we illustrate that TexForce can be simply combined with existing U-Net finetuned models to get much better results without additional training. Finally, we showcase the adaptability of our method in diverse applications, including the generation of high-quality face and hand images. This paper introduces TexForce, a novel method to improve text-to-image diffusion models by fine-tuning the text encoder using reinforcement learning with low-rank adaptation (LoRA) and task-specific rewards, leading to better text-image alignment and higher visual quality. This paper addresses the limitation of previous diffusion model fine-tuning methods that solely focus on the U-Net, neglecting the importance of the text encoder. It demonstrates that fine-tuning the text encoder is crucial for aligning generated images with text prompts, especially with limited training data, and shows its efficacy across different tasks and backbones. The authors propose TexForce, which employs reinforcement learning, particularly the DDPO algorithm, to update the text encoder by maximizing task-specific rewards for generated images. They utilize LoRA for efficient fine-tuning and demonstrate its flexibility by combining LoRA weights from different tasks. Experiments are conducted with various prompt datasets, reward functions (ImageReward, HPSv2, face quality, hand detection confidence), and diffusion model backbones (SDv1.4, SDv1.5, SDv2.1). TexForce significantly enhances text-image alignment and visual quality across various tasks, outperforming existing methods like DPOK, ReFL, and AlignProp. It shows robust performance on different backbones and the capability to combine with U-Net fine-tuning for further improvement. GPT-4V evaluation confirms its effectiveness in both aesthetics and text-coherence. Furthermore, the fusion of LoRA weights enables enhancement of specific objects within generated images. The authors acknowledge limitations regarding sample efficiency and complexity of reward function engineering inherent to RL-based methods. They also raise concerns about potential misuse for misinformation and intellectual property infringement. 
Future work could address these limitations and explore broader applications of TexForce. diffusion_model, text-to-image, reinforcement_learning, lora, text-image_alignment, image_quality, gpt-4v, face_generation, hand_generation
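For reference, a minimal sketch of a LoRA-style low-rank adapter of the kind TexForce attaches to the text encoder (a generic PyTorch module, not the authors' implementation). Only A and B are trained, e.g. by the RL objective, while the base projection stays frozen; zero-initializing B means the adapter starts as a no-op.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
        def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                      # base weights stay frozen
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no change at start
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)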
2309.03904 Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis Jiapeng Zhu, Ceyuan Yang, Kecheng Zheng, Yinghao Xu, Zifan Shi, Yujun Shen Due to the difficulty in scaling up, generative adversarial networks (GANs) seem to be falling from grace on the task of text-conditioned image synthesis. Sparsely-activated mixture-of-experts (MoE) has recently been demonstrated as a valid solution to training large-scale models with limited computational resources. Inspired by such a philosophy, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router to help select the most suitable expert for each feature point. To faithfully decode the sampling stochasticity and the text condition to the final synthesis, our router adaptively makes its decision by taking into account the text-integrated global latent code. At 64x64 image resolution, our model trained on LAION2B-en and COYO-700M achieves 6.2 zero-shot FID on MS COCO. We release the code and checkpoints to facilitate the community for further development. This paper introduces Aurora, a text-to-image GAN model that leverages Sparse Mixture of Experts (MoE) to enhance model capacity and generate high-quality images from text descriptions. This paper is important because it addresses the limitations of GANs in text-to-image synthesis, particularly their difficulty in scaling up to handle complex datasets and open-vocabulary text prompts. By incorporating Sparse MoE, Aurora achieves comparable performance to diffusion models while maintaining faster generation speeds. The release of their code and checkpoints also provides a valuable resource for the research community to further explore and advance text-to-image generation with GANs. The authors developed Aurora, a GAN-based text-to-image generator, incorporating a Sparse Mixture of Experts (MoE) approach. The generator uses CLIP to encode the input text and a mapping network to process both the text and a latent code. A series of generative blocks, each with a convolution block and an attention block, progressively increase the resolution of the generated image. The attention block employs MoE, utilizing a sparse router to select the most appropriate expert for each feature point based on both the input feature and text information. The model is trained progressively on LAION2B-en and COYO-700M datasets using a combination of adversarial loss, matching-aware loss, multi-level CLIP loss, and MoE loss. The authors use reference FID scores as an indicator to transition between training stages at different image resolutions. Aurora achieves a 6.2 zero-shot FID score on MS COCO at 64x64 resolution, demonstrating its capability for open-vocabulary text-to-image synthesis. The authors also found that their sparse router effectively clusters pixels with similar visual concepts. Interestingly, they observed unexpected behavior during latent space interpolation, suggesting a potential research direction in disentangling text conditions and sampling stochasticity. The paper acknowledges limitations in latent space interpolation, attributing them to the absence of perceptual path length regularization and potential dominance of text tokens over the global latent code. Future work includes investigating these issues, exploring better text information injection methods, and improving the model's performance and functionality using cleaner, higher-quality datasets. 
gan, text-to-image, image_synthesis, sparse_moe, attention, latent_space, open-vocabulary
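A minimal sketch of a sparsely-activated expert layer with a top-1 router conditioned on a text-integrated latent code, in the spirit of the attention-block experts described above (hypothetical module and shapes, not the released Aurora code).

    import torch
    import torch.nn as nn

    class SparseMoE(nn.Module):
        """Each feature point is routed to a single expert; the routing decision
        also sees a global, text-integrated latent code."""
        def __init__(self, dim: int, cond_dim: int, n_experts: int = 8):
            super().__init__()
            self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
            self.router = nn.Linear(dim + cond_dim, n_experts)

        def forward(self, feats, cond):
            # feats: (N, dim) feature points; cond: (cond_dim,) global latent code
            cond = cond.expand(feats.shape[0], -1)
            gates = self.router(torch.cat([feats, cond], dim=-1)).softmax(dim=-1)
            idx = gates.argmax(dim=-1)                       # top-1 expert per feature point
            out = torch.zeros_like(feats)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # scale by the gate value so the router still receives gradients
                    out[mask] = expert(feats[mask]) * gates[mask, e].unsqueeze(-1)
            return out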
2404.14367 Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar Learning from preference labels plays a crucial role in fine-tuning large language models. There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning. Different methods come with different implementation tradeoffs and performance differences, and existing empirical findings present different conclusions, for instance, some results show that online RL is quite important to attain good fine-tuning results, while others find (offline) contrastive or even purely supervised methods sufficient. This raises a natural question: what kind of approaches are important for fine-tuning with preference data and why? In this paper, we answer this question by performing a rigorous analysis of a number of fine-tuning techniques on didactic and full-scale LLM problems. Our main finding is that, in general, approaches that use on-policy sampling or attempt to push down the likelihood on certain responses (i.e., employ a "negative gradient") outperform offline and maximum likelihood objectives. We conceptualize our insights and unify methods that use on-policy sampling or negative gradient under a notion of mode-seeking objectives for categorical distributions. Mode-seeking objectives are able to alter probability mass on specific bins of a categorical distribution at a fast rate compared to maximum likelihood, allowing them to relocate masses across bins more effectively. Our analysis prescribes actionable insights for preference fine-tuning of LLMs and informs how data should be collected for maximal improvement. This paper investigates the effectiveness of different fine-tuning methods for large language models (LLMs) on tasks involving binary preferences, particularly focusing on the roles of on-policy sampling and negative gradients. This paper provides clarity on the effectiveness and trade-offs of various LLM fine-tuning methods, guiding practitioners in selecting the best approach for their specific preference optimization problem. It unifies seemingly distinct notions of on-policy sampling and negative gradients under the concept of mode-seeking objectives, which helps in understanding the behavior of different algorithms. The authors conduct a rigorous empirical study using a variety of tasks, including didactic bandit problems, synthetic LLM problems with hand-crafted reward functions, and full-scale LLM fine-tuning problems with real human preference data from AlpacaFarm and UltraFeedback. They analyze the performance of different algorithms (PPO, REINFORCE, DPO, IPO, RWR, Pref-FT, Best-of-N) by varying the degree of on-policy sampling and use of negative gradients. The key findings are that on-policy sampling significantly improves performance and efficiency, especially when the reward peak is far from the reference policy. Negative gradients are also beneficial, leading to faster convergence, and complement on-policy sampling. The study finds that both techniques are unified by the concept of mode-seeking divergences, which prioritize sharpening probability mass on high-reward regions, as opposed to mode-covering objectives like maximum likelihood. 
The paper acknowledges limitations in terms of lacking rigorous statistical guarantees for the observed benefits of on-policy sampling and negative gradients. Future work could involve formalizing these benefits statistically. Further exploration could incorporate the role of pre-training distribution coverage, reward model quality, and recent minimax formulations in preference optimization. llm, analysis, fine-tuning, preference_learning, reinforcement_learning, contrastive_learning, on-policy, negative_gradient, mode-seeking
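As a concrete reference for the contrastive, "negative gradient" family analyzed in this study, a minimal sketch of the standard DPO objective (inputs are assumed to be per-response total log-probabilities; this is not tied to the paper's exact training code).

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        """Direct Preference Optimization over a batch of preference pairs.
        Inputs are total log-probabilities of the chosen/rejected responses under the
        policy and a frozen reference model. Pushing down the rejected response's
        likelihood is the 'negative gradient' discussed above."""
        chosen_reward = beta * (logp_chosen - ref_logp_chosen)
        rejected_reward = beta * (logp_rejected - ref_logp_rejected)
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()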
2308.12605 APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency Yupu Yao, Shangqi Deng, Zihan Cao, Harry Zhang, Liang-Jian Deng Diffusion models have exhibited promising progress in video generation. However, they often struggle to retain consistent details within local regions across frames. One underlying cause is that traditional diffusion models approximate Gaussian noise distribution by utilizing predictive noise, without fully accounting for the impact of inherent information within the input itself. Additionally, these models emphasize the distinction between predictions and references, neglecting information intrinsic to the videos. To address this limitation, inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach only necessitates a single video as input and builds upon pre-trained stable diffusion networks. Notably, we introduce an additional compact network, known as the Video Generation Transformer (VGT). This auxiliary component is designed to extract perturbations from the inherent information contained within the input, thereby refining inconsistent pixels during temporal predictions. We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video. Experiments demonstrate a noticeable improvement in the consistency of the generated videos both qualitatively and quantitatively. This paper introduces APLA, a novel text-to-video generation network structure based on diffusion models, which leverages an additional compact network called Video Generation Transformer (VGT) to enhance the consistency of generated videos by extracting and utilizing inherent information from the input video. This paper addresses the limitations of existing video generation diffusion models in maintaining consistency across frames, particularly in retaining local details. It proposes a novel approach using VGT and adversarial training to improve the temporal coherence and overall quality of generated videos, marking a significant step towards high-fidelity video generation. The authors propose APLA, which adds VGT on top of pre-trained diffusion models. VGT, designed in two variants (pure Transformer decoder and a hybrid with 3D convolution), extracts inherent information from the input video. The authors introduce a hyper-loss function combining MSE, L1, and perceptual loss for better latent noise fitting. Furthermore, they incorporate adversarial training with a 1x1 convolutional discriminator to enhance the robustness and quality of the generated videos. Experiments were conducted on the DAVIS dataset, comparing APLA with existing methods using CLIP score and FCI metrics. Ablation studies were also performed to evaluate the impact of each component in APLA. APLA demonstrates superior performance in generating consistent and high-quality videos compared to existing methods. Notably, it shows significant improvement in retaining local details across frames, addressing a key limitation of previous diffusion models. Quantitative evaluations using CLIP score and FCI confirm APLA's enhanced content and frame consistency, achieving state-of-the-art results. 
Ablation studies confirm that each component of APLA contributes to the overall performance, with the full model achieving the best results, showcasing the effectiveness of combining VGT, hyper-loss, and adversarial training. The authors acknowledge limitations regarding the computational cost of APLA, which requires more time for inference compared to some existing methods. For future work, exploring more efficient architectures for VGT to reduce computational complexity is suggested. Additionally, investigating the generalization capabilities of APLA on a wider range of datasets and exploring its application to other video generation tasks, such as video prediction or video editing, could be promising directions. diffusion_model, video, generation, t2v, consistency, transformer, adversarial_training
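A minimal sketch of a combined reconstruction objective of the kind APLA's hyper-loss describes (MSE + L1 + a perceptual term); the feature extractor and the loss weights are illustrative assumptions, not the paper's exact settings.

    import torch
    import torch.nn.functional as F

    def hyper_loss(pred, target, feat_extractor, w_mse=1.0, w_l1=1.0, w_perc=0.1):
        """Weighted sum of MSE, L1 and a perceptual (feature-space) distance.
        feat_extractor is an assumed frozen network mapping inputs to features."""
        loss = w_mse * F.mse_loss(pred, target) + w_l1 * F.l1_loss(pred, target)
        with torch.no_grad():
            target_feats = feat_extractor(target)
        return loss + w_perc * F.mse_loss(feat_extractor(pred), target_feats)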
2403.13807 Editing Massive Concepts in Text-to-Image Diffusion Models Tianwei Xiong, Yue Wu, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu Text-to-image diffusion models suffer from the risk of generating outdated, copyrighted, incorrect, and biased content. While previous methods have mitigated the issues on a small scale, it is essential to handle them simultaneously in larger-scale real-world scenarios. We propose a two-stage method, Editing Massive Concepts In Diffusion Models (EMCID). The first stage performs memory optimization for each individual concept with dual self-distillation from text alignment loss and diffusion noise prediction loss. The second stage conducts massive concept editing with multi-layer, closed form model editing. We further propose a comprehensive benchmark, named ImageNet Concept Editing Benchmark (ICEB), for evaluating massive concept editing for T2I models with two subtasks, free-form prompts, massive concept categories, and extensive evaluation metrics. Extensive experiments conducted on our proposed benchmark and previous benchmarks demonstrate the superior scalability of EMCID for editing up to 1,000 concepts, providing a practical approach for fast adjustment and re-deployment of T2I diffusion models in real-world applications. This paper introduces EMCID, a two-stage method for editing large numbers of concepts in text-to-image diffusion models, addressing issues like outdated information, biases, and copyright infringement. The paper is important because it offers a practical solution to mitigate problematic content generation in large diffusion models, which is crucial for their safe and responsible deployment in real-world applications. EMCID first optimizes individual concept representations in the text encoder using dual self-distillation from text alignment and noise prediction losses. The second stage then aggregates these optimized representations and edits multiple layers of the model using a closed-form solution. EMCID demonstrates superior scalability compared to previous methods, successfully editing up to 1,000 concepts while preserving the model's generation quality. It excels in updating, erasing, and rectifying concepts, as evidenced by extensive evaluations on the proposed ImageNet Concept Editing Benchmark (ICEB) and other benchmarks. The authors acknowledge that EMCID might not effectively eliminate NSFW content generation, particularly from prompts with low toxicity. Future work could focus on addressing this limitation, potentially by combining EMCID with methods targeting other parts of the diffusion model. diffusion_model, concept_editing, text-to-image, model_editing, large_scale, interpretability
2403.02580 What do we learn from inverting CLIP models? Hamid Kazemi, Atoosa Chegini, Jonas Geiping, Soheil Feizi, Tom Goldstein We employ an inversion-based approach to examine CLIP models. Our examination reveals that inverting CLIP models results in the generation of images that exhibit semantic alignment with the specified target prompts. We leverage these inverted images to gain insights into various aspects of CLIP models, such as their ability to blend concepts and inclusion of gender biases. We notably observe instances of NSFW (Not Safe For Work) images during model inversion. This phenomenon occurs even for semantically innocuous prompts, like "a beautiful landscape," as well as for prompts involving the names of celebrities. This paper investigates the inner workings and potential biases of CLIP models by employing an inversion-based approach, generating images from text prompts to analyze CLIP's understanding of concepts, gender, and its proclivity to produce NSFW content. This research is crucial as it provides insights into the often opaque training data and potential biases of widely used CLIP models, particularly highlighting the risk of generating NSFW content even from innocuous prompts, which has significant implications for downstream applications like text-to-image generation. The authors invert CLIP models by optimizing images to closely align with given text prompts, utilizing techniques like random augmentations, ensembling, and regularization. They analyze the generated images for their ability to blend concepts, the presence of NSFW content, gender biases, and the impact of training data scale. The study reveals that CLIP models can blend concepts effectively, often producing recognizable images from celebrity names. However, it also uncovers a concerning tendency to generate NSFW imagery, even from seemingly harmless prompts, including those related to landscapes and certain celebrities. This suggests the presence of a significant amount of NSFW content in the training data. Additionally, the research exposes gender biases within CLIP, as it associates specific professions and social statuses with particular genders. Lastly, it demonstrates that the scale of the training data directly influences the quality of the generated images, with larger datasets yielding better results. The authors acknowledge the limitation of using generative methods to analyze a model not typically used for generation. Future work could involve exploring alternative methods to confirm these findings. Furthermore, the study emphasizes the need for better data filtering and curation during CLIP training to mitigate the generation of NSFW content and address inherent biases. Investigating methods to address the proximity of specific prompts to NSFW words in the embedding space is also crucial. clip, analysis, nsfw, gender bias, model inversion, interpretability
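A minimal sketch of prompt-conditioned CLIP inversion under stated assumptions: clip_image_encode is any callable mapping images to L2-normalized embeddings (e.g. wrapped from open_clip), and the random crops and pixel penalty are simplified stand-ins for the paper's augmentation ensemble and regularization.

    import torch
    import torch.nn.functional as F

    def invert_clip(clip_image_encode, text_embedding, steps=500, lr=0.05, n_aug=8):
        """Optimize pixels so augmented CLIP image embeddings match a target text embedding.
        text_embedding: assumed L2-normalized embedding of the prompt, shape (d,)."""
        img = torch.randn(1, 3, 224, 224, requires_grad=True)
        opt = torch.optim.Adam([img], lr=lr)
        for _ in range(steps):
            crops = []
            for _ in range(n_aug):  # random crops as a crude augmentation stand-in
                i, j = torch.randint(0, 32, (2,)).tolist()
                crop = img[:, :, i:i + 192, j:j + 192]
                crops.append(F.interpolate(crop, size=224, mode="bilinear", align_corners=False))
            emb = clip_image_encode(torch.cat(crops, dim=0))
            loss = -(emb @ text_embedding).mean()        # maximize image-text similarity
            loss = loss + 1e-4 * img.square().mean()     # light pixel regularization
            opt.zero_grad()
            loss.backward()
            opt.step()
        return img.detach().clamp(-1.0, 1.0)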
2401.10020 Self-Rewarding Language Models Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes. This paper introduces Self-Rewarding Language Models, in which the language model itself provides its training rewards via LLM-as-a-Judge prompting and is improved through Iterative DPO, so that instruction following and self-rewarding ability improve together. This work is important because it sidesteps the bottleneck of frozen reward models trained on human preferences: if the model can judge its own outputs, the quality of the reward signal can keep improving alongside the policy rather than being capped at human performance level. The authors have the model generate candidate responses to prompts, score those responses with an LLM-as-a-Judge prompt, form preference pairs from the self-assigned scores, and train on them with DPO; this generate-judge-train loop is repeated for several iterations, each starting from the previous iteration's model. Fine-tuning Llama 2 70B through three iterations of this procedure yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613, and the model's ability to provide high-quality rewards to itself also improves across iterations. The authors note that much is left to explore, including how far further self-improvement iterations can be pushed and how reliable self-provided rewards remain as the model improves; they frame the work as opening the door to models that can continually improve in both instruction following and reward modeling. llm, alignment, self-rewarding, llm-as-a-judge, dpo, preference_learning, instruction_following, iterative_training
2309.07906 Generative Image Dynamics Zhengqi Li, Richard Tucker, Noah Snavely, Aleksander Holynski We present an approach to modeling an image-space prior on scene motion. Our prior is learned from a collection of motion trajectories extracted from real video sequences depicting natural, oscillatory dynamics such as trees, flowers, candles, and clothes swaying in the wind. We model this dense, long-term motion prior in the Fourier domain: given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a spectral volume, which can be converted into a motion texture that spans an entire video. Along with an image-based rendering module, these trajectories can be used for a number of downstream applications, such as turning still images into seamlessly looping videos, or allowing users to realistically interact with objects in real pictures by interpreting the spectral volumes as image-space modal bases, which approximate object dynamics. This paper introduces a novel method for animating still images by predicting realistic, oscillatory motion using a learned image-space prior on scene dynamics. This work is significant because it addresses the challenge of synthesizing realistic and temporally coherent motion in videos generated from single images, which is crucial for creating believable visual content. The authors leverage spectral volumes, a frequency-domain representation of motion, and train a latent diffusion model to predict these volumes from single images. They then use an image-based rendering module to animate the input image according to the predicted motion. The paper demonstrates superior quantitative and qualitative results compared to existing single-image animation methods, showing more realistic and temporally consistent video generation. The authors also showcase applications like seamless looping video generation and creating interactive dynamic images from single pictures. The authors acknowledge limitations in modeling non-oscillatory or high-frequency motions, and potential issues with thin objects or large displacements. Future work could explore learned motion bases, handle complex motion patterns, and address challenges in generating unseen content. diffusion_model, motion, video, analysis, 3d
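A minimal sketch of turning a predicted spectral volume into per-pixel motion trajectories, assuming the model outputs complex Fourier coefficients for the first K temporal frequencies at every pixel (the array layout is illustrative, not the paper's exact representation).

    import numpy as np

    def spectral_volume_to_motion(coeffs, n_frames):
        """coeffs: complex array of shape (K, H, W, 2) holding the first K Fourier
        coefficients of each pixel's x/y displacement (requires K <= n_frames//2 + 1).
        Returns a motion texture of shape (n_frames, H, W, 2) via an inverse real FFT."""
        K, H, W, _ = coeffs.shape
        spectrum = np.zeros((n_frames // 2 + 1, H, W, 2), dtype=np.complex64)
        spectrum[:K] = coeffs                      # only low frequencies are modeled
        # inverse FFT along the temporal axis gives displacement over time
        return np.fft.irfft(spectrum, n=n_frames, axis=0)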
2312.12148 Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, Fu Lee Wang With the continuous growth in the number of parameters of transformer-based pretrained language models (PLMs), particularly the emergence of large language models (LLMs) with billions of parameters, many natural language processing (NLP) tasks have demonstrated remarkable success. However, the enormous size and computational demands of these models pose significant challenges for adapting them to specific downstream tasks, especially in environments with limited computational resources. Parameter Efficient Fine-Tuning (PEFT) offers an effective solution by reducing the number of fine-tuning parameters and memory usage while achieving comparable performance to full fine-tuning. The demands for fine-tuning PLMs, especially LLMs, have led to a surge in the development of PEFT methods, as depicted in Fig. 1. In this paper, we present a comprehensive and systematic review of PEFT methods for PLMs. We summarize these PEFT methods, discuss their applications, and outline future directions. Furthermore, we conduct experiments using several representative PEFT methods to better understand their effectiveness in parameter efficiency and memory efficiency. By offering insights into the latest advancements and practical applications, this survey serves as an invaluable resource for researchers and practitioners seeking to navigate the challenges and opportunities presented by PEFT in the context of PLMs. This paper presents a comprehensive review and assessment of Parameter-Efficient Fine-Tuning (PEFT) methods for Pretrained Language Models (PLMs), focusing on their effectiveness in reducing trainable parameters and memory usage while maintaining comparable performance to full fine-tuning. This paper is important because it addresses the challenges of adapting large language models (LLMs) with billions of parameters to specific downstream tasks, especially given limited computational resources, by providing a systematic overview of PEFT methods and evaluating their performance across different tasks and models. The authors conducted their research by categorizing PEFT methods into five groups: additive fine-tuning, partial fine-tuning, reparameterized fine-tuning, hybrid fine-tuning, and unified fine-tuning. They then conducted experiments using eleven representative PEFT methods on three different types of PLMs (RoBERTa, T5, and LLaMA) across NLU, MT, and NLG tasks, evaluating their performance and memory usage. Key findings include: (1) Most PEFT methods achieve comparable or better performance than full fine-tuning on the GLUE benchmark while significantly reducing the number of trainable parameters. (2) ProPELT adapter achieves the best average performance with only 1.5% of trainable parameters compared to full fine-tuning. (3) QLoRA significantly reduces GPU memory consumption, enabling fine-tuning of LLaMA with limited resources. (4) The effectiveness of PEFT methods in reducing memory usage increases with larger model sizes. The paper highlights several limitations and future directions, including: (1) Exploring lightweight hybrid PEFT methods that combine multiple PEFT methods for better performance with minimal parameter increase. (2) Developing more LoRA-derived PEFT methods, focusing on pruning and weight quantization to optimize storage and computation. 
(3) Expanding the PEFT library by integrating additional PEFT methods for wider application. (4) Conducting further theoretical studies to understand the underlying mechanisms of PEFT methods. (5) Exploring the application of PEFT methods in computer vision and multimodal learning. peft, llm, fine-tuning, parameter_efficiency, memory_efficiency, adapter, lora, prompt-tuning, prefix-tuning, analysis, literature_review, nlu, mt, nlg
2311.03648 Instruct Me More! Random Prompting for Visual In-Context Learning Jiahao Zhang, Bowen Wang, Liangzhi Li, Yuta Nakashima, Hajime Nagahara Large-scale models trained on extensive datasets, have emerged as the preferred approach due to their high generalizability across various tasks. In-context learning (ICL), a popular strategy in natural language processing, uses such models for different tasks by providing instructive prompts but without updating model parameters. This idea is now being explored in computer vision, where an input-output image pair (called an in-context pair) is supplied to the model with a query image as a prompt to exemplify the desired output. The efficacy of visual ICL often depends on the quality of the prompts. We thus introduce a method coined Instruct Me More (InMeMo), which augments in-context pairs with a learnable perturbation (prompt), to explore its potential. Our experiments on mainstream tasks reveal that InMeMo surpasses the current state-of-the-art performance. Specifically, compared to the baseline without learnable prompt, InMeMo boosts mIoU scores by 7.35 and 15.13 for foreground segmentation and single object detection tasks, respectively. Our findings suggest that InMeMo offers a versatile and efficient way to enhance the performance of visual ICL with lightweight training. Code is available at https://github.com/Jackieam/InMeMo. This paper introduces Instruct Me More (InMeMo), a novel visual in-context learning method that enhances the performance of large-scale vision models by adding a learnable perturbation to in-context image pairs, thereby improving their instructive quality for downstream tasks like segmentation and object detection. This paper is important because it addresses the limitations of existing visual in-context learning approaches that heavily rely on the quality and similarity of in-context pairs to query images. By introducing a learnable prompt, InMeMo improves the performance of visual in-context learning in a lightweight and efficient manner, achieving state-of-the-art results on benchmark tasks. InMeMo first retrieves an in-context image pair similar to the query image. It then amends the pair with a learnable prompt enhancer module, which is trained to optimize the in-context pair for the specific downstream task. The enhanced pair, along with the query image, are then fed into a frozen pre-trained large-scale vision model (MAE-VQGAN) to generate a prediction for the given task. The prompt enhancer is trained in a supervised manner using cross-entropy loss on visual tokens, aiming to minimize the difference between predicted and ground-truth labels. InMeMo achieves state-of-the-art results on foreground segmentation and single object detection tasks, surpassing previous visual in-context learning methods. It demonstrates robustness to domain shift and significant performance improvement even with limited training data. The paper provides extensive qualitative and quantitative results, demonstrating the efficacy of InMeMo in capturing fine-grained details and handling variations in image characteristics. The paper acknowledges that InMeMo requires a minimum amount of training data per class to outperform the baseline. Additionally, the learnable prompt's generalizability to unseen classes is limited, necessitating task-specific training. Future work could focus on improving the generalizability of the learnable prompt and exploring its application in other downstream tasks. 
in-context learning, visual prompting, foreground segmentation, object detection, parameter-efficient transfer learning, domain shift, mae-vqgan
2404.07984 View Selection for 3D Captioning via Diffusion Ranking Tiange Luo, Justin Johnson, Honglak Lee Scalable annotation approaches are crucial for constructing extensive 3D-text datasets, facilitating a broader range of applications. However, existing methods sometimes lead to the generation of hallucinated captions, compromising caption quality. This paper explores the issue of hallucination in 3D object captioning, with a focus on Cap3D method, which renders 3D objects into 2D views for captioning using pre-trained models. We pinpoint a major challenge: certain rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations. To tackle this, we present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views, where the view with high alignment closely represent the object's characteristics. By ranking all rendered views and feeding the top-ranked ones into GPT4-Vision, we enhance the accuracy and detail of captions, enabling the correction of 200k captions in the Cap3D dataset and extending it to 1 million captions across Objaverse and Objaverse-XL datasets. Additionally, we showcase the adaptability of DiffuRank by applying it to pre-trained text-to-image models for a Visual Question Answering task, where it outperforms the CLIP model. This paper tackles the issue of hallucination in 3D object captioning, particularly in the Cap3D method, by introducing DiffuRank, a technique that uses a pre-trained text-to-3D model to rank rendered 2D views of 3D objects based on their alignment with the object's characteristics, resulting in more accurate and detailed captions. This work is important because it addresses a key challenge in building large-scale 3D-text datasets: the generation of inaccurate captions due to the limitations of existing captioning models when presented with atypical or challenging views of 3D objects. By improving the accuracy and richness of 3D captions, this work can significantly benefit various 3D-related applications, including text-to-3D generation, image-to-3D conversion, robot learning, and 3D language model pre-training. The authors developed DiffuRank, an algorithm that leverages a pre-trained text-to-3D diffusion model to assess the alignment between different rendered 2D views of a 3D object and the object itself. They generated multiple captions for each view using an image captioning model and fed them into the diffusion model alongside the 3D object's features. By ranking the views based on their average score (loss) in the diffusion model, they identified the views that best represent the object's 3D information. These top-ranked views were then passed to GPT4-Vision for generating the final captions. The authors demonstrate that DiffuRank, in conjunction with GPT4-Vision, significantly improves the quality of captions for 3D objects. Key findings include: (1) DiffuRank effectively reduces hallucinations in captions, as evidenced by human studies and automated metrics. (2) Captions generated using DiffuRank are richer in detail and more accurate compared to those produced using all rendered views or a fixed set of horizontally placed views. (3) Using fewer but more informative views selected by DiffuRank can lead to better captions than using a large number of views indiscriminately. 
(4) DiffuRank can be extended to 2D domains and has shown promising results in Visual Question Answering tasks, outperforming CLIP on a challenging benchmark. The authors acknowledge the limitations of DiffuRank, particularly its computational cost due to the need for rendering multiple views, generating captions for each view, and running inference through a diffusion model. The speed of DiffuRank is a bottleneck, especially for tasks involving numerous options, such as classification or image-text retrieval. Future work could focus on improving the efficiency of DiffuRank to make it more scalable for such tasks. Additionally, the authors suggest exploring the use of even more powerful text-to-3D and captioning models to further enhance the accuracy and detail of the generated captions. Expanding the dataset to encompass all of Objaverse-XL is another avenue for future work. diffusion_model, llm, 3d, captioning, hallucination, view_selection, dataset, objaverse, gpt4-vision, visual_question_answering
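A minimal sketch of the DiffuRank ranking step under stated assumptions: diffusion_loss(object_latent, caption) is a hypothetical callable returning the pre-trained text-to-3D model's denoising loss, and each rendered view comes with several candidate captions from an image captioner.

    def rank_views_by_alignment(views_to_captions, diffusion_loss, object_latent):
        """Score each rendered view by the average diffusion loss of its captions
        against the 3D object; lower loss means better 2D/3D alignment.
        views_to_captions: dict mapping view_id -> list of caption strings."""
        scores = {}
        for view_id, captions in views_to_captions.items():
            losses = [diffusion_loss(object_latent, caption) for caption in captions]
            scores[view_id] = sum(losses) / len(losses)
        # the top-ranked (lowest-loss) views are the ones passed to GPT4-Vision
        return sorted(scores, key=scores.get)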
2404.02285 LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP Yunshi Huang, Fereshteh Shakeri, Jose Dolz, Malik Boudiaf, Houda Bahig, Ismail Ben Ayed In a recent, strongly emergent literature on few-shot CLIP adaptation, Linear Probe (LP) has been often reported as a weak baseline. This has motivated intensive research building convoluted prompt learning or feature adaptation strategies. In this work, we propose and examine from convex-optimization perspectives a generalization of the standard LP baseline, in which the linear classifier weights are learnable functions of the text embedding, with class-wise multipliers blending image and text knowledge. As our objective function depends on two types of variables, i.e., the class visual prototypes and the learnable blending parameters, we propose a computationally efficient block coordinate Majorize-Minimize (MM) descent algorithm. In our full-batch MM optimizer, which we coin LP++, step sizes are implicit, unlike standard gradient descent practices where learning rates are intensively searched over validation sets. By examining the mathematical properties of our loss (e.g., Lipschitz gradient continuity), we build majorizing functions yielding data-driven learning rates and derive approximations of the loss's minima, which provide data-informed initialization of the variables. Our image-language objective function, along with these non-trivial optimization insights and ingredients, yields, surprisingly, highly competitive few-shot CLIP performances. Furthermore, LP++ operates in black-box, relaxes intensive validation searches for the optimization hyper-parameters, and runs orders-of-magnitudes faster than state-of-the-art few-shot CLIP adaptation methods. Our code is available at: \url{https://github.com/FereshteShakeri/FewShot-CLIP-Strong-Baseline.git}. The paper introduces LP++, a novel method for few-shot CLIP adaptation that significantly improves upon the standard linear probe (LP) baseline by incorporating text embeddings via learnable class-wise blending parameters, leading to a surprising improvement in performance. This paper is important as it challenges the established notion that LP is a weak baseline in few-shot CLIP adaptation. LP++ demonstrates that a simple, efficient, and black-box approach can achieve state-of-the-art results, outperforming more complex methods like prompt learning and adapters while being computationally efficient and not requiring access to internal representations of pre-trained models. The authors propose a block coordinate Majorize-Minimize (MM) descent algorithm for optimizing a cross-entropy objective function, with data-driven learning rates derived from approximate Lipschitz constants, eliminating the need for extensive hyper-parameter search. Furthermore, they leverage insights from convex optimization to derive approximations of the loss function's minima, leading to data-informed initialization of the variables. LP++ consistently outperforms the standard LP baseline and achieves competitive performance compared to state-of-the-art few-shot CLIP adaptation methods, particularly in low-shot scenarios. It runs orders of magnitude faster than prompt learning methods and avoids the need for intensive hyper-parameter tuning characteristic of adapter-based approaches. Furthermore, LP++ enables black-box adaptation, making it suitable for real-world, privacy-preserving situations where access to model internals is restricted. The paper does not explicitly mention limitations or future work. 
However, potential future work could explore: (1) Applying LP++ to other vision-language tasks beyond image classification. (2) Investigating the impact of different text prompt designs and how to learn them in a data-driven manner. (3) Exploring different block-cycling strategies within the BMM procedure to further improve efficiency. (4) Investigating theoretical guarantees of convergence for LP++ under specific conditions. clip, few-shot learning, optimization, black-box, linear_probe, image_classification, analysis
2312.09323 Perspectives on the State and Future of Deep Learning - 2023 Micah Goldblum, Anima Anandkumar, Richard Baraniuk, Tom Goldstein, Kyunghyun Cho, Zachary C Lipton, Melanie Mitchell, Preetum Nakkiran, Max Welling, Andrew Gordon Wilson The goal of this series is to chronicle opinions and issues in the field of machine learning as they stand today and as they change over time. The plan is to host this survey periodically until the AI singularity paperclip-frenzy-driven doomsday, keeping an updated list of topical questions and interviewing new community members for each edition. In this issue, we probed people's opinions on interpretable AI, the value of benchmarking in modern NLP, the state of progress towards understanding deep learning, and the future of academia. This paper presents a collection of opinions from prominent machine learning researchers on the current state and future directions of the field, covering topics like interpretability, benchmarking, the limitations of current paradigms, and the role of academia. This paper offers valuable insights into the minds of leading experts in machine learning, highlighting key challenges and opportunities that are shaping the field's trajectory. It provides a glimpse into the future of AI research and its potential impact. The authors conducted a survey, presenting a series of open-ended questions to prominent figures in the machine learning community. The interviewees provided their individual perspectives and insights on each topic. Some key findings include a consensus that current benchmarking practices are inadequate for capturing complex model behaviors like common sense. There's also debate on the interpretability of deep learning models, with some believing in its eventual achievement and others expressing skepticism. Additionally, researchers emphasize the need to move beyond scaling existing models and focus on developing new learning paradigms with stronger inductive biases. The paper acknowledges the limitations of current deep learning approaches, particularly concerning data efficiency and the lack of robust theoretical understanding. It suggests exploring alternative architectures, integrating planning into learning algorithms, and emphasizing multimodal learning as promising future directions. analysis, llm, interpretability, benchmarking, deep_learning, transformers, future_of_ai
2404.05014 MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators Shenghai Yuan, Jinfa Huang, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions. A largely overlooked problem in T2V is that existing models have not adequately encoded physical knowledge of the real world, thus generated videos tend to have limited motion and poor variations. In this paper, we propose \textbf{MagicTime}, a metamorphic time-lapse video generation model, which learns real-world physics knowledge from time-lapse videos and implements metamorphic generation. First, we design a MagicAdapter scheme to decouple spatial and temporal training, encode more physical knowledge from metamorphic videos, and transform pre-trained T2V models to generate metamorphic videos. Second, we introduce a Dynamic Frames Extraction strategy to adapt to metamorphic time-lapse videos, which have a wider variation range and cover dramatic object metamorphic processes, thus embodying more physical knowledge than general videos. Finally, we introduce a Magic Text-Encoder to improve the understanding of metamorphic video prompts. Furthermore, we create a time-lapse video-text dataset called \textbf{ChronoMagic}, specifically curated to unlock the metamorphic video generation ability. Extensive experiments demonstrate the superiority and effectiveness of MagicTime for generating high-quality and dynamic metamorphic videos, suggesting time-lapse video generation is a promising path toward building metamorphic simulators of the physical world. This paper introduces MagicTime, a novel approach for generating metamorphic time-lapse videos by incorporating physical knowledge into text-to-video generation models. It leverages time-lapse videos, which capture complete object transformations, to enhance the model's understanding of real-world physics and enable the generation of videos depicting complex phenomena like melting, blooming, or construction. This paper is important because it addresses a significant limitation in current text-to-video generation models: the lack of encoding of real-world physical knowledge. This limitation restricts these models to generating videos with simple motions and limits their ability to depict complex, transformative processes. MagicTime tackles this issue by incorporating time-lapse video data and specialized training strategies, paving the way for more realistic and dynamic video generation. The authors propose MagicTime, a framework that modifies pre-trained text-to-video diffusion models to generate metamorphic time-lapse videos. Key components include: 1) MagicAdapter: decouples spatial and temporal training to encode physical knowledge from metamorphic videos, 2) Dynamic Frames Extraction: adapts to the characteristics of time-lapse videos and prioritizes metamorphic features, and 3) Magic Text-Encoder: refines prompt understanding for metamorphic videos. Additionally, the authors create ChronoMagic, a new dataset of time-lapse videos with detailed captions, to train and evaluate MagicTime. MagicTime generates high-quality metamorphic videos that capture complex transformations and align with textual prompts. It outperforms existing text-to-video generation methods in both qualitative and quantitative evaluations, demonstrating superior visual quality, frame consistency, and text alignment. 
The authors also conduct ablation studies to validate the contribution of each component in MagicTime. The authors acknowledge limitations in evaluating generative models for metamorphic videos due to the lack of established metrics beyond FID, FVD, and CLIP Similarity. They plan to investigate more comprehensive evaluation metrics in future work. Additionally, the authors are exploring the integration of MagicTime with DiT-based architectures, such as Open-Sora-Plan, to further enhance metamorphic video generation capabilities. diffusion_model, video, generation, time-lapse, metamorphic, physics, dataset, magictime, chronomagic
2403.18103 Tutorial on Diffusion Models for Imaging and Vision Stanley H. Chan The astonishing growth of generative tools in recent years has empowered many exciting applications in text-to-image generation and text-to-video generation. The underlying principle behind these generative tools is the concept of diffusion, a particular sampling mechanism that has overcome some shortcomings that were deemed difficult in the previous approaches. The goal of this tutorial is to discuss the essential ideas underlying the diffusion models. The target audience of this tutorial includes undergraduate and graduate students who are interested in doing research on diffusion models or applying these models to solve other problems. This tutorial provides a comprehensive overview of diffusion models for imaging and vision, focusing on the core concepts and mathematical foundations behind these models, such as Variational Autoencoders (VAEs), Denoising Diffusion Probabilistic Models (DDPMs), Score-Matching Langevin Dynamics (SMLDs), and Stochastic Differential Equations (SDEs). Diffusion models have revolutionized generative AI, enabling remarkable applications in text-to-image and text-to-video generation. This tutorial is crucial for understanding the inner workings of these models and for researchers and students aiming to contribute to this burgeoning field or apply diffusion models in various domains. The paper employs a step-by-step approach, beginning with the fundamentals of VAEs and progressively introducing more sophisticated concepts like DDPMs, SMLDs, and SDEs. Each section offers clear explanations, illustrative examples, mathematical derivations, and connections between different perspectives. The paper also discusses training and inference procedures for each model, highlighting the role of denoisers, score functions, and noise schedules. The tutorial effectively elucidates that diffusion models achieve their remarkable performance through incremental updates, gradually transforming noise into coherent data samples. The equivalence between denoising score matching and explicit score matching is a key result, justifying the use of denoisers in diffusion models. The connection between discrete-time diffusion iterations and continuous-time SDEs provides a unifying framework for analyzing and comparing different diffusion models. The tutorial points out that while iterative denoising is currently dominant, it may not be the definitive solution for image generation. Future research could explore more biologically plausible generative processes and address the computational cost associated with diffusion models. The justification for using non-Gaussian noise distributions is also a potential area for investigation. diffusion_model, vae, ddpm, smld, sde, analysis, tutorial, image_generation, denoising
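To make the tutorial's incremental-update view concrete, a minimal sketch of the standard DDPM forward noising step and the simple denoising training objective (model(x_t, t) is assumed to predict the injected noise; hyperparameters are illustrative).

    import torch
    import torch.nn.functional as F

    def ddpm_training_loss(model, x0, alphas_cumprod):
        """Sample a timestep, noise x0 to x_t via q(x_t | x_0), and regress the noise."""
        B = x0.shape[0]
        t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=x0.device)
        a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
        eps = torch.randn_like(x0)
        x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # forward noising
        return F.mse_loss(model(x_t, t), eps)                  # denoising score-matching-style loss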
2311.13127 MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning Yixin Liu, Chenrui Fan, Yutong Dai, Xun Chen, Pan Zhou, Lichao Sun Text-to-image diffusion models allow seamless generation of personalized images from scant reference photos. Yet, these tools, in the wrong hands, can fabricate misleading or harmful content, endangering individuals. To address this problem, existing poisoning-based approaches perturb user images in an imperceptible way to render them "unlearnable" from malicious uses. We identify two limitations of these defending approaches: i) sub-optimal due to the hand-crafted heuristics for solving the intractable bilevel optimization and ii) lack of robustness against simple data transformations like Gaussian filtering. To solve these challenges, we propose MetaCloak, which solves the bi-level poisoning problem with a meta-learning framework with an additional transformation sampling process to craft transferable and robust perturbation. Specifically, we employ a pool of surrogate diffusion models to craft transferable and model-agnostic perturbation. Furthermore, by incorporating an additional transformation process, we design a simple denoising-error maximization loss that is sufficient for causing transformation-robust semantic distortion and degradation in a personalized generation. Extensive experiments on the VGGFace2 and CelebA-HQ datasets show that MetaCloak outperforms existing approaches. Notably, MetaCloak can successfully fool online training services like Replicate, in a black-box manner, demonstrating the effectiveness of MetaCloak in real-world scenarios. Our code is available at https://github.com/liuyixin-louis/MetaCloak. This paper presents MetaCloak, a novel method for protecting user images from unauthorized personalized image generation using DreamBooth by crafting robust perturbations that can withstand data transformations. The paper addresses the growing privacy concern of unauthorized use of personal images for AI-generated content, specifically targeting the vulnerabilities of personalized diffusion models like DreamBooth. The authors propose a meta-learning framework to craft transferable and model-agnostic perturbations by training over a pool of surrogate diffusion models. To enhance robustness against data transformations, they incorporate a transformation sampling process during perturbation crafting and utilize a denoising-error maximization loss to introduce semantic distortion. MetaCloak outperforms existing methods in protecting images under both standard training and training with data transformations, as evidenced by quantitative metrics and qualitative visualizations. It effectively degrades subject detection scores, semantic similarity, and generated image quality. Notably, MetaCloak demonstrates effectiveness in real-world scenarios by successfully fooling online training services like Replicate. The paper acknowledges limitations in terms of potential vulnerability to advanced adversarial purification techniques and reduced effectiveness under low poisoning ratios. Future work suggestions include investigating mechanisms to further improve stealthiness, particularly under large perturbation radii, and exploring methods for effective protection under low poisoning rates. diffusion_model, gan, adversarial_attack, interpretability, data_protection, privacy, dreambooth, poisoning_attack
2402.18956 WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts Yong Hyun Ahn, Hyeon Bae Kim, Seong Tae Kim Recent advancements in neural networks have showcased their remarkable capabilities across various domains. Despite these successes, the "black box" problem still remains. Addressing this, we propose a novel framework, WWW, that offers the 'what', 'where', and 'why' of the neural network decisions in human-understandable terms. Specifically, WWW utilizes adaptive selection for concept discovery, employing adaptive cosine similarity and thresholding techniques to effectively explain 'what'. To address the 'where' and 'why', we propose a novel combination of neuron activation maps (NAMs) with Shapley values, generating localized concept maps and heatmaps for individual inputs. Furthermore, WWW introduces a method for predicting uncertainty, leveraging heatmap similarities to estimate 'how' reliable the prediction is. Experimental evaluations of WWW demonstrate superior performance in both quantitative and qualitative metrics, outperforming existing methods in interpretability. WWW provides a unified solution for explaining 'what', 'where', and 'why', introducing a method for localized explanations from global interpretations and offering a plug-and-play solution adaptable to various architectures. This paper introduces WWW, a novel framework designed to explain neural network decisions by revealing 'what' concept a neuron represents, 'where' in the input image the concept is located, and 'why' the concept contributes to the prediction. The paper addresses the "black box" problem in neural networks, aiming to make their decision-making process more transparent and understandable to humans. This is crucial for building trust and reliability in AI systems, especially given the increasing demand for explainable AI in various domains. WWW comprises three modules: 1) Concept Discovery identifies concepts represented by each neuron using adaptive cosine similarity and adaptive selection. 2) Localization identifies relevant input regions for each concept by combining neuron activation maps with Shapley values. 3) Reasoning identifies important neurons for both the predicted class and the specific input sample, highlighting differences to understand prediction reliability. WWW demonstrates superior performance in both qualitative and quantitative evaluations. It outperforms existing methods in accurately identifying neuron concepts, particularly with larger concept sets. The paper also shows that heatmap similarity, derived from the framework, can be a more effective measure of prediction uncertainty compared to maximum softmax probability. The paper acknowledges limitations in accurately identifying neuron concepts when only a few example images are available. Future work will focus on improving concept discovery by exploring different example selection strategies and concept representations. Another direction is exploring the use of heatmap similarity for misprediction detection and model improvement. interpretability, explanation, neural_network, concept_discovery, shapley_value, neuron_activation_map, heatmap, uncertainty, analysis

2308.08947 Watch Your Steps: Local Image and Scene Editing by Text Instructions Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski Denoising diffusion models have enabled high-quality image generation and editing. We present a method to localize the desired edit region implicit in a text instruction. We leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. This discrepancy is referred to as the relevance map. The relevance map conveys the importance of changing each pixel to achieve the edits, and is used to guide the modifications. This guidance ensures that the irrelevant pixels remain unchanged. Relevance maps are further used to enhance the quality of text-guided editing of 3D scenes in the form of neural radiance fields. A field is trained on relevance maps of training views, denoted as the relevance field, defining the 3D region within which modifications should be made. We perform iterative updates on the training views guided by rendered relevance maps from the relevance field. Our method achieves state-of-the-art performance on both image and NeRF editing tasks. Project page: https://ashmrz.github.io/WatchYourSteps/ This paper presents a method for localizing image and scene edits by leveraging the discrepancy between noise predictions of a diffusion-based image editor with and without text instructions, resulting in a relevance map to guide the editing process. This paper addresses the limitations of existing diffusion-based image editors, particularly their tendency to over-edit. By introducing relevance maps, the method allows for precise control over the editing process, preserving irrelevant regions while ensuring the desired changes are applied effectively to both images and 3D scenes represented as neural radiance fields. The authors propose a relevance map calculation by measuring the difference between noise predictions from InstructPix2Pix (IP2P) with and without the edit instruction. This map, after binarization, guides the IP2P denoising process to confine edits within the relevant region. For 3D scene editing, a relevance field is trained on relevance maps of training views to maintain 3D consistency, guiding iterative updates on the scene. The method demonstrates state-of-the-art performance in both image and NeRF editing tasks. It outperforms baselines in preserving image consistency while achieving comparable edit quality. The relevance maps effectively guide the editing process, preventing over-editing and ensuring the edits are applied to the desired regions. The method produces sharper and higher-quality results compared to previous approaches, particularly in the context of NeRF editing. The authors acknowledge the method's reliance on IP2P, inheriting its limitations. Cases where IP2P fails to interpret the instruction or localize the edit properly pose challenges. Future work could explore better instruction-conditioned diffusion models and address ambiguities in localizing edits for broader applications. diffusion_model, image_editing, 3d, nerf, relevance_map, text-guided, scene_editing, localization
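A minimal sketch of the relevance-map idea: pixels whose noise prediction changes most when the instruction is added are the ones the edit should touch. The `ip2p.predict_noise` interface and the quantile-based binarization threshold are assumptions for illustration.

```python
import torch

@torch.no_grad()
def relevance_map(ip2p, latent, t, image_cond, instruction, keep_quantile=0.9):
    """Relevance = |noise prediction with instruction - without instruction|,
    normalized and thresholded into a binary edit mask."""
    eps_with = ip2p.predict_noise(latent, t, image_cond, text=instruction)
    eps_without = ip2p.predict_noise(latent, t, image_cond, text="")
    rel = (eps_with - eps_without).abs().mean(dim=1, keepdim=True)   # per-pixel relevance
    rel = (rel - rel.amin()) / (rel.amax() - rel.amin() + 1e-8)      # normalize to [0, 1]
    mask = (rel >= torch.quantile(rel, keep_quantile)).float()       # binarize
    return rel, mask

# During denoising, the binary mask confines the edit, e.g.:
#   latent = mask * edited_latent + (1 - mask) * original_latent
```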
2403.11027 Reward Guided Latent Consistency Distillation Jiachen Li, Weixi Feng, Wenhu Chen, William Yang Wang Latent Consistency Distillation (LCD) has emerged as a promising paradigm for efficient text-to-image synthesis. By distilling a latent consistency model (LCM) from a pre-trained teacher latent diffusion model (LDM), LCD facilitates the generation of high-fidelity images within merely 2 to 4 inference steps. However, the LCM's efficient inference is obtained at the cost of the sample quality. In this paper, we propose compensating the quality loss by aligning LCM's output with human preference during training. Specifically, we introduce Reward Guided LCD (RG-LCD), which integrates feedback from a reward model (RM) into the LCD process by augmenting the original LCD loss with the objective of maximizing the reward associated with LCM's single-step generation. As validated through human evaluation, when trained with the feedback of a good RM, the 2-step generations from our RG-LCM are favored by humans over the 50-step DDIM samples from the teacher LDM, representing a 25 times inference acceleration without quality loss. As directly optimizing towards differentiable RMs can suffer from over-optimization, we overcome this difficulty by proposing the use of a latent proxy RM (LRM). This novel component serves as an intermediary, connecting our LCM with the RM. Empirically, we demonstrate that incorporating the LRM into our RG-LCD successfully avoids high-frequency noise in the generated images, contributing to both improved FID on MS-COCO and a higher HPSv2.1 score on HPSv2's test set, surpassing those achieved by the baseline LCM. This paper introduces Reward Guided Latent Consistency Distillation (RG-LCD), a method for enhancing the efficiency and quality of text-to-image synthesis by incorporating feedback from a reward model (RM) into the Latent Consistency Distillation (LCD) process. This paper is important because it addresses the limitations of current Latent Consistency Models (LCMs) for text-to-image synthesis, which prioritize inference speed over sample quality. By integrating human preference through RMs, RG-LCD improves LCMs' generated image quality without sacrificing inference speed. The authors propose RG-LCD, which integrates feedback from a differentiable RM into the LCD process by augmenting the original LCD loss with a reward maximization objective. To avoid reward over-optimization, they introduce a latent proxy RM (LRM) that connects the LCM to the RM, enabling indirect optimization of the expert RM and allowing learning from non-differentiable RMs. They conduct experiments using different RMs (CLIPScore, HPSv2.1, PickScore, ImageReward) and evaluate the generated images with human evaluation and automatic metrics like HPSv2.1 score and FID. Human evaluation shows that the 2-step generations from RG-LCM (HPS) are preferred over the 50-step DDIM generations from the teacher LDM, indicating a 25x speedup without quality loss. RG-LCM (CLIP), despite using a non-preference-trained RM, also outperforms the teacher LDM in 4-step generations. The study found that using an LRM effectively mitigates reward over-optimization, leading to more visually appealing images and addressing the high-frequency noise issue observed when directly optimizing for certain RMs like ImageReward. 
Interestingly, the results also reveal discrepancies between human preferences and automatic metric scores, suggesting current metrics like HPSv2.1 may not fully capture human preferences, particularly concerning high-frequency noise due to the use of image resizing during evaluation. The authors acknowledge limitations in existing automatic metrics for evaluating image quality and call for the development of more robust metrics that eliminate image resizing in their evaluation process. They also suggest exploring the use of LRMs to learn human preferences directly in the latent space as a potential solution. Future work could involve investigating alternative LRM architectures, exploring different reward models and datasets, and applying RG-LCD to other generative modeling tasks beyond text-to-image synthesis. diffusion_model, consistency_distillation, text-to-image, reward_model, image_generation, inference_acceleration, latent_space
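A simplified sketch of the RG-LCD training objective described above: the usual consistency-distillation term plus a term that maximizes a differentiable reward on the LCM's single-step generation. All interfaces (`lcm`, `teacher_solver`, `reward_model`, `vae_decode`) are placeholders, and the consistency target is simplified relative to the paper (which also uses a latent proxy RM to avoid over-optimization).

```python
import torch

def rg_lcd_loss(lcm, teacher_solver, reward_model, vae_decode,
                z_t, t, prompt_emb, reward_weight=1.0):
    """Sketch of reward-guided latent consistency distillation (assumed interfaces)."""
    # Consistency term: the student's single-step prediction should match its own
    # prediction after one teacher ODE step back (simplified to an L2 loss here).
    z0_student = lcm(z_t, t, prompt_emb)
    with torch.no_grad():
        z_prev = teacher_solver(z_t, t, prompt_emb)      # one teacher solver step
        z0_target = lcm(z_prev, t - 1, prompt_emb)
    lcd_loss = torch.nn.functional.mse_loss(z0_student, z0_target)

    # Reward term: decode the single-step generation and push its reward up.
    images = vae_decode(z0_student)
    reward = reward_model(images, prompt_emb).mean()

    return lcd_loss - reward_weight * reward
```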
2311.10770 Exponentially Faster Language Modelling Peter Belcak, Roger Wattenhofer Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights. This paper introduces UltraFastBERT, a variant of the BERT language model that replaces standard feedforward networks with fast feedforward networks (FFFs). UltraFastBERT achieves comparable performance to BERT on downstream tasks while using only a small fraction (0.3%) of its neurons for each inference. This work is significant because it demonstrates the potential of conditional neural execution for significant speed improvements in large language models. By showing that only a small portion of neurons are necessary for individual inferences, it challenges the current paradigm of dense computation in these models and opens the door for more efficient implementations. The authors developed UltraFastBERT by replacing the feedforward layers in crammedBERT with FFFs, organizing neurons into a binary tree and conditionally activating only one branch per inference. They trained various UltraFastBERT configurations on the GLUE benchmark, comparing their performance against BERT-base and crammedBERT. They also implemented and evaluated different CPU and GPU inference implementations to assess the speedup from using FFFs. UltraFastBERT achieved comparable performance to BERT-base on the GLUE benchmark, retaining at least 96% of its performance while using only 0.3% of the neurons for inference. The naive implementation of conditional matrix multiplication (CMM) in FFFs resulted in a speedup of up to 78x on CPUs over standard feedforward layers. While a fully optimized CMM implementation is not yet available, the results highlight the potential for significant speed improvements in language modeling. The authors acknowledge the limitations in the current implementation of CMM, which relies on high-level linear algebra routines and lacks support for efficient vector-level sparsity. Future work includes developing native and optimized implementations of CMM for both CPUs and GPUs, potentially by introducing hybrid vector-level sparse tensors in deep learning libraries and dedicated device programming interfaces. This would enable fully realizing the potential speedup demonstrated by UltraFastBERT. llm, bert, diffusion_model, analysis, performance, optimization, conditional_computation
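A sketch of the fast-feedforward idea behind UltraFastBERT: neurons sit at the nodes of a binary tree and each token only evaluates the neurons on one root-to-leaf path, so a tree of 2^(depth+1)-1 neurons costs only depth+1 neuron evaluations. This is a simplified reading of the mechanism (routing by the sign of each node's pre-activation), not the authors' implementation.

```python
import torch

class FastFeedForward(torch.nn.Module):
    """Simplified fast feedforward (FFF) layer: conditional execution over a binary
    tree of neurons. With depth=11 there are 4095 neurons but only 12 are used per
    token, matching the 12-of-4095 figure quoted in the abstract."""

    def __init__(self, d_model, depth=11):
        super().__init__()
        n_nodes = 2 ** (depth + 1) - 1                    # depth 11 -> 4095 neurons
        self.depth = depth
        self.w_in = torch.nn.Parameter(torch.randn(n_nodes, d_model) * d_model ** -0.5)
        self.w_out = torch.nn.Parameter(torch.randn(n_nodes, d_model) * d_model ** -0.5)

    def forward(self, x):                                 # x: (batch, d_model)
        node = torch.zeros(x.shape[0], dtype=torch.long, device=x.device)  # root
        y = torch.zeros_like(x)
        for _ in range(self.depth + 1):                   # walk one root-to-leaf path
            pre_act = (x * self.w_in[node]).sum(dim=-1)            # node pre-activation
            y = y + torch.nn.functional.gelu(pre_act).unsqueeze(-1) * self.w_out[node]
            go_right = (pre_act > 0).long()               # the sign decides the branch
            node = 2 * node + 1 + go_right                # left child 2i+1, right 2i+2
        return y
```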
2404.15653 CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training of vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in contrastive loss, achieving a remarkable 2.7x acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality. Our source code along with pre-trained model weights and training recipes is available at https://github.com/apple/corenet. This paper introduces CatLIP, a novel weakly supervised approach for pre-training vision models on web-scale image-text data by reframing it as a classification task, achieving a 2.7x speedup over contrastive learning methods like CLIP while maintaining comparable downstream performance. This paper is important because it addresses the computational bottleneck of contrastive learning in image-text pre-training, making it significantly faster and more efficient without compromising accuracy. This is crucial for enabling wider access to and faster research in large-scale pre-training. The authors extracted nouns from text captions, mapped them to WordNet synsets, and trained vision models using binary cross-entropy loss, essentially treating pre-training as a multi-label classification problem. They experimented with various ViT backbones, scaling data and models, and compared their method to CLIP on downstream tasks like image classification, multi-label classification, semantic segmentation, and object detection. Key findings include: (1) CatLIP is 2.7x faster than CLIP while achieving comparable accuracy. (2) Scaling data and model size in CatLIP improves downstream performance. (3) CatLIP enables data-efficient transfer learning by leveraging the pre-trained classifier for initialization. (4) CatLIP generalizes well to complex visual tasks like multi-label classification, semantic segmentation, and object detection, demonstrating the quality of learned representations. The paper acknowledges that while CatLIP achieves promising results, the performance of the largest ViT model starts to saturate on larger datasets, suggesting potential limitations in scaling. Future work could explore longer training, leveraging even larger datasets, or incorporating techniques from contrastive learning to further improve CatLIP's performance. analysis, image_classification, multi-label_classification, semantic_segmentation, object_detection, weakly_supervised_learning, pre-training, vision_transformer, data_efficiency, web-scale_data
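A sketch of the reframing described above: captions are mapped to WordNet noun synsets and the vision model is trained with a multi-label binary cross-entropy loss, avoiding pairwise image-text similarity. Noun extraction here is naively done by word lookup (the paper's pipeline is more careful), and the synset vocabulary and model interfaces are assumed.

```python
import torch
from nltk.corpus import wordnet as wn   # assumes the WordNet corpus has been downloaded

def caption_to_targets(caption, synset_to_idx):
    """Map a caption's words to WordNet noun synsets and build a multi-hot target."""
    target = torch.zeros(len(synset_to_idx))
    for word in caption.lower().split():
        for syn in wn.synsets(word, pos=wn.NOUN)[:1]:     # top noun sense only (simplification)
            idx = synset_to_idx.get(syn.name())
            if idx is not None:
                target[idx] = 1.0
    return target

def catlip_style_loss(vision_model, images, captions, synset_to_idx):
    """Multi-label BCE over synset labels -- no pairwise similarity matrix, which is
    where the reported ~2.7x pre-training speedup comes from."""
    logits = vision_model(images)                          # (batch, num_synsets)
    targets = torch.stack([caption_to_targets(c, synset_to_idx) for c in captions])
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logits, targets.to(logits.device))
```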
2404.05595 UniFL: Improve Stable Diffusion via Unified Feedback Learning Jiacheng Zhang, Jie Wu, Yuxi Ren, Xin Xia, Huafeng Kuang, Pan Xie, Jiashi Li, Xuefeng Xiao, Weilin Huang, Min Zheng, Lean Fu, Guanbin Li Diffusion models have revolutionized the field of image generation, leading to the proliferation of high-quality models and diverse downstream applications. However, despite these significant advancements, the current competitive solutions still suffer from several limitations, including inferior visual quality, a lack of aesthetic appeal, and inefficient inference, without a comprehensive solution in sight. To address these challenges, we present UniFL, a unified framework that leverages feedback learning to enhance diffusion models comprehensively. UniFL stands out as a universal, effective, and generalizable solution applicable to various diffusion models, such as SD1.5 and SDXL. Notably, UniFL incorporates three key components: perceptual feedback learning, which enhances visual quality; decoupled feedback learning, which improves aesthetic appeal; and adversarial feedback learning, which optimizes inference speed. In-depth experiments and extensive user studies validate the superior performance of our proposed method in enhancing both the quality of generated models and their acceleration. For instance, UniFL surpasses ImageReward by 17% user preference in terms of generation quality and outperforms LCM and SDXL Turbo by 57% and 20% in 4-step inference. Moreover, we have verified the efficacy of our approach in downstream tasks, including Lora, ControlNet, and AnimateDiff. This paper introduces UniFL, a novel unified feedback learning framework for improving text-to-image diffusion models. UniFL aims to address limitations in existing models, such as inferior visual quality, lack of aesthetic appeal, and inefficient inference. This paper is important because it presents a comprehensive solution to improve text-to-image diffusion models in multiple aspects. By leveraging various feedback learning techniques, UniFL enhances the visual quality, aesthetic appeal, and inference speed of diffusion models, which are crucial for broader applications and user satisfaction. UniFL achieves its goals through three key components: (1) Perceptual Feedback Learning (PeFL) leverages existing visual perception models (e.g., VGG, instance segmentation models) to enhance specific visual aspects like style and structure. (2) Decoupled Feedback Learning utilizes separate reward models for different aesthetic dimensions (e.g., color, layout, lighting, detail) and incorporates an active prompt selection strategy to mitigate overfitting. (3) Adversarial Feedback Learning treats the reward model as a discriminator in adversarial training, enabling optimization for faster inference without sacrificing quality. UniFL demonstrates superior performance in both quantitative and qualitative evaluations. It outperforms competitive methods like ImageReward, DreamShaper, and DPO in terms of FID, CLIP Score, and aesthetic scores on SD1.5 and SDXL architectures. User studies confirm UniFL's superiority in generation quality and acceleration, surpassing LCM, SDXL-Turbo, and SDXL-Lightning. Notably, UniFL shows promising generalization capabilities, effectively transferring its improvements to downstream tasks like LoRA, ControlNet, and AnimateDiff. 
The authors identify several limitations and future work directions: exploring larger and more advanced visual perception models for enhanced supervision, further improving acceleration towards one-step inference, and streamlining the current two-stage optimization process into a single-stage approach. diffusion_model, feedback_learning, acceleration, aesthetic, quality, inference, text-to-image, perceptual_loss, adversarial_training
2403.12963 FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, Hongsheng Li In this study, we delve into the generation of high-resolution images from pre-trained diffusion models, addressing persistent challenges, such as repetitive patterns and structural distortions, that emerge when models are applied beyond their trained resolutions. To address this issue, we introduce an innovative, training-free approach FouriScale from the perspective of frequency domain analysis. We replace the original convolutional layers in pre-trained diffusion models by incorporating a dilation technique along with a low-pass operation, intending to achieve structural consistency and scale consistency across resolutions, respectively. Further enhanced by a padding-then-crop strategy, our method can flexibly handle text-to-image generation of various aspect ratios. By using the FouriScale as guidance, our method successfully balances the structural integrity and fidelity of generated images, achieving an astonishing capacity of arbitrary-size, high-resolution, and high-quality generation. With its simplicity and compatibility, our method can provide valuable insights for future explorations into the synthesis of ultra-high-resolution images. The code will be released at https://github.com/LeonHLJ/FouriScale. This paper introduces FouriScale, a training-free method to enable pre-trained diffusion models to synthesize high-resolution images without repetitive patterns or structural distortions. The paper addresses a critical limitation of diffusion models, which are typically trained at fixed resolutions, hindering their ability to generate high-quality images at arbitrary sizes. FouriScale offers a simple yet effective solution to this problem, making it highly relevant for various applications requiring high-resolution image generation. FouriScale modifies the convolutional layers within the diffusion model's UNet architecture. It replaces standard convolutions with a combination of dilated convolutions and low-pass filtering to achieve structural and scale consistency across resolutions. It utilizes a padding-then-cropping strategy to generate images with arbitrary aspect ratios and introduces FouriScale guidance for enhanced image quality. FouriScale effectively mitigates pattern repetition and distortions in high-resolution image synthesis, outperforming other training-free methods like Attn-Entro and ScaleCrafter. It exhibits consistent performance across different pre-trained models like SD 1.5, SD 2.1, and SDXL, demonstrating its robustness and generalizability. Quantitative evaluations using FID and KID demonstrate its superior performance over baselines. The authors acknowledge that FouriScale encounters limitations in generating ultra-high-resolution images (e.g., 4096x4096) where artifacts may arise. Additionally, its reliance on convolutional operations restricts its application to purely transformer-based diffusion models. Future work may explore extending FouriScale for ultra-high resolution and adapting it for transformer architectures. diffusion_model, image_synthesis, high_resolution, training-free, frequency_domain, convolutional_neural_networks, generative_models
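A sketch of the two ingredients the FouriScale entry describes: a low-pass operation in FFT space (scale consistency) and a dilated convolution standing in for the original pre-trained convolution (structural consistency) when sampling beyond the training resolution. The cutoff ratio, dilation schedule, and padding are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def low_pass(x, keep_ratio=0.5):
    """Zero out high frequencies of a feature map (rough low-pass filter in FFT space)."""
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    h, w = x.shape[-2:]
    mask = torch.zeros_like(freq.real)
    kh, kw = int(h * keep_ratio / 2), int(w * keep_ratio / 2)
    mask[..., h // 2 - kh:h // 2 + kh, w // 2 - kw:w // 2 + kw] = 1.0
    return torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real

def fouriscale_conv(x, weight, bias=None, scale=2):
    """Replace a pre-trained conv at `scale`x the training resolution: low-pass the
    input, then apply the same kernel with dilation (spatial size is preserved for a
    3x3 kernel via the matching padding)."""
    x = low_pass(x, keep_ratio=1.0 / scale)
    pad = (weight.shape[-1] // 2) * scale
    return F.conv2d(x, weight, bias, padding=pad, dilation=scale)
```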
2310.16834 Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution Aaron Lou, Chenlin Meng, Stefano Ermon Despite their groundbreaking performance for many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel loss that naturally extends score matching to discrete spaces, integrates seamlessly to build discrete diffusion models, and significantly boosts performance. Experimentally, we test our Score Entropy Discrete Diffusion models (SEDD) on standard language modeling tasks. For comparable model sizes, SEDD beats existing language diffusion paradigms (reducing perplexity by 25%-75%) and is competitive with autoregressive models, in particular outperforming GPT-2. Furthermore, compared to autoregressive models, SEDD generates faithful text without requiring distribution annealing techniques like temperature scaling (around 6-8x better generative perplexity than un-annealed GPT-2), can trade compute and quality (similar quality with 32x fewer network evaluations), and enables controllable infilling (matching nucleus sampling quality while enabling other strategies besides left to right prompting). This paper introduces Score Entropy Discrete Diffusion (SEDD), a novel approach for building discrete diffusion models parameterized by the ratios of the data distribution, aiming to address the limitations of existing diffusion models in handling discrete data like natural language. The paper is important because it presents a novel method for discrete diffusion models that outperforms previous models in language modeling tasks, challenges the dominance of autoregressive models, and offers advantages like faster, controllable, and higher-quality generation without relying on distribution annealing techniques. The authors develop a novel loss function called score entropy, analogous to score matching used in continuous diffusion models. They use this loss to train a seq-to-seq transformer model on various language modeling tasks like text8, One Billion Words, and GPT-2 zero-shot tasks. They evaluate their model's performance on perplexity and generation quality, comparing it against existing diffusion models and autoregressive models like GPT-2. SEDD significantly outperforms previous discrete diffusion models on language modeling benchmarks and achieves competitive perplexity scores compared to autoregressive models, even surpassing GPT-2 on some tasks. Furthermore, SEDD generates higher-quality text without distribution annealing techniques and allows for flexible conditional generation, including infilling, matching the performance of models that rely on such techniques. The paper acknowledges limitations such as the gap with modern large language models and the need for exploring better distribution annealing techniques for SEDD. Future work could focus on closing the performance gap with larger LMs, adapting empirical designs from continuous diffusion models, and systematically exploring noise schedules and loss weightings for further improvement. diffusion_model, llm, analysis, language_modeling, text_generation
2404.02883 On the Scalability of Diffusion-based Text-to-Image Generation Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto Scaling up model and data size has been quite successful for the evolution of LLMs. However, the scaling law for the diffusion based text-to-image (T2I) models is not fully explored. It is also unclear how to efficiently scale the model for better performance at reduced cost. The different training settings and expensive training cost make a fair model comparison extremely difficult. In this work, we empirically study the scaling properties of diffusion based T2I models by performing extensive and rigorous ablations on scaling both denoising backbones and training set, including training scaled UNet and Transformer variants ranging from 0.4B to 4B parameters on datasets of up to 600M images. For model scaling, we find the location and amount of cross attention distinguishes the performance of existing UNet designs. And increasing the transformer blocks is more parameter-efficient for improving text-image alignment than increasing channel numbers. We then identify an efficient UNet variant, which is 45% smaller and 28% faster than SDXL's UNet. On the data scaling side, we show the quality and diversity of the training set matters more than simply dataset size. Increasing caption density and diversity improves text-image alignment performance and the learning efficiency. Finally, we provide scaling functions to predict the text-image alignment performance as functions of the scale of model size, compute and dataset size. This paper investigates the scaling properties of diffusion-based text-to-image models, focusing on the denoising backbone and dataset size to understand how to design and train these models effectively. This work is important because it provides insights into the design and training of large-scale text-to-image models, which are computationally expensive to develop. The findings offer practical guidance for improving performance and efficiency in this domain. The authors conducted controlled experiments by training various UNet and Transformer architectures with different sizes and configurations. They also curated and used large-scale datasets, analyzing the impact of dataset size, quality, and caption enhancement on model performance. Key metrics like TIFA, ImageReward, FID, CLIP score, and HPSv2 were used to evaluate the models. The paper demonstrates that SDXL's UNet design is superior to its counterparts, and strategically increasing its transformer depth is more parameter-efficient for better text-image alignment than solely increasing channel numbers. Additionally, they identified an efficient UNet variant with 45% fewer parameters and 28% faster inference than SDXL, achieving comparable performance. The study also highlights that dataset quality matters more than size, and augmenting datasets with synthetic captions significantly improves training efficiency and performance. The paper acknowledges limitations in training Transformers from scratch due to a lack of inductive bias compared to UNets, suggesting further exploration of architectural improvements for Transformers in future work. Additionally, while the study provides valuable insights into scaling laws for text-to-image models, it acknowledges the need for further investigation with even larger models and datasets.
diffusion_model, analysis, text-to-image, unet, transformer, scaling_law, dataset, caption, efficiency
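The abstract above mentions scaling functions that predict text-image alignment from model size, compute, and data. A minimal sketch of how such a saturating power law can be fit; the functional form, the scipy-based fit, and all data points are illustrative assumptions, not the paper's actual measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (relative compute, alignment-score) observations, normalized to the
# smallest run so the fit is well conditioned. Values are made up for illustration.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
score = np.array([0.70, 0.74, 0.78, 0.80, 0.82])        # e.g. a TIFA-style score

def scaling_fn(c, a, b, alpha):
    """Saturating power law: performance approaches `a` as compute grows."""
    return a - b * c ** (-alpha)

params, _ = curve_fit(scaling_fn, compute, score, p0=(0.85, 0.15, 0.5), maxfev=10000)
a, b, alpha = params
print(f"predicted score at 1000x the smallest run: {scaling_fn(1000.0, a, b, alpha):.3f}")
```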
2401.12945 Lumiere: A Space-Time Diffusion Model for Video Generation Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation. This paper introduces Lumiere, a text-to-video diffusion model that synthesizes videos with realistic and coherent motion by generating the entire temporal duration at once using a novel Space-Time U-Net architecture. This paper addresses a critical challenge in video synthesis: generating videos with realistic and coherent motion over extended durations. It deviates from the prevalent cascaded approach and proposes a novel architecture that significantly improves the quality and coherence of generated motion in videos. The authors propose a Space-Time U-Net (STUNet) that processes and generates the entire video simultaneously by downsampling and upsampling the video in both space and time. This architecture leverages a pre-trained text-to-image diffusion model and employs Multidiffusion for spatial super-resolution to ensure temporal consistency. Lumiere demonstrates state-of-the-art results in text-to-video generation, producing high-quality videos with superior motion coherence and visual fidelity compared to existing methods. It also exhibits strong performance in various downstream tasks, including image-to-video generation, video inpainting, and stylized generation. The paper acknowledges limitations in generating multi-shot videos or those involving scene transitions. Future work could explore extending Lumiere to address these limitations and investigate its application to latent video diffusion models. diffusion_model, video, motion, text-to-video, video_generation, image-to-video, video_inpainting, stylized_generation
2311.01462 Idempotent Generative Network Assaf Shocher, Amil Dravid, Yossi Gandelsman, Inbar Mosseri, Michael Rubinstein, Alexei A. Efros We propose a new approach for generative modeling based on training a neural network to be idempotent. An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, namely $f(f(z))=f(z)$. The proposed model $f$ is trained to map a source distribution (e.g, Gaussian noise) to a target distribution (e.g. realistic images) using the following objectives: (1) Instances from the target distribution should map to themselves, namely $f(x)=x$. We define the target manifold as the set of all instances that $f$ maps to themselves. (2) Instances that form the source distribution should map onto the defined target manifold. This is achieved by optimizing the idempotence term, $f(f(z))=f(z)$ which encourages the range of $f(z)$ to be on the target manifold. Under ideal assumptions such a process provably converges to the target distribution. This strategy results in a model capable of generating an output in one step, maintaining a consistent latent space, while also allowing sequential applications for refinement. Additionally, we find that by processing inputs from both target and source distributions, the model adeptly projects corrupted or modified data back to the target manifold. This work is a first step towards a ``global projector'' that enables projecting any input into a target data distribution. This paper introduces Idempotent Generative Networks (IGN), a novel approach to generative modeling that trains a neural network to be idempotent, meaning applying it repeatedly yields the same result as the initial application. This paper is significant because it presents a new perspective on generative modeling with unique advantages: one-step generation, optional sequential refinement, consistent latent space, and the potential for acting as a "global projector" to map various input distributions onto a target manifold. The authors propose a training methodology with three key objectives: 1) Reconstruction: Data samples should be mapped to themselves. 2) Idempotence: Applying the network twice should yield the same result as applying it once. 3) Tightness: The set of instances mapped to themselves should be minimized. They achieve this through a novel self-adversarial training scheme using a single network. The paper provides theoretical guarantees of IGN's convergence to the target distribution under ideal conditions. Experiments on MNIST and CelebA datasets demonstrate IGN's ability to generate realistic images from noise, perform latent space manipulations, and project out-of-distribution images (noisy, grayscale, sketches) onto the learned image manifold. The authors acknowledge limitations such as mode collapse and blurriness in generated images, suggesting potential solutions like GAN mode collapse prevention techniques and perceptual or two-step loss functions. Future work aims to scale up IGN by training on larger datasets to explore its full potential. diffusion_model, gan, generative_model, idempotence, image_generation, latent_space, projection, out-of-distribution
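A sketch of the three IGN objectives described above (reconstruction, idempotence, tightness). The paper routes gradients through only one application of f per term via a frozen copy of the network; the loss weighting and the exact routing here are simplified assumptions.

```python
import copy
import torch
import torch.nn.functional as F

def ign_losses(f, x_real, z, lambda_tight=0.1):
    """Simplified Idempotent Generative Network objectives (assumed weighting)."""
    f_frozen = copy.deepcopy(f)
    for p in f_frozen.parameters():
        p.requires_grad_(False)

    # (1) Reconstruction: real data should be fixed points, f(x) = x.
    loss_rec = F.l1_loss(f(x_real), x_real)

    # (2) Idempotence: map f(z) onto the fixed-point manifold; the gradient flows
    #     through the *outer* application only (inner uses the frozen copy).
    fz = f_frozen(z)
    loss_idem = F.l1_loss(f(fz), fz)

    # (3) Tightness: keep the fixed-point set from collapsing onto everything; the
    #     gradient flows through the *inner* application only.
    fz_inner = f(z)
    loss_tight = -F.l1_loss(f_frozen(fz_inner), fz_inner)

    return loss_rec + loss_idem + lambda_tight * loss_tight
```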
2405.01496 LocInv: Localization-aware Inversion for Text-Guided Image Editing Chuanming Tang, Kai Wang, Fei Yang, Joost van de Weijer Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation capabilities based on textual prompts. Based on the T2I diffusion models, text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts. However, existing image editing techniques are prone to editing over unintentional regions that are beyond the intended target area, primarily due to inaccuracies in cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps in the denoising phases of the diffusion process. Through the dynamic updating of tokens corresponding to noun words in the textual input, we are compelling the cross-attention maps to closely align with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing over particular objects while preventing undesired changes to other regions. Our method LocInv, based on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset, and consistently obtains superior results both quantitatively and qualitatively. The code will be released at https://github.com/wangkai930418/DPL This paper introduces Localization-aware Inversion (LocInv), a novel method for text-guided image editing that leverages localization priors like segmentation maps or bounding boxes to enhance the accuracy of cross-attention maps in diffusion models, thereby improving the precision of object manipulation and attribute editing. This paper addresses the crucial issue of cross-attention leakage in text-guided image editing with diffusion models. Existing methods often struggle to precisely edit intended objects, leading to unintended alterations in other image regions. LocInv tackles this problem by incorporating readily available localization information, leading to more accurate and controllable image editing. LocInv utilizes pre-trained Stable Diffusion models and incorporates localization priors (segmentation maps or bounding boxes) obtained from datasets or foundation models. By dynamically updating tokens associated with noun words during the denoising process, it refines cross-attention maps, enforcing better alignment with target objects. Additionally, for attribute editing, LocInv introduces an adjective binding loss to align adjective representations with corresponding nouns, improving the model's ability to edit object attributes. Through extensive evaluations on a subset of the COCO dataset, LocInv consistently outperforms existing text-guided image editing methods in both quantitative metrics (LPIPS, SSIM, PSNR, CLIP Score, DINO-Sim) and qualitative comparisons. The method shows superior performance in local object Word-Swap tasks, preserving background integrity while accurately replacing target objects. Notably, LocInv demonstrates the novel capability for Attribute-Edit, successfully modifying object colors and materials by binding adjective and noun representations, a feature unexplored by most existing methods. The authors acknowledge limitations related to the resolution of cross-attention maps, the editing capabilities of frozen Stable Diffusion models, and challenges in reconstructing high-frequency image details.
Future work aims to explore pixel-level text-to-image models for finer control, integrate techniques like InstructPix2Pix for enhanced editing, and address limitations in reconstructing intricate image details. diffusion_model, image_editing, text-guided, cross-attention, localization, segmentation, bounding_box, stable_diffusion, attribute_editing, word-swap
2401.12086 West-of-N: Synthetic Preference Generation for Improved Reward Modeling Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn The success of reinforcement learning from human feedback (RLHF) in language model alignment is strongly dependent on the quality of the underlying reward model. In this paper, we present a novel approach to improve reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. Motivated by the promising results of Best-of-N sampling strategies in language model training, we extend their application to reward model training. This results in a self-training strategy to generate preference pairs by selecting the best and worst candidates in a pool of responses to a given query. Empirically, we find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data. This work opens up new avenues of research for improving RLHF for language model alignment, by offering synthetic preference generation as a solution to reward modeling challenges. This paper introduces West-of-N, a novel method for improving reward models in Reinforcement Learning from Human Feedback (RLHF) by generating synthetic preference data using Best-of-N sampling from a language model. This work addresses the critical bottleneck of data scarcity in RLHF by proposing a scalable method for generating high-quality, on-policy preference data, potentially reducing the reliance on expensive and time-consuming human annotations. The authors propose a self-training strategy where a base preference model (trained on initial data) is used to select the most and least preferred responses (West-of-N) from a pool generated by the language model. These synthetic preferences are then used to train a more accurate reward model. Empirical results show that West-of-N significantly improves reward model accuracy and downstream language model alignment, outperforming baseline methods like RLAIF and RLCD. Notably, the gains from West-of-N are comparable to doubling the amount of human preference data. Limitations include potential reward hacking by the base model when identifying West-of-N pairs with very large N. Future work could address this through reward model uncertainty estimation. Additionally, exploring other self-training techniques from the literature could further enhance West-of-N. diffusion_model, llm, rlhf, preference_modeling, synthetic_data, self_training, best-of-n, reward_modeling
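A minimal sketch of West-of-N pair construction as described above: sample N on-policy responses, score them with the current (base) preference model, and keep the best and worst as a synthetic preference pair for further reward-model training. The `policy.generate` and `base_preference_model.score` interfaces are placeholders.

```python
import torch

@torch.no_grad()
def west_of_n_pair(policy, base_preference_model, query, n=8):
    """Build one synthetic preference pair from N on-policy samples."""
    responses = [policy.generate(query) for _ in range(n)]
    scores = torch.tensor([base_preference_model.score(query, r) for r in responses])
    chosen = responses[scores.argmax().item()]      # "best-of-N"
    rejected = responses[scores.argmin().item()]    # "worst-of-N"
    return {"prompt": query, "chosen": chosen, "rejected": rejected}
```

Pairs produced this way are then added to the reward model's training set, which is where the reported gains (comparable to adding a similar amount of human preference data) come from.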
2403.02327 Model Lakes Koyena Pal, David Bau, Renée J. Miller Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how models are different one from another. Currently, practitioners rely on manually-written documentation to understand and choose models. However, not all models have complete and reliable documentation. As the number of machine learning models increases, this issue of finding, differentiating, and understanding models is becoming more crucial. Inspired from research on data lakes, we introduce and define the concept of model lakes. We discuss fundamental research challenges in the management of large models. And we discuss what principled data management techniques can be brought to bear on the study of large model management. This paper introduces the concept of "model lakes" as a way to manage and understand the growing number of deep learning models, drawing parallels to data lakes in the data management field. This paper is important because it addresses the difficulty in finding, understanding, and comparing deep learning models due to the reliance on often incomplete or unreliable manual documentation. It proposes model lakes, inspired by data lakes, as a potential solution to these challenges. This is a vision paper: it draws analogies from the data management literature, particularly data lakes, and proposes a roadmap for future research in model management; it does not perform any experiments. As a vision paper, it reports no experimental results. Instead, it proposes a model lake framework, outlines key challenges such as content-based model search, related model search, documentation verification, data citation, provenance, and version control, and discusses potential approaches inspired by solutions developed for data lakes in data management. The authors identify limitations in current model management practices, including reliance on incomplete metadata and manual documentation. They propose future work on content-based model search, automated documentation verification, data citation for models, model provenance tracking, and model version management, emphasizing the need for standardized benchmarks and evaluation metrics. model_lake, model_management, model_search, model_provenance, model_versioning, analysis, literature_review
2312.00785 Sequential Modeling Enables Scalable Learning for Large Vision Models Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time. This paper introduces a Large Vision Model (LVM) trained solely on a massive dataset of visual data, formatted as "visual sentences," without relying on any linguistic information. This approach involves tokenizing images into discrete tokens using VQGAN and training a causal transformer model to predict the next token, enabling various vision tasks to be performed through visual prompting. This paper is significant as it explores the potential of building large vision models analogous to large language models, demonstrating that visual understanding can be achieved without relying on language data. It pushes the boundaries of self-supervised learning in vision and paves the way for more general and scalable visual models capable of handling diverse tasks through in-context learning. The authors curated a massive, diverse dataset of visual data called UVDv1, encompassing single images, image sequences, annotated images, annotated image sequences, and 3D synthetic objects, totaling 1.64 billion images. They introduced the concept of "visual sentences" to unify various data formats, treating each sentence as a sequence of visual tokens generated by a VQGAN tokenizer. A causal transformer model was trained to predict the next token in the sequence, enabling in-context learning for downstream tasks through visual prompting. The paper demonstrates that the LVM exhibits strong scaling behavior, with larger models and more data leading to better performance on various vision tasks such as semantic segmentation, depth estimation, and keypoint detection, even outperforming some task-specific models on unseen datasets. The model also showcases an ability to generalize to novel tasks, handle out-of-distribution data, and perform basic visual reasoning, suggesting potential for more advanced visual understanding. The authors acknowledge limitations such as computational constraints, under-constrained visual prompting compared to language, tokenizer limitations, and the relatively small size of the LVM compared to LLMs. Future work includes scaling up the model and exploring its capabilities in visual reasoning, emergence, and generalization. diffusion_model, llm, analysis, 3d, motion, video, interpretability
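A sketch of the training recipe this entry describes: each frame of a "visual sentence" is tokenized with a VQGAN-style tokenizer, the discrete tokens are concatenated into one sequence, and a causal transformer is trained with ordinary next-token cross-entropy. The `tokenizer.encode` and `transformer` interfaces are assumed placeholders.

```python
import torch

def visual_sentence_loss(tokenizer, transformer, frames):
    """Next-token prediction over the flattened tokens of one visual sentence.
    `frames` is a list of image tensors; `tokenizer.encode` returns (1, n_tokens)
    integer codes; `transformer` returns logits over the visual vocabulary."""
    tokens = torch.cat([tokenizer.encode(f) for f in frames], dim=1)   # (1, seq_len)
    logits = transformer(tokens[:, :-1])                                # causal LM
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),   # (seq_len - 1, vocab_size)
        tokens[:, 1:].reshape(-1),              # shifted targets
    )
```

At test time, the same model is prompted with a partial visual sentence (e.g. image/annotation pairs) and asked to continue it, which is how the various downstream tasks are posed.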
2308.15321 Elucidating the Exposure Bias in Diffusion Models Mang Ning, Mingxiao Li, Jianlin Su, Albert Ali Salah, Itir Onal Ertugrul Diffusion models have demonstrated impressive generative capabilities, but their exposure bias problem, described as the input mismatch between training and sampling, lacks in-depth exploration. In this paper, we systematically investigate the exposure bias problem in diffusion models by first analytically modelling the sampling distribution, based on which we then attribute the prediction error at each sampling step as the root cause of the exposure bias issue. Furthermore, we discuss potential solutions to this issue and propose an intuitive metric for it. Along with the elucidation of exposure bias, we propose a simple, yet effective, training-free method called Epsilon Scaling to alleviate the exposure bias. We show that Epsilon Scaling explicitly moves the sampling trajectory closer to the vector field learned in the training phase by scaling down the network output, mitigating the input mismatch between training and sampling. Experiments on various diffusion frameworks (ADM, DDIM, EDM, LDM, DiT, PFGM++) verify the effectiveness of our method. Remarkably, our ADM-ES, as a state-of-the-art stochastic sampler, obtains 2.17 FID on CIFAR-10 under 100-step unconditional generation. The code is available at https://github.com/forever208/ADM-ES and https://github.com/forever208/EDM-ES. This paper investigates the exposure bias problem in diffusion models, where the input mismatch between training and sampling leads to error accumulation and sampling drift. The paper analyzes the sampling distribution with prediction error, proposes a metric for quantifying exposure bias, and introduces Epsilon Scaling, a training-free method for alleviating this issue by scaling down the network output during sampling. The paper is important because it provides an in-depth analysis of the exposure bias problem in diffusion models, which is a key factor affecting sample quality, especially in fast sampling scenarios. The proposed Epsilon Scaling method offers a simple yet effective solution to improve sample quality without retraining, making it widely applicable across different diffusion model architectures and samplers. The authors first analytically model the sampling distribution by considering the prediction error. Then, they propose a metric (variance error) to quantify the exposure bias at each timestep. To address the exposure bias issue, they propose Epsilon Scaling, a training-free method that scales down the network output (epsilon) during sampling based on a linear schedule derived from the accumulated error. The authors evaluate their method using FID scores on various datasets (CIFAR-10, LSUN, FFHQ, ImageNet) and diffusion frameworks (ADM, DDIM, DDPM, EDM, LDM, DiT, PFGM++). Epsilon Scaling consistently improves FID scores across various diffusion frameworks, datasets, and conditional settings. For instance, ADM-ES obtains 2.17 FID on CIFAR-10 under 100-step unconditional generation, outperforming previous state-of-the-art stochastic samplers. Epsilon Scaling is shown to effectively reduce exposure bias by moving the sampling trajectory closer to the vector field learned during training. The method exhibits insensitivity to the scaling parameter, requiring minimal effort to search for an optimal value.
The authors acknowledge that Epsilon Scaling corrects only the magnitude error of the network prediction, not the direction error, implying there is still room for improvement. Future work could focus on exploring methods to further reduce the exposure bias by addressing the direction error. Another avenue for future work is investigating the effectiveness of Epsilon Scaling on other diffusion-based applications beyond image generation, such as audio and video generation. diffusion_model, exposure_bias, sampling, fid, analysis, training-free, image_generation, adm, ddim, ddpm, edm, ldm, dit, pfgm++
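A minimal sketch of the training-free fix described above: divide the predicted noise by a per-step factor slightly above 1 during sampling. The linear schedule values here are illustrative, not the tuned values reported in the paper.

```python
import torch

def epsilon_scaling_schedule(T, start=1.004, end=1.000):
    """Linear per-step scaling factors, slightly above 1 (illustrative values)."""
    return torch.linspace(start, end, T)

@torch.no_grad()
def scaled_eps(eps_model, x_t, t, lambdas):
    """Epsilon Scaling: shrink the noise prediction at sampling time to pull the
    trajectory back toward the vector field seen during training. Plugs into any
    sampler (DDPM/DDIM/...) by replacing its call to the noise predictor."""
    return eps_model(x_t, t) / lambdas[t]
```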
2402.17177 Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, Lichao Sun Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and show potential in simulating the physical world. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model's background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models. We first trace Sora's development and investigate the underlying technologies used to build this "world simulator". Then, we describe in detail the applications and potential impact of Sora in multiple industries ranging from film-making and education to marketing. We discuss the main challenges and limitations that need to be addressed to widely deploy Sora, such as ensuring safe and unbiased video generation. Lastly, we discuss the future development of Sora and video generation models in general, and how advancements in the field could enable new ways of human-AI interaction, boosting productivity and creativity of video generation. This paper provides a comprehensive review of Sora, OpenAI's text-to-video generation model, exploring its background, related technologies, potential applications, limitations, and future directions. Sora represents a significant breakthrough in AI, demonstrating the ability to generate high-quality, minute-long videos from text prompts, thus marking a milestone in AI-powered video generation and opening up possibilities in various fields. The paper combines analysis of published technical reports and reverse engineering based on existing literature to dissect Sora's architecture, training methodologies, and capabilities. The authors provide insights into Sora's architecture, including data pre-processing, the use of diffusion transformers, language instruction following, and prompt engineering. They highlight Sora's ability to handle variable video durations and resolutions, simulate complex scenes, and produce high-quality videos, while also pointing out current limitations in physical realism and human-computer interaction. The paper identifies limitations like challenges in accurately depicting complex physical interactions, maintaining temporal accuracy, and limitations in user control for detailed modifications. It suggests future research directions such as exploring more robust training datasets, improving realism in physical simulations, and enhancing user interaction capabilities for finer control over video generation. diffusion_model, llm, analysis, video, sora
2404.15447 GLoD: Composing Global Contexts and Local Details in Image Generation Moyuru Yamada Diffusion models have demonstrated their capability to synthesize high-quality and diverse images from textual prompts. However, simultaneous control over both global contexts (e.g., object layouts and interactions) and local details (e.g., colors and emotions) still remains a significant challenge. The models often fail to understand complex descriptions involving multiple objects and reflect specified visual attributes to wrong targets or ignore them. This paper presents Global-Local Diffusion (GLoD), a novel framework which allows simultaneous control over the global contexts and the local details in text-to-image generation without requiring training or fine-tuning. It assigns multiple global and local prompts to corresponding layers and composes their noises to guide a denoising process using pre-trained diffusion models. Our framework enables complex global-local compositions, conditioning objects in the global prompt with the local prompts while preserving other unspecified identities. Our quantitative and qualitative evaluations demonstrate that GLoD effectively generates complex images that adhere to both user-provided object interactions and object details. This paper introduces Global-Local Diffusion (GLoD), a novel framework for controllable text-to-image generation using diffusion models, which allows simultaneous control over global contexts (e.g., object layouts and interactions) and local details (e.g., colors and emotions) by composing multiple global and local prompts. This paper addresses the limitation of existing diffusion-based text-to-image generation methods that struggle to simultaneously control both global and local aspects of the generated image. GLoD offers a training-free approach for more complex and controllable image synthesis, which is crucial for real-world applications. GLoD leverages pre-trained diffusion models and utilizes a novel layer composition approach. It takes global and local prompts as input, generates separate noises for each prompt, and then composes them effectively using global and local guidance mechanisms. This allows the model to incorporate both global context and local details into the generated image. GLoD demonstrates superior performance in generating complex images that adhere to both global contexts and local details specified by the user. Quantitative evaluation shows improved alignment scores for both global and local attributes compared to existing methods, demonstrating better controllability. GLoD also effectively reduces undesirable attribute interference between objects in a scene. One limitation identified is the potential for partial object appearance changes when the latent representation of the object differs significantly between the global and local prompts. Future work could explore techniques to mitigate this issue. Additionally, expanding the framework to handle more complex relationships between objects and exploring its application to other domains like video or 3D object generation are promising directions. diffusion_model, text-to-image generation, controllable image synthesis, global context, local detail, layer composition, training-free
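A rough sketch of the layered noise composition this entry describes: a global prompt steers the whole image while each (mask, local prompt) layer steers its own region. The exact guidance formulation in GLoD is more involved, so treat this only as the general idea; `eps_model(x, t, cond)` and the single shared guidance scale are assumptions.

```python
import torch

@torch.no_grad()
def glod_style_noise(eps_model, x_t, t, global_emb, local_prompts, guidance=7.5):
    """Compose one global prompt with several (mask, local prompt) layers into a
    single noise estimate for the current denoising step."""
    eps_uncond = eps_model(x_t, t, None)
    eps_global = eps_model(x_t, t, global_emb)
    eps = eps_uncond + guidance * (eps_global - eps_uncond)   # classifier-free guidance
    for mask, local_emb in local_prompts:                     # mask: (1, 1, H, W) in {0, 1}
        eps_local = eps_model(x_t, t, local_emb)
        # Inside each layer's region, steer the global estimate toward the local prompt;
        # outside the masks, the global/unconditional estimate is left untouched.
        eps = eps + mask * guidance * (eps_local - eps_global)
    return eps
```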
2404.12382 Lazy Diffusion Transformer for Interactive Image Editing Yotam Nitzan, Zongze Wu, Richard Zhang, Eli Shechtman, Daniel Cohen-Or, Taesung Park, Michaël Gharbi We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image. This paper introduces LazyDiffusion, a novel diffusion transformer model designed for efficient partial image generation, particularly targeting interactive image editing applications like inpainting. This work is important as it addresses the inefficiency of traditional inpainting methods that regenerate the entire image, even when editing small portions. LazyDiffusion offers a significant speedup for localized edits while maintaining global consistency, making diffusion models more practical for interactive workflows. The authors propose a two-stage approach: 1) A context encoder processes the entire image and mask to extract a compact global context specific to the masked region. 2) A diffusion-based transformer decoder iteratively generates only the masked pixels, conditioned on this context and the user's text prompt. This approach ensures global coherence while significantly reducing computational cost by focusing solely on the area of interest. LazyDiffusion achieves a speedup of up to 10x compared to full-image inpainting methods for masks covering 10% of the image. It demonstrates competitive quality with state-of-the-art inpainting models, especially in scenarios requiring high semantic context, indicating the effectiveness of its compressed context representation. User studies confirm a strong preference for LazyDiffusion over crop-based methods and comparable preference to full-image methods. The authors acknowledge limitations regarding the context encoder's quadratic scaling with input size, potentially limiting scalability to ultra-high-resolution images. They also identify occasional color inconsistencies between generated and visible regions. Future work could explore more efficient context encoding mechanisms and more principled solutions for seamless blending. diffusion_model, transformer, inpainting, image_editing, interactive, context_encoding, latent_space, efficiency, poisson_blending
2401.04056 A Minimaximalist Approach to Reinforcement Learning from Human Feedback Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, Alekh Agarwal We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback. Our approach is minimalist in that it does not require training a reward model nor unstable adversarial training and is therefore rather simple to implement. Our approach is maximalist in that it provably handles non-Markovian, intransitive, and stochastic preferences while being robust to the compounding errors that plague offline approaches to sequential prediction. To achieve the preceding qualities, we build upon the concept of a Minimax Winner (MW), a notion of preference aggregation from the social choice theory literature that frames learning from preferences as a zero-sum game between two policies. By leveraging the symmetry of this game, we prove that rather than using the traditional technique of dueling two policies to compute the MW, we can simply have a single agent play against itself while maintaining strong convergence guarantees. Practically, this corresponds to sampling multiple trajectories from a policy, asking a rater or preference model to compare them, and then using the proportion of wins as the reward for a particular trajectory. We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches while maintaining robustness to the intransitive and stochastic preferences that frequently occur in practice when aggregating human judgments. This paper presents Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback that dispenses with reward-model training and adversarial training by having a single policy play against itself and using preference-based win rates as rewards. The work is important because it offers a simple, theoretically grounded alternative to reward-model-based RLHF that provably handles non-Markovian, intransitive, and stochastic preferences while remaining robust to the compounding errors that affect offline approaches to sequential prediction. The authors build on the Minimax Winner, a preference-aggregation concept from social choice theory that frames learning from preferences as a zero-sum game between two policies. Exploiting the symmetry of this game, they prove that a single agent playing against itself can compute the Minimax Winner while retaining strong convergence guarantees. In practice, they sample multiple trajectories from the current policy, ask a rater or preference model to compare them, and use each trajectory's proportion of wins as its reward. On a suite of continuous control tasks, SPO learns significantly more efficiently than reward-model-based approaches while remaining robust to the intransitive and stochastic preferences that frequently arise when aggregating human judgments. The reported experiments are confined to continuous control benchmarks in which a rater or preference model supplies the comparisons rather than live human feedback. reinforcement_learning, rlhf, preference_optimization, self-play, minimax_winner, social_choice_theory, continuous_control
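SPO's practical recipe reduces to scoring each sampled trajectory by its win rate under pairwise comparisons. The sketch below is illustrative only: preference is a hypothetical stand-in for a rater or preference model, and the trajectories are toy Python lists rather than rollouts from a real policy.

    import itertools

    def preference(traj_a, traj_b):
        """Hypothetical rater: returns 1 if traj_a is preferred, else 0.
        In SPO this would be a human rater or a learned preference model,
        and it may be intransitive or stochastic."""
        return 1 if sum(traj_a) > sum(traj_b) else 0

    def self_play_rewards(trajectories):
        """Reward each trajectory by the fraction of pairwise comparisons it
        wins against the other trajectories sampled from the same policy."""
        n = len(trajectories)
        wins = [0] * n
        for i, j in itertools.combinations(range(n), 2):
            outcome = preference(trajectories[i], trajectories[j])
            wins[i] += outcome
            wins[j] += 1 - outcome
        return [w / (n - 1) for w in wins]

    # Toy trajectories: lists of per-step returns from the same policy.
    trajs = [[1.0, 0.5], [0.2, 0.1], [0.9, 0.9]]
    print(self_play_rewards(trajs))  # win rates usable as RL rewards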
2309.09887 On Model Explanations with Transferable Neural Pathways Xinmiao Lin, Wentao Bao, Qi Yu, Yu Kong Neural pathways as model explanations consist of a sparse set of neurons that provide the same level of prediction performance as the whole model. Existing methods primarily focus on accuracy and sparsity but the generated pathways may offer limited interpretability thus fall short in explaining the model behavior. In this paper, we suggest two interpretability criteria of neural pathways: (i) same-class neural pathways should primarily consist of class-relevant neurons; (ii) each instance's neural pathway sparsity should be optimally determined. To this end, we propose a Generative Class-relevant Neural Pathway (GEN-CNP) model that learns to predict the neural pathways from the target model's feature maps. We propose to learn class-relevant information from features of deep and shallow layers such that same-class neural pathways exhibit high similarity. We further impose a faithfulness criterion for GEN-CNP to generate pathways with instance-specific sparsity. We propose to transfer the class-relevant neural pathways to explain samples of the same class and show experimentally and qualitatively their faithfulness and interpretability. This paper introduces GEN-CNP, a novel method for generating class-relevant neural pathway explanations for image recognition models, aiming to improve the interpretability of model explanations while maintaining faithfulness to the original model. The paper addresses the limitations of existing neural pathway explanation methods that often lack interpretability and rely on global sparsity. It proposes class-wise and instance-specific interpretability concepts, enhancing the understanding of model behavior by revealing class-relevant features and allowing the transferability of explanations to other samples within the same class. The authors propose GEN-CNP, a model that learns to predict neural pathways from the target model's feature maps. GEN-CNP uses Recursive Feature Embedders (RFEs) to extract feature patterns and Pathway Distillation Network (PDN) to learn class-relevant information from them. It utilizes Recursive Pathway Decoders (RPDs) with Distance Aware Quantization (DAQ) to decode importance scores into sparse and faithful neural pathways. They train GEN-CNP using knowledge distillation with sparsity constraints to ensure faithfulness to the target model and generate sparse explanations. The proposed GEN-CNP method generates neural pathways with higher faithfulness to the original model, as demonstrated by improved performance on metrics like mIC and mDC. The generated pathways exhibit higher class-relevance, confirmed by higher acIOU scores and the transferability experiments, showing consistent and faithful explanations for samples within the same class. Qualitative visualizations using Grad-CAM and neural pathway gradients highlight that GEN-CNP identifies more semantically meaningful features compared to existing methods. The authors acknowledge limitations in terms of computational cost and the current implementation's focus on image recognition models. Future work could explore more computationally efficient architectures for GEN-CNP and extend its applicability to other domains beyond image recognition, such as natural language processing or time series analysis. analysis, interpretability, neural_pathway, knowledge_distillation, image_recognition
2308.07648 Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval Chaorui Deng, Qi Chen, Pengda Qin, Da Chen, Qi Wu In text-video retrieval, recent works have benefited from the powerful learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain. A critical problem for them is how to effectively capture the rich semantics inside the video using the image encoder of CLIP. To tackle this, state-of-the-art methods adopt complex cross-modal modeling techniques to fuse the text information into video frame representations, which, however, incurs severe efficiency issues in large-scale retrieval systems as the video representations must be recomputed online for every text query. In this paper, we discard this problematic cross-modal fusion process and aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts. Concretely, we first introduce a spatial-temporal "Prompt Cube" into the CLIP image encoder and iteratively switch it within the encoder layers to efficiently incorporate the global video semantics into frame representations. We then propose to apply an auxiliary video captioning objective to train the frame representations, which facilitates the learning of detailed video semantics by providing fine-grained guidance in the semantic space. With a naive temporal fusion strategy (i.e., mean-pooling) on the enhanced frame representations, we obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC. This paper proposes Prompt Switch, an efficient method for adapting the CLIP model for text-video retrieval by introducing a Prompt Cube mechanism to enhance the learning of global and detailed video semantics, achieving state-of-the-art performance while maintaining high efficiency. This paper addresses the efficiency bottleneck in existing CLIP-based text-video retrieval methods that rely on computationally expensive cross-modal fusion. It proposes a novel approach to enhance video representation learning within the CLIP framework, enabling efficient and effective retrieval by decoupling video and text modalities during inference. The authors introduce a Prompt Cube, a 3D tensor integrated into the CLIP image encoder. This cube undergoes a Prompt Switch operation, transposing its spatial and temporal dimensions before each self-attention layer to capture global video semantics. Additionally, an auxiliary video captioning objective is employed during training to enhance the learning of detailed video semantics. Finally, a simple mean pooling strategy is used on the enhanced frame representations to obtain the video representation. The proposed Prompt Switch method achieves state-of-the-art performance on three benchmark datasets (MSR-VTT, MSVD, LSMDC) for text-video retrieval, outperforming previous methods, especially under the text-agnostic temporal fusion setting. It demonstrates a significant improvement in efficiency compared to methods relying on cross-modal temporal fusion, making it more suitable for large-scale retrieval systems. The authors acknowledge that their captioning module is relatively simple and might benefit from more advanced architectures. For future work, they suggest exploring other pre-training tasks or incorporating external knowledge to further enhance the model's performance. 
clip, text-video retrieval, video representation learning, prompt learning, efficiency, auxiliary task learning, captioning
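The "Prompt Switch" in the row above can be pictured as transposing a square prompt tensor between self-attention layers. The sketch below is a rough approximation with assumed shapes (the cube is taken to be square in its frame and prompt axes) and off-the-shelf attention layers; it shows only the switching mechanics, not the paper's full encoder.

    import torch

    def prompt_switch(frame_tokens, prompt_cube, attn_layers):
        """frame_tokens: (T, N, D) patch tokens for T frames, N patches each.
        prompt_cube:  (T, P, D) prompt tokens attached to each frame, with
        P == T so that swapping its first two axes is well defined.
        Before every self-attention layer the cube's frame and prompt axes
        are swapped ("Prompt Switch"), so information gathered inside one
        frame is redistributed across all frames at the next layer."""
        assert prompt_cube.shape[0] == prompt_cube.shape[1], "cube must be square in T and P"
        P = prompt_cube.shape[1]
        for attn in attn_layers:
            prompt_cube = prompt_cube.transpose(0, 1)          # the switch
            x = torch.cat([prompt_cube, frame_tokens], dim=1)  # per-frame sequence
            x, _ = attn(x, x, x)                               # intra-frame self-attention
            prompt_cube, frame_tokens = x[:, :P], x[:, P:]
        return frame_tokens

    T, N, P, D = 4, 16, 4, 32
    frames, cube = torch.randn(T, N, D), torch.randn(T, P, D)
    layers = [torch.nn.MultiheadAttention(D, num_heads=4, batch_first=True)
              for _ in range(2)]
    print(prompt_switch(frames, cube, layers).shape)  # torch.Size([4, 16, 32])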
2404.05607 A Training-Free Plug-and-Play Watermark Framework for Stable Diffusion Guokai Zhang, Lanjun Wang, Yuting Su, An-An Liu Nowadays, the family of Stable Diffusion (SD) models has gained prominence for its high quality outputs and scalability. This has also raised security concerns on social media, as malicious users can create and disseminate harmful content. Existing approaches involve training components or entire SDs to embed a watermark in generated images for traceability and responsibility attribution. However, in the era of AI-generated content (AIGC), the rapid iteration of SDs renders retraining with watermark models costly. To address this, we propose a training-free plug-and-play watermark framework for SDs. Without modifying any components of SDs, we embed diverse watermarks in the latent space, adapting to the denoising process. Our experimental findings reveal that our method effectively harmonizes image quality and watermark invisibility. Furthermore, it performs robustly under various attacks. We also have validated that our method is generalized to multiple versions of SDs, even without retraining the watermark model. This paper introduces a training-free, plug-and-play watermarking framework for Stable Diffusion models, enabling the embedding of diverse watermarks in the latent space without requiring any retraining of the SD model itself. The paper addresses the growing concern of misuse of AI-generated content, particularly with the rapid evolution of SD models. The proposed framework provides a cost-efficient and adaptable solution for watermarking, ensuring traceability and responsibility attribution for generated images. The authors develop a watermark encoder-decoder architecture trained solely on the frozen VAE encoder-decoder component of SD. During inference, the compressed watermark is embedded into the latent code after denoising, minimizing impact on image quality. The framework's generalization ability is analyzed, and extensive experiments are conducted to evaluate its performance on various SD versions and under different attacks. The proposed framework demonstrates excellent watermark invisibility, achieving high PSNR and SSIM scores while minimally affecting image quality (even showing slight FID improvement). The watermark extraction quality is high, with NC exceeding 96%. The framework exhibits strong generalization across different SD versions (v1-1, v1-4, v1-5) without retraining and shows robustness against common image manipulations like blurring, cropping, and noise addition. The authors acknowledge limitations in handling high-angle rotations due to the watermark's spatial dependence. Future work could explore rotation-invariant watermarking techniques. Additionally, while the framework minimizes noticeable artifacts, some localized pixel variations might occur in specific samples, requiring further investigation. diffusion_model, stable diffusion, watermarking, training-free, plug-and-play, aigc, image_generation, robustness, latent_space
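The training-free embedding step described above amounts to adding an encoded watermark to the denoised latent just before VAE decoding. The sketch below uses hypothetical stubs (WatermarkEncoder, vae_decode) for the watermark encoder and the frozen Stable Diffusion VAE; only the control flow reflects the description.

    import torch

    class WatermarkEncoder(torch.nn.Module):
        """Hypothetical encoder mapping a bit-string watermark into latent space."""
        def __init__(self, n_bits=48, latent_shape=(4, 64, 64)):
            super().__init__()
            self.proj = torch.nn.Linear(n_bits, int(torch.tensor(latent_shape).prod()))
            self.latent_shape = latent_shape
        def forward(self, bits):
            # Keep the perturbation small so image quality is barely affected.
            return 0.05 * self.proj(bits).view(-1, *self.latent_shape)

    def embed_watermark(denoised_latent, bits, encoder, vae_decode):
        """Add the compressed watermark to the latent produced by the
        (unmodified) SD denoising loop, then decode with the frozen VAE."""
        return vae_decode(denoised_latent + encoder(bits))

    vae_decode = lambda z: torch.tanh(z.mean(1, keepdim=True))  # stand-in for SD's VAE decoder
    z = torch.randn(1, 4, 64, 64)                               # latent after denoising
    bits = torch.randint(0, 2, (1, 48)).float()
    encoder = WatermarkEncoder()
    print(embed_watermark(z, bits, encoder, vae_decode).shape)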
2309.04372 MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers Sijia Li, Chen Chen, Haonan Lu Diffusion-model-based text-guided image generation has recently made astounding progress, producing fascinating results in open-domain image manipulation tasks. Few models, however, currently have complete zero-shot capabilities for both global and local image editing due to the complexity and diversity of image manipulation tasks. In this work, we propose a method with a mixture-of-expert (MOE) controllers to align the text-guided capacity of diffusion models with different kinds of human instructions, enabling our model to handle various open-domain image manipulation tasks with natural language instructions. First, we use large language models (ChatGPT) and conditional image synthesis models (ControlNet) to generate a large number of global image transfer dataset in addition to the instruction-based local image editing dataset. Then, using an MOE technique and task-specific adaptation training on a large-scale dataset, our conditional diffusion model can edit images globally and locally. Extensive experiments demonstrate that our approach performs surprisingly well on various image manipulation tasks when dealing with open-domain images and arbitrary human instructions. Please refer to our project page: [https://oppo-mente-lab.github.io/moe_controller/] This paper introduces MoEController, a novel method for arbitrary image manipulation guided by text instructions, tackling the challenge of performing both global and local image editing in a unified framework. This paper is important as it addresses the limitations of existing image manipulation methods that struggle to effectively handle both global and local edits based on open-domain text instructions. It proposes a novel approach using a mixture-of-expert (MOE) framework to enhance the model's adaptability to diverse image manipulation tasks. The authors first create a large-scale dataset for global image manipulation using ChatGPT to generate target captions and ControlNet to generate image pairs. They then design an MOE model with a fusion module, multiple expert models, and a gate system to discriminate between different instruction semantics and adapt to specific tasks. The model is trained with a reconstruction loss to ensure image entity consistency. MoEController demonstrates superior performance in both global and local image manipulation tasks compared to existing methods. It effectively handles complex style transfers, local edits, and object manipulations. Quantitative evaluations using CLIP metrics and user studies confirm its effectiveness and adaptability to open-domain instructions. The authors suggest extending MoEController to handle a wider range of human instructions and more complex image manipulation tasks in the future. Further exploration of expert model design and optimization of the gating mechanism could further improve performance. diffusion_model, image_manipulation, llm, moe, controlnet, chatgpt, global_editing, local_editing
2404.06139 DiffHarmony: Latent Diffusion Model Meets Image Harmonization Pengfei Zhou, Fangxiang Feng, Xiaojie Wang Image harmonization, which involves adjusting the foreground of a composite image to attain a unified visual consistency with the background, can be conceptualized as an image-to-image translation task. Diffusion models have recently promoted the rapid development of image-to-image translation tasks . However, training diffusion models from scratch is computationally intensive. Fine-tuning pre-trained latent diffusion models entails dealing with the reconstruction error induced by the image compression autoencoder, making it unsuitable for image generation tasks that involve pixel-level evaluation metrics. To deal with these issues, in this paper, we first adapt a pre-trained latent diffusion model to the image harmonization task to generate the harmonious but potentially blurry initial images. Then we implement two strategies: utilizing higher-resolution images during inference and incorporating an additional refinement stage, to further enhance the clarity of the initially harmonized images. Extensive experiments on iHarmony4 datasets demonstrate the superiority of our proposed method. The code and model will be made publicly available at https://github.com/nicecv/DiffHarmony . This paper introduces DiffHarmony, a novel image harmonization method that leverages a pre-trained latent diffusion model (Stable Diffusion) to generate harmonious images, enhanced by higher-resolution inference and a refinement stage to mitigate image distortion caused by the inherent compression in latent diffusion models. This paper is significant because it addresses the limitations of applying pre-trained latent diffusion models to image harmonization, particularly the reconstruction errors due to image compression. It offers a novel approach to achieve state-of-the-art results on image harmonization tasks by effectively adapting and enhancing the capabilities of pre-trained latent diffusion models. The authors adapt a pre-trained Stable Diffusion model for image harmonization by incorporating composite images and foreground masks as input conditions. To mitigate image distortion, they employ two strategies: using higher-resolution images during inference and adding a refinement stage using a UNet model. The method is evaluated on the iHarmony4 dataset using PSNR, MSE, and fMSE metrics and compared with other state-of-the-art methods. DiffHarmony achieves state-of-the-art results on the iHarmony4 dataset, demonstrating the effectiveness of the proposed approach. Notably, the method excels in harmonizing images with larger foreground regions. Higher-resolution inference significantly improves performance, and the refinement stage further enhances the quality of generated images. Additionally, the authors conducted an ablation study to analyze the contribution of each component and performed an advanced analysis comparing their method with a state-of-the-art model trained on higher-resolution images. The authors acknowledge that their method's performance on images with small foreground regions requires further investigation. Future work could explore using even higher image resolutions or employing better pre-trained diffusion models to address the limitations of information compression. Additionally, exploring alternative refinement techniques or more advanced network architectures for the refinement stage could lead to further improvements. 
diffusion_model, image_harmonization, stable_diffusion, image_generation, refinement, vae, image_distortion, high_resolution
2312.03701 Return of Unconditional Generation: A Self-supervised Representation Generation Method Tianhong Li, Dina Katabi, Kaiming He Unconditional generation -- the problem of modeling data distribution without relying on human-annotated labels -- is a long-standing and fundamental challenge in generative models, creating a potential of learning from large-scale unlabeled data. In the literature, the generation quality of an unconditional method has been much worse than that of its conditional counterpart. This gap can be attributed to the lack of semantic information provided by labels. In this work, we show that one can close this gap by generating semantic representations in the representation space produced by a self-supervised encoder. These representations can be used to condition the image generator. This framework, called Representation-Conditioned Generation (RCG), provides an effective solution to the unconditional generation problem without using labels. Through comprehensive experiments, we observe that RCG significantly improves unconditional generation quality: e.g., it achieves a new state-of-the-art FID of 2.15 on ImageNet 256x256, largely reducing the previous best of 5.91 by a relative 64%. Our unconditional results are situated in the same tier as the leading class-conditional ones. We hope these encouraging observations will attract the community's attention to the fundamental problem of unconditional generation. Code is available at https://github.com/LTH14/rcg. This paper introduces Representation-Conditioned Generation (RCG), a novel framework for unconditional image generation that leverages self-supervised representations to guide the generation process, effectively closing the quality gap between unconditional and conditional generation. This paper is important because it addresses the long-standing challenge of poor-quality unconditional image generation compared to conditional methods. It proposes a method to leverage large-scale unlabeled datasets for training high-quality generative models by effectively utilizing self-supervised representations. The authors propose a three-stage approach: 1) a pre-trained self-supervised encoder maps images to a representation space; 2) a lightweight diffusion model learns to generate representations within this space; 3) a conditional image generator (e.g., ADM, DiT, or MAGE) generates images conditioned on these representations. RCG significantly improves unconditional generation quality across various image generators and datasets. It achieves state-of-the-art FID scores on ImageNet 256x256, surpassing previous unconditional methods and rivaling leading class-conditional methods. RCG also enables guidance in unconditional generation, further boosting performance. The method allows semantic interpolation by manipulating representations and can be easily extended to class-conditional generation. The paper mentions that while RCG excels in generating diverse and high-quality images, it still faces challenges in generating text, regular shapes, and realistic humans, similar to other ImageNet generative models. Future work could explore pre-training on larger unlabeled datasets and adapting to various downstream generative tasks with minimal overhead by training only the representation generator on small labeled datasets. diffusion_model, gan, unconditional generation, self-supervised representation, image generation, representation learning
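RCG's sampling path separates "choose a representation" from "render an image from it". Below is a stub-level sketch of that path under assumed interfaces; RepDiffusion and PixelGenerator are hypothetical placeholders for the representation diffusion model and the representation-conditioned generator (ADM, DiT, or MAGE in the paper).

    import torch

    class RepDiffusion(torch.nn.Module):
        """Hypothetical lightweight diffusion model over the low-dimensional
        self-supervised representation space (e.g. MoCo v3 features)."""
        def __init__(self, dim=256):
            super().__init__()
            self.dim = dim
        @torch.no_grad()
        def sample(self, n):
            return torch.randn(n, self.dim)  # placeholder for iterative denoising

    class PixelGenerator(torch.nn.Module):
        """Hypothetical image generator conditioned on a representation
        instead of a class label."""
        @torch.no_grad()
        def sample(self, rep):
            return torch.rand(rep.shape[0], 3, 256, 256)  # placeholder images

    def rcg_sample(n_images):
        rep_diffusion = RepDiffusion()
        pixel_generator = PixelGenerator()
        reps = rep_diffusion.sample(n_images)   # stage 2: sample a representation
        images = pixel_generator.sample(reps)   # stage 3: render images from it
        return images

    print(rcg_sample(2).shape)  # torch.Size([2, 3, 256, 256])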
2403.15378 Long-CLIP: Unlocking the Long-Text Capability of CLIP Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of text input. The length of the text token is restricted to 77, and an empirical study shows the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications for image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses its zero-shot generalizability, and aligns the CLIP latent space, making it readily replace CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model supporting longer contexts necessitates pretraining with vast amounts of data, incurring significant expenses. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain the original capabilities, including (1) a knowledge-preserved stretching of positional embedding and (2) a primary component matching of CLIP features. With leveraging just one million extra long text-image pairs, Long-CLIP has shown the superiority to CLIP for about 20% in long caption text-image retrieval and 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner. The paper introduces Long-CLIP, an enhanced version of CLIP designed to address the limitation of short text input in the original model, enabling it to process longer and more detailed textual descriptions while retaining its zero-shot generalization capabilities. This work is important as it enables CLIP to handle detailed descriptions, thereby broadening its applicability in image retrieval, text-to-image generation, and other tasks requiring comprehensive textual understanding. This advancement holds the potential to significantly enhance the performance and versatility of CLIP-based applications. The authors propose two novel strategies: (1) Knowledge-Preserved Stretching: interpolating the positional embedding of less-trained positions while preserving the well-trained ones to support longer text input without disrupting the representation of short text positions; (2) Primary Component Matching: aligning both fine-grained image features with long captions and coarse-grained features (extracted using PCA) with short summary captions during fine-tuning to enable the model to capture detailed attributes and understand their importance. Long-CLIP is fine-tuned on the ShareGPT4V dataset, which contains image-long caption pairs. Long-CLIP demonstrates superior performance compared to the original CLIP in various tasks, including: (1) Long-text image retrieval: It significantly improves the recall rate by approximately 20% on datasets like ShareGPT4V and Urban. 
(2) Short-text image retrieval: It also shows improvement on benchmarks like COCO and Flickr30k. (3) Zero-shot classification: It retains comparable performance to CLIP on ImageNet and CIFAR. (4) Text-to-image generation: It exhibits a plug-and-play effect, enabling existing models like Stable Diffusion to generate images from detailed descriptions without additional training. The paper acknowledges that Long-CLIP, despite its improvements, still has a finite input length limit, although significantly extended. Future work could explore relative positional embeddings like RoPE to potentially overcome this limitation. Additionally, the authors suggest exploring the scaling-up potential of Long-CLIP by training with a larger dataset of long text-image pairs, as the current work only utilizes a relatively small portion of the ShareGPT4V dataset. diffusion_model, clip, analysis, image_retrieval, text-to-image_generation, interpretability
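Long-CLIP's knowledge-preserved stretching can be sketched directly: keep the first, well-trained positional embeddings fixed and interpolate only the rest to a longer length. The numbers below (77 original positions, the first 20 kept, 248 total) are assumptions for illustration; the exact split and target length follow the paper.

    import torch
    import torch.nn.functional as F

    def stretch_positional_embedding(pos_emb, keep=20, new_len=248):
        """pos_emb: (77, D) CLIP text positional embeddings.
        Keep the first `keep` (well-trained) positions as-is and linearly
        interpolate the remaining, less-trained positions up to new_len."""
        kept, rest = pos_emb[:keep], pos_emb[keep:]       # (keep, D), (77-keep, D)
        rest = rest.t().unsqueeze(0)                      # (1, D, 77-keep)
        stretched = F.interpolate(rest, size=new_len - keep,
                                  mode="linear", align_corners=True)
        stretched = stretched.squeeze(0).t()              # (new_len-keep, D)
        return torch.cat([kept, stretched], dim=0)        # (new_len, D)

    pos = torch.randn(77, 512)
    long_pos = stretch_positional_embedding(pos)
    print(long_pos.shape)  # torch.Size([248, 512])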
2402.19427 Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training. This paper introduces Hawk and Griffin, two novel recurrent neural network architectures for language modeling that address the scalability limitations of traditional RNNs and offer advantages over Transformers on tasks involving long sequences. This paper is important as it presents a potential solution to the long-standing challenge of efficiently scaling RNNs for language modeling. The proposed models, Hawk and Griffin, demonstrate competitive performance with Transformers while exhibiting superior efficiency in handling long sequences, which is crucial for various NLP tasks. The authors developed Hawk, a pure RNN model based on the novel Real-Gated Linear Recurrent Unit (RG-LRU), and Griffin, a hybrid model combining RG-LRU with local attention. They conducted scaling experiments, training these models on the MassiveText dataset with up to 300B tokens, comparing their performance to Transformer baselines and state-of-the-art models like Mamba and Llama-2. They analyzed training efficiency on TPUs, inference speed, and capabilities in handling long contexts and performing tasks like copying and retrieval. Hawk and Griffin demonstrated power-law scaling in training, matching the efficiency of Transformers. Hawk-3B outperformed Mamba-3B on downstream tasks despite being trained on half the data, and Griffin-7B and Griffin-14B achieved comparable results to Llama-2 with significantly less training data. They also exhibited faster inference, especially on longer sequences, due to their smaller memory footprint compared to Transformers. Notably, both models showed superior performance in extrapolating to longer sequences than those seen during training. The authors acknowledge that while Griffin shows promise in copying and retrieval tasks, more research is needed to match the performance of Transformers in this domain, particularly when evaluating pre-trained models without fine-tuning. Future work could also involve exploring different local attention window sizes for Griffin, potentially dynamically adjusting them based on sequence length and hardware constraints. rnn, transformer, language_model, long_sequence, efficiency, inference, scaling, local_attention, copying, retrieval
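The gated linear recurrence at the heart of Hawk and Griffin can be written as a per-channel scan. The parameterisation below follows the general shape of the RG-LRU described above (input and recurrence gates, a fixed decay raised to a gated power) but is an approximation with assumed weight shapes, not the released model.

    import torch

    def rg_lru_scan(x, w_in, w_rec, log_decay, c=8.0):
        """x: (T, D) input sequence. Returns (T, D) hidden states.
        Diagonal (per-channel) gated linear recurrence:
            i_t = sigmoid(x_t @ w_in)        # input gate
            r_t = sigmoid(x_t @ w_rec)       # recurrence gate
            a_t = exp(c * r_t * log_decay)   # gated decay in (0, 1)
            h_t = a_t * h_{t-1} + sqrt(1 - a_t**2) * (i_t * x_t)
        """
        T, D = x.shape
        h = torch.zeros(D)
        out = []
        for t in range(T):
            i_t = torch.sigmoid(x[t] @ w_in)
            r_t = torch.sigmoid(x[t] @ w_rec)
            a_t = torch.exp(c * r_t * log_decay)       # log_decay < 0
            h = a_t * h + torch.sqrt(1 - a_t**2) * (i_t * x[t])
            out.append(h)
        return torch.stack(out)

    T, D = 16, 8
    x = torch.randn(T, D)
    w_in, w_rec = torch.randn(D, D) * 0.1, torch.randn(D, D) * 0.1
    log_decay = -torch.rand(D) * 0.1                   # learned in practice
    print(rg_lru_scan(x, w_in, w_rec, log_decay).shape)  # torch.Size([16, 8])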
2402.13929 SDXL-Lightning: Progressive Adversarial Diffusion Distillation Shanchuan Lin, Anran Wang, Xiao Yang We propose a diffusion distillation method that achieves new state-of-the-art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial distillation to achieve a balance between quality and mode coverage. In this paper, we discuss the theoretical analysis, discriminator design, model formulation, and training techniques. We open-source our distilled SDXL-Lightning models both as LoRA and full UNet weights. This paper introduces SDXL-Lightning, a novel diffusion distillation method that achieves state-of-the-art performance in one-step/few-step 1024px text-to-image generation based on SDXL. This paper is important because it addresses the limitations of existing diffusion models in generating high-quality images with few inference steps, offering a significant speed and computational advantage over previous methods. The authors propose a progressive adversarial diffusion distillation method. The approach combines progressive distillation with an adversarial loss function and uses a pre-trained diffusion UNet encoder as the discriminator backbone, enabling efficient distillation in latent space. The method progressively distills the model from 128 steps to 1 step, using both conditional and unconditional adversarial objectives to balance image quality and mode coverage. The resulting SDXL-Lightning models achieve state-of-the-art performance in one-step/few-step 1024px text-to-image generation, exceeding the quality of previous methods like SDXL-Turbo and LCM. The models demonstrate superior high-resolution detail preservation while maintaining comparable text alignment and diversity. Notably, they even surpass the original SDXL model in quality for 4-step and 8-step generation. The paper acknowledges limitations, including the need for separate checkpoints for different inference steps and the potential for further improvement in the UNet architecture for one-step generation. Future work could explore distilling models with multiple aspect ratios and researching optimal architectures for one-step generation. diffusion_model, gan, distillation, text-to-image, adversarial_training, image_generation, sdxl
2311.04897 Future Lens: Anticipating Subsequent Tokens from a Single Hidden State Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C. Wallace, David Bau We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead. More concretely, in this paper we ask: Given a hidden (internal) representation of a single token at position $t$ in an input, can we reliably anticipate the tokens that will appear at positions $\geq t + 2$? To test this, we measure linear approximation and causal intervention methods in GPT-J-6B to evaluate the degree to which individual hidden states in the network contain signal rich enough to predict future hidden states and, ultimately, token outputs. We find that, at some layers, we can approximate a model's output with more than 48% accuracy with respect to its prediction of subsequent tokens through a single hidden state. Finally we present a "Future Lens" visualization that uses these methods to create a new view of transformer states. This paper investigates whether hidden state vectors in large language models (LLMs) encode information sufficient to predict multiple tokens ahead, going beyond the typical one-token prediction. This research is significant because it probes the depth of information encoded within individual hidden states of LLMs, potentially revealing a deeper understanding of how these models process and retain information over longer spans of text. The authors test their hypothesis by employing three methods on GPT-J-6B: (1) training linear models to approximate future hidden states and decode them, (2) conducting causal intervention by transplanting hidden states to different contexts, and (3) training a "soft prompt" to optimize the extraction of subsequent token information from a hidden state. The study finds that individual hidden states, especially in middle layers, contain significant information about future tokens, going beyond immediate next-token predictions. Notably, the "learned prompt causal intervention" method achieves the highest accuracy in predicting subsequent tokens, even surpassing a bigram baseline. The authors acknowledge limitations regarding the training data size, the focus on a single LLM (GPT-J-6B), the lack of prior baselines for this specific task, and the limitation of predicting up to four tokens ahead. Future work could explore larger datasets, other LLMs, alternative baseline models (e.g., RNNs, Non-Autoregressive generation), and extend the prediction horizon beyond four tokens. llm, analysis, interpretability, transformer, hidden_state, causal_intervention
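The simplest of the paper's three probes is a linear map from a hidden state at position t to the model's output for position t+2. The sketch below shows that probe on random tensors with toy sizes; in the paper it is trained on GPT-J-6B hidden states, which are only alluded to here in a comment.

    import torch
    import torch.nn.functional as F

    # Toy sizes for illustration; GPT-J-6B uses hidden size 4096 and a ~50k vocabulary.
    hidden_size, vocab_size, n_examples = 1024, 2048, 256

    # Stand-ins for (hidden state of the token at position t, layer l)
    # and the ground-truth token at position t + 2.
    hidden_states = torch.randn(n_examples, hidden_size)
    future_tokens = torch.randint(0, vocab_size, (n_examples,))

    probe = torch.nn.Linear(hidden_size, vocab_size)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

    for _ in range(50):  # a few illustrative optimisation steps
        logits = probe(hidden_states)          # predict token t+2 from the state at t
        loss = F.cross_entropy(logits, future_tokens)
        opt.zero_grad()
        loss.backward()
        opt.step()

    acc = (probe(hidden_states).argmax(-1) == future_tokens).float().mean()
    print(f"probe accuracy on toy data: {acc:.3f}")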
2311.18158 HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation Yifan Zhang, Bryan Hooi Diffusion models have revolutionized text-to-image generation, but their real-world applications are hampered by the extensive time needed for hundreds of diffusion steps. Although progressive distillation has been proposed to speed up diffusion sampling to 2-8 steps, it still falls short in one-step generation, and necessitates training multiple student models, which is highly parameter-extensive and time-consuming. To overcome these limitations, we introduce High-frequency-Promoting Adaptation (HiPA), a parameter-efficient approach to enable one-step text-to-image diffusion. Grounded in the insight that high-frequency information is essential but highly lacking in one-step diffusion, HiPA focuses on training one-step, low-rank adaptors to specifically enhance the under-represented high-frequency abilities of advanced diffusion models. The learned adaptors empower these diffusion models to generate high-quality images in just a single step. Compared with progressive distillation, HiPA achieves much better performance in one-step text-to-image generation (37.3 $\rightarrow$ 23.8 in FID-5k on MS-COCO 2017) and 28.6x training speed-up (108.8 $\rightarrow$ 3.8 A100 GPU days), requiring only 0.04% training parameters (7,740 million $\rightarrow$ 3.3 million). We also demonstrate HiPA's effectiveness in text-guided image editing, inpainting and super-resolution tasks, where our adapted models consistently deliver high-quality outputs in just one diffusion step. The source code will be released. This paper introduces High-frequency-Promoting Adaptation (HiPA), a parameter-efficient method for accelerating text-to-image diffusion models to generate high-quality images in a single step by training low-rank adaptors that enhance the model's ability to generate high-frequency details. This paper is important because it addresses the limitations of existing text-to-image diffusion models, which require many diffusion steps and thus extensive processing time. HiPA provides a solution for real-time applications that rely on fast and high-quality image generation. The authors analyze the image generation process of existing diffusion models and find that one-step diffusion lacks high-frequency details crucial for realistic image synthesis. They then propose HiPA, which trains low-rank adaptors using a novel adaptation loss that combines a spatial perceptual loss and a high-frequency promoted loss. This approach encourages the model to generate images with enhanced high-frequency details in just one step. HiPA significantly outperforms previous one-step and few-step methods in terms of both image quality and training efficiency. Experiments on MS-COCO datasets demonstrate that HiPA achieves comparable results to multi-step diffusion models while being significantly faster. The method is also successfully applied to text-guided image editing, inpainting, and super-resolution, demonstrating its versatility for various real-world image generation tasks. The authors acknowledge that while HiPA significantly improves one-step generation, there is still room for further enhancement in image quality compared to multi-step diffusion models. They suggest exploring the adaptation of more advanced diffusion models, such as SD-XL and DALL-E3, as a future direction. 
Another limitation is the occasional presence of artifacts in the generated images, which the authors attribute, in part, to limitations inherited from the original multi-step models. As a potential solution, they propose using HiPA for generating quick drafts and then refining them using the original multi-step model for higher quality. diffusion_model, text-to-image, one-step generation, high-frequency, parameter-efficient, low-rank adaptation, image editing, inpainting, super-resolution
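A plausible form of HiPA's high-frequency-promoted loss compares FFT high-pass-filtered versions of a one-step sample and a multi-step reference. The cutoff and filter shape below are assumptions; only the idea of penalising missing high-frequency content is taken from the description above.

    import torch

    def high_frequency_loss(pred, target, cutoff=0.1):
        """pred, target: (B, C, H, W) images or latents.
        Zero out low frequencies below `cutoff` (as a fraction of the spectrum)
        and penalise the difference of the remaining high-frequency content."""
        def high_pass(x):
            spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
            _, _, H, W = x.shape
            yy, xx = torch.meshgrid(torch.linspace(-1, 1, H),
                                    torch.linspace(-1, 1, W), indexing="ij")
            mask = ((xx**2 + yy**2).sqrt() > cutoff).to(x.dtype)  # keep high freqs
            return spec * mask
        return (high_pass(pred) - high_pass(target)).abs().mean()

    pred = torch.rand(1, 3, 64, 64)    # e.g. a one-step sample
    target = torch.rand(1, 3, 64, 64)  # e.g. a multi-step "teacher" sample
    print(high_frequency_loss(pred, target).item())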
2311.17002 Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou Existing text-to-image (T2I) diffusion models usually struggle in interpreting complex prompts, especially those with quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as the middleware in decoding texts to images, supporting the generator to better follow instructions. The panel is obtained through arranging the visual concepts parsed from the input text by the aid of large language models, and then injected into the denoising network as a detailed control signal to complement the text condition. To facilitate text-to-panel learning, we come up with a carefully designed semantic formatting protocol, accompanied by a fully-automatic data preparation pipeline. Thanks to such a design, our approach, which we call Ranni, manages to enhance a pre-trained T2I generator regarding its textual controllability. More importantly, the introduction of the generative middleware brings a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and further allows users to finely customize their generation, based on which we develop a practical system and showcase its potential in continuous generation and chatting-based editing. Our project page is at https://ranni-t2i.github.io/Ranni. This paper introduces Ranni, a text-to-image generation framework that enhances the controllability and accuracy of existing diffusion models by using a 'semantic panel' as a structured intermediary representation between text prompts and images. This paper is important because it addresses the limitations of current text-to-image models in interpreting complex prompts by introducing a novel semantic panel that facilitates better text-image alignment and offers more intuitive editing capabilities. The authors propose Ranni, a framework that leverages Large Language Models (LLMs) to translate text prompts into a structured 'semantic panel' containing visual concepts with attributes like bounding boxes, colors, and keypoints. This panel then guides a diffusion model to generate images that adhere more closely to the input text. They also introduce an automatic data preparation pipeline and conduct experiments on various prompts to evaluate Ranni's ability to follow instructions related to quantity, spatial relationships, attribute binding, and multiple objects. Ranni demonstrates superior performance in following complex prompts compared to existing methods, particularly in terms of quantity awareness and spatial relationship understanding. It also shows promise as a unified image creation system, enabling interactive editing through manual manipulation or LLM-driven instructions in a chat-based interface. The authors identify limitations, such as occasional inaccuracies in the initial semantic panel generation and the need for further exploration in controlling object appearance beyond bounding boxes. Future work could focus on improving the precision of the semantic panel, exploring alternative LLM architectures, and expanding the range of controllable attributes for enhanced editing capabilities. diffusion_model, llm, text-to-image, controllable_generation, semantic_panel, interactive_editing, chat-based_generation
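Ranni's semantic panel is, in effect, a structured scene description that the LLM fills in and the diffusion model consumes as a control signal. One plausible shape for such a panel as plain Python data is sketched below; the attribute set (boxes, colors, keypoints) follows the paper, but the field names are assumptions.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class PanelItem:
        """One visual concept parsed from the prompt by the LLM."""
        name: str                                   # e.g. "corgi"
        bbox: Tuple[float, float, float, float]     # normalized x0, y0, x1, y1
        color: str = ""                             # dominant color description
        keypoints: List[Tuple[float, float]] = field(default_factory=list)

    # Semantic panel for "two corgis on a red sofa", as an LLM might produce it.
    panel = [
        PanelItem(name="corgi", bbox=(0.05, 0.45, 0.45, 0.90), color="tan"),
        PanelItem(name="corgi", bbox=(0.55, 0.45, 0.95, 0.90), color="tan"),
        PanelItem(name="sofa",  bbox=(0.00, 0.35, 1.00, 1.00), color="red"),
    ]

    # Editing then means manipulating the panel rather than rewriting the prompt:
    panel[0].bbox = (0.05, 0.40, 0.40, 0.85)   # nudge the first corgi upward
    for item in panel:
        print(item)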
2405.08246 Compositional Text-to-Image Generation with Dense Blob Representations Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives - denoted as dense blob representations - that contain fine-grained details of the scene while being modular, human-interpretable, and easy-to-construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. Particularly, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page: https://blobgen-2d.github.io. This paper introduces BlobGEN, a text-to-image generation model that uses dense blob representations as grounding input to improve controllability and compositionality. This paper addresses the limitations of existing text-to-image models in following complex prompts and offers a modular, user-friendly approach to control image generation by decomposing scenes into semantically rich visual primitives. The authors propose dense blob representations, consisting of blob parameters (specifying location, size, orientation) and blob descriptions (text describing appearance), extracted using existing segmentation and captioning models. They develop a blob-grounded diffusion model with a novel masked cross-attention module to align blobs with corresponding visual features. Additionally, they introduce an in-context learning approach for LLMs to generate blob representations from text, enabling compositional generation. BlobGEN achieves superior zero-shot generation quality on MS-COCO, showing lower FID scores compared to baseline models. It exhibits strong layout-guided controllability, evidenced by higher region-level CLIP scores and successful object editing and repositioning capabilities. When augmented with LLMs, BlobGEN excels in compositional generation tasks, surpassing LayoutGPT in numerical and spatial accuracy on the NSR-1K benchmark. Limitations include the inability to perfectly reconstruct images solely from blobs, occasional failures in image editing, and robustness issues with LLM-generated blobs in compositional tasks. Future work could explore combining inversion methods for better reconstruction, advanced editing techniques to reduce editing failures, and improving the integration between LLMs and blob-grounded generation. diffusion_model, llm, compositional image generation, layout-guided generation, blob representation, masked cross-attention, zero-shot generation, image editing
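The masked cross-attention module restricts each visual location to attend only to the blobs that cover it. A minimal single-layer sketch with assumed shapes and no learned projections is shown below; it illustrates the masking pattern rather than the paper's exact module.

    import torch
    import torch.nn.functional as F

    def masked_cross_attention(visual, blob_feats, blob_masks):
        """visual:     (N, D) flattened visual features (N spatial locations).
        blob_feats: (K, D) one text-derived feature per blob.
        blob_masks: (K, N) 1 where location n lies inside blob k, else 0.
        Each visual location attends only to the blobs that cover it."""
        D = visual.shape[-1]
        scores = visual @ blob_feats.t() / D**0.5                 # (N, K)
        scores = scores.masked_fill(blob_masks.t() == 0, float("-inf"))
        covered = blob_masks.t().sum(-1, keepdim=True) > 0        # (N, 1)
        # Locations covered by no blob receive no update at all.
        attn = torch.where(covered, F.softmax(scores, dim=-1),
                           torch.zeros_like(scores))
        return visual + attn.nan_to_num() @ blob_feats            # residual update

    N, K, D = 64 * 64, 3, 32
    visual = torch.randn(N, D)
    blob_feats = torch.randn(K, D)
    blob_masks = (torch.rand(K, N) > 0.7).float()
    print(masked_cross_attention(visual, blob_feats, blob_masks).shape)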
2404.07724 Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, Jaakko Lehtinen Guidance is a crucial technique for extracting the best performance out of image-generating diffusion models. Traditionally, a constant guidance weight has been applied throughout the sampling chain of an image. We show that guidance is clearly harmful toward the beginning of the chain (high noise levels), largely unnecessary toward the end (low noise levels), and only beneficial in the middle. We thus restrict it to a specific range of noise levels, improving both the inference speed and result quality. This limited guidance interval improves the record FID in ImageNet-512 significantly, from 1.81 to 1.40. We show that it is quantitatively and qualitatively beneficial across different sampler parameters, network architectures, and datasets, including the large-scale setting of Stable Diffusion XL. We thus suggest exposing the guidance interval as a hyperparameter in all diffusion models that use guidance. This paper introduces a novel technique for enhancing image generation in diffusion models by strategically limiting the application of classifier-free guidance (CFG) to a specific interval of noise levels during the sampling process. This research is significant as it addresses the sub-optimal performance of traditional CFG, which applies a constant guidance weight throughout the sampling process, leading to limitations in image quality and inference speed. By confining CFG to a specific noise level range, this technique allows for higher guidance weights, resulting in substantial improvements in image fidelity and a reduction in computational cost. The authors begin by analyzing the impact of CFG at different noise levels using both a theoretical framework and empirical observations. They demonstrate that CFG is detrimental at high noise levels, largely unnecessary at low levels, and most beneficial in the middle stages of the sampling chain. Based on this insight, they propose a modified ODE for diffusion model sampling where guidance is applied only within a specific noise level interval. They evaluate their approach using quantitative metrics (FID, FDDINO) on ImageNet-512 and provide a qualitative analysis of the generated images using both ImageNet and Stable Diffusion XL. Ablation studies are performed to demonstrate the impact of varying guidance intervals and weights. The proposed method achieves state-of-the-art FID scores on ImageNet-512, surpassing previous records by a significant margin. Notably, with their method, FID improves from 2.23 to 1.68 using EDM2-S and from 1.81 to 1.40 using EDM2-XXL. Qualitative results demonstrate that limiting the guidance interval preserves image diversity and reduces color saturation artifacts commonly observed with high guidance weights in standard CFG. The technique is shown to be effective across different sampler parameters, network architectures, and datasets, including Stable Diffusion XL. The authors acknowledge that while their method significantly improves performance, future work could explore automatically determining the optimal guidance interval directly from the ODE. Additionally, further research is needed to understand the role of non-ideal, trained denoisers in the context of this technique. diffusion_model, image_generation, classifier-free_guidance, sampling, fid, imagenet, stable diffusion xl
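The proposal amounts to a one-line change in a classifier-free-guidance sampler: blend the conditional and unconditional predictions only when the current noise level lies inside a chosen interval. A schematic sketch with a hypothetical denoise stub; the interval and weight below are placeholders, which the paper tunes per model and dataset.

    import numpy as np

    def denoise(x, sigma, prompt=None):
        """Hypothetical denoiser D(x; sigma, prompt); returns a clean-image estimate."""
        return x / (1.0 + sigma)  # placeholder dynamics only

    def guided_denoise(x, sigma, prompt, weight=4.0, interval=(0.3, 2.0)):
        """Classifier-free guidance applied only inside a noise-level interval."""
        d_cond = denoise(x, sigma, prompt)
        lo, hi = interval
        if not (lo <= sigma <= hi):
            return d_cond                        # plain conditional sampling
        d_uncond = denoise(x, sigma, None)
        return d_uncond + weight * (d_cond - d_uncond)

    # Walk a toy sigma schedule from high to low noise.
    x = np.random.randn(4, 4).astype(np.float32)
    for sigma in [10.0, 5.0, 1.0, 0.5, 0.1]:
        x = guided_denoise(x, sigma, "a photo of a cat")
    print(x.shape)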
2310.11868 To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu The recent advances in diffusion models (DMs) have revolutionized the generation of realistic and complex images. However, these models also introduce potential safety hazards, such as producing harmful content and infringing data copyrights. Despite the development of safety-driven unlearning techniques to counteract these challenges, doubts about their efficacy persist. To tackle this issue, we introduce an evaluation framework that leverages adversarial prompts to discern the trustworthiness of these safety-driven DMs after they have undergone the process of unlearning harmful concepts. Specifically, we investigated the adversarial robustness of DMs, assessed by adversarial prompts, when eliminating unwanted concepts, styles, and objects. We develop an effective and efficient adversarial prompt generation approach for DMs, termed UnlearnDiffAtk. This method capitalizes on the intrinsic classification abilities of DMs to simplify the creation of adversarial prompts, thereby eliminating the need for auxiliary classification or diffusion models.Through extensive benchmarking, we evaluate the robustness of five widely-used safety-driven unlearned DMs (i.e., DMs after unlearning undesirable concepts, styles, or objects) across a variety of tasks. Our results demonstrate the effectiveness and efficiency merits of UnlearnDiffAtk over the state-of-the-art adversarial prompt generation method and reveal the lack of robustness of current safety-driven unlearning techniques when applied to DMs. Codes are available at https://github.com/OPTML-Group/Diffusion-MU-Attack. WARNING: This paper contains model outputs that may be offensive in nature. This paper focuses on the safety of diffusion models (DMs) for image generation. It introduces an adversarial attack method, called Diffusion-MU-Attack, to assess the robustness of 'unlearned' DMs, which are designed to mitigate the generation of harmful or undesired images. This paper is important because it tackles the critical issue of safety in DMs, highlighting potential vulnerabilities in existing safety-driven approaches. It provides a valuable evaluation framework and a novel attack method to help improve the robustness and trustworthiness of DMs, especially important given their rapid adoption and potential for misuse. The authors develop an adversarial prompt generation method, leveraging the concept of a 'diffusion classifier' inherent in well-trained DMs. This method optimizes text prompts to circumvent the safety mechanisms of unlearned DMs, compelling them to generate images containing the erased content. They evaluate their attack against several state-of-the-art unlearned DMs across three unlearning tasks: concept, style, and object unlearning. The effectiveness of the attack is measured by its success rate in generating images classified as containing the unlearned concepts, styles, or objects. The results demonstrate the effectiveness of the proposed attack in bypassing the safety mechanisms of various unlearned DMs. Specifically, the attack successfully generates images classified as containing the erased concepts, styles, or objects with high success rates. Moreover, the attack is computationally efficient, as it does not require auxiliary diffusion or classification models. 
The results also reveal that current safety-driven unlearning techniques still lack robustness against adversarial prompt attacks. The authors acknowledge that their work primarily focuses on evaluating the robustness of unlearned DMs against adversarial prompts, leaving other attack vectors unexplored. They suggest future work could investigate the robustness against attacks on other aspects of DMs, such as the noise generation process or the latent image representation. Additionally, they emphasize the need for developing more robust unlearning methods for DMs to address the vulnerabilities exposed by their attack. diffusion_model, adversarial_attack, interpretability, safety, unlearning, machine_unlearning, robustness, text-to-image, image_generation
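To make the "diffusion classifier" idea concrete, here is a heavily simplified, hypothetical sketch of one attack step: learnable prompt-token embeddings are optimized so that the unlearned model's ordinary denoising loss on an image of the erased concept decreases. `denoiser` and `embed_prompt` are assumed interfaces, the embeddings are kept continuous (the actual method works with discrete vocabulary tokens), and no real library API is implied.

```python
import torch

def unlearndiff_attack_step(denoiser, embed_prompt, adv_tokens, x0, alphas_cumprod, optimizer):
    """One step of an adversarial-prompt sketch: given a latent x0 that contains
    the erased concept, adjust learnable appended token embeddings so that the
    unlearned model's standard denoising loss on x0 drops, i.e. the model's own
    objective acts as an implicit classifier."""
    optimizer.zero_grad()
    noise = torch.randn_like(x0)
    t = torch.randint(0, alphas_cumprod.shape[0], (x0.shape[0],), device=x0.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise          # standard forward noising
    pred = denoiser(x_t, t, embed_prompt(adv_tokens))     # epsilon prediction under the adversarial prompt
    loss = torch.nn.functional.mse_loss(pred, noise)      # lower loss => concept is still generatable
    loss.backward()
    optimizer.step()
    return loss.item()
```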
2312.14135 V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs Penghao Wu, Saining Xie When we look around and perform complex tasks, how we see and selectively process what we see is crucial. However, the lack of this visual search mechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on important visual details, especially when handling high-resolution and visually crowded images. To address this, we introduce V*, an LLM-guided visual search mechanism that employs the world knowledge in LLMs for efficient visual querying. When combined with an MLLM, this mechanism enhances collaborative reasoning, contextual understanding, and precise targeting of specific visual elements. This integration results in a new MLLM meta-architecture, named Show, sEArch, and TelL (SEAL). We further create V*Bench, a benchmark specifically designed to evaluate MLLMs in their ability to process high-resolution images and focus on visual details. Our study highlights the necessity of incorporating visual search capabilities into multimodal systems. The code is available https://github.com/penghao-wu/vstar. This paper introduces SEAL, a novel framework that integrates an LLM-guided visual search mechanism into Multimodal Large Language Models (MLLMs) to enhance their visual grounding capabilities, especially for high-resolution images where details are crucial. This paper addresses the limitations of current MLLMs in handling high-resolution images due to their reliance on pre-trained vision encoders with limited resolution and their inability to actively search for missing visual information. This is important as it highlights the need for a more human-like approach to visual processing in MLLMs, enabling them to handle more complex real-world scenarios. The authors propose SEAL, a meta-architecture consisting of a VQA LLM and a visual search model. The VQA LLM identifies missing visual information, and the visual search model, guided by the LLM's world knowledge, efficiently locates and adds these details to a Visual Working Memory (VWM), enabling the VQA LLM to provide more informed answers. They also introduce V*Bench, a benchmark to evaluate MLLMs on detailed visual grounding in high-resolution images. The SEAL framework significantly outperforms existing open-source and commercial MLLMs on V*Bench, demonstrating the effectiveness of incorporating a visual search mechanism. Their ablation studies further validate the importance of their LLM-guided search strategy over simple detection-based approaches. Additionally, their analysis on the COCO-Search18 dataset shows that their LLM-guided visual search achieves efficiency comparable to human eye fixations during visual search tasks. The authors acknowledge that their visual search model is currently designed for natural images and common objects, requiring further adaptation for handling documents, diagrams, videos, or open-world scenarios. They suggest exploring architectural improvements like incorporating convolution-based models for more efficient processing of high-resolution images. llm, multimodal, visual_search, vqa, benchmark, high_resolution
2309.17400 Directly Fine-Tuning Diffusion Models on Differentiable Rewards Kevin Clark, Paul Vicol, Kevin Swersky, David J Fleet We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms. This paper introduces DRaFT, a family of methods for efficiently fine-tuning diffusion models to maximize differentiable reward functions, such as human preference scores, through backpropagation through the sampling process, leading to improved generation quality. This paper is important because it offers a more efficient and scalable alternative to reinforcement learning for aligning diffusion model outputs with human preferences or other complex objectives, which is crucial for deploying these models in real-world applications. The authors propose DRaFT, which backpropagates reward gradients through the sampling chain, using LoRA and gradient checkpointing for efficiency. They introduce two variants: DRaFT-K, truncating backpropagation to the last K steps, and DRaFT-LV, reducing gradient variance for K=1. They evaluate their methods on Stable Diffusion 1.4 using various reward functions, including aesthetic scores, human preferences (PickScore, HPSv2), and tasks like image compressibility and adversarial example generation. DRaFT significantly outperforms RL methods in sample efficiency for maximizing aesthetic scores. DRaFT-LV achieves the best reward value on the HPSv2 benchmark, learning faster than other methods. The authors demonstrate the effectiveness of DRaFT on various tasks like generating compressible/incompressible images, manipulating object presence using object detectors, and creating adversarial examples. They also show that LoRA scaling allows for controlling the strength of fine-tuning and combining models trained with different rewards. The paper acknowledges the issue of reward hacking, where models exploit reward function limitations. Future work could explore addressing reward hacking and developing more robust reward functions. The authors also point to improving text alignment using powerful image captioning models as a potential research direction. diffusion_model, reward_learning, fine-tuning, human_preference, aesthetic, lora, gradient_checkpointing, image_generation, adversarial_example, interpretability
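A rough sketch of the truncated-backpropagation variant (DRaFT-K), under assumed interfaces: `unet_lora` is the LoRA-augmented denoiser, `step_fn(eps, t, x)` performs one sampler update, `decode` maps latents to images, and `reward_fn` is a differentiable reward. None of these names come from the paper's code.

```python
import torch

def draft_k_update(unet_lora, step_fn, timesteps, decode, reward_fn, prompt_emb,
                   K=1, latent_shape=(1, 4, 64, 64), device="cuda"):
    """Sketch of DRaFT-K: run all but the last K sampling steps without building
    a graph, keep gradients only for the final K steps, decode, score with a
    differentiable reward, and backpropagate into the LoRA parameters."""
    x = torch.randn(latent_shape, device=device)
    T = len(timesteps)
    with torch.no_grad():                                  # no graph for the early steps
        for t in timesteps[: T - K]:
            x = step_fn(unet_lora(x, t, prompt_emb), t, x)
    for t in timesteps[T - K:]:                            # graph kept only for the last K steps
        x = step_fn(unet_lora(x, t, prompt_emb), t, x)
    reward = reward_fn(decode(x)).mean()                   # e.g. an aesthetic or preference score
    (-reward).backward()                                   # ascend the reward
    return reward.item()
```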
2401.10219 Edit One for All: Interactive Batch Image Editing Thao Nguyen, Utkarsh Ojha, Yuheng Li, Haotian Liu, Yong Jae Lee In recent years, image editing has advanced remarkably. With increased human control, it is now possible to edit an image in a plethora of ways; from specifying in text what we want to change, to straight up dragging the contents of the image in an interactive point-based manner. However, most of the focus has remained on editing single images at a time. Whether and how we can simultaneously edit large batches of images has remained understudied. With the goal of minimizing human supervision in the editing process, this paper presents a novel method for interactive batch image editing using StyleGAN as the medium. Given an edit specified by users in an example image (e.g., make the face frontal), our method can automatically transfer that edit to other test images, so that regardless of their initial state (pose), they all arrive at the same final state (e.g., all facing front). Extensive experiments demonstrate that edits performed using our method have similar visual quality to existing single-image-editing methods, while having more visual consistency and saving significant time and human effort. This paper introduces a novel method for interactive batch image editing using StyleGAN, enabling the automatic transfer of user-specified edits from an example image to a batch of test images while maintaining consistency in the final edited state. This paper addresses the limitations of existing image editing techniques that primarily focus on single-image editing. It introduces the concept of interactive batch image editing, which significantly reduces human effort and time required for editing large image datasets while ensuring consistent results across images. The authors propose a two-stage approach. First, they model the user's edit in the latent space of StyleGAN by optimizing an editing direction that captures the desired change while being globally consistent across images. Second, they derive a closed-form solution to adjust the editing strength for each test image, ensuring that all edited images converge to the same final state as the user-edited example. The proposed method demonstrates superior performance in transferring various edits, such as point-based dragging and text-driven modifications, across different object categories like faces, animals, and human bodies. It achieves comparable visual quality to state-of-the-art single-image editing methods while being significantly faster and requiring minimal user annotation. The authors acknowledge limitations in capturing fine-grained details and handling semantic discrepancies between the example and test images. Future work includes extending the approach to diffusion-based models for wider edit types and addressing limitations related to out-of-distribution samples. diffusion_model, gan, image_editing, stylegan, batch_processing
2405.05967 Distilling Diffusion Models into Conditional GANs Minguk Kang, Richard Zhang, Connelly Barnes, Sylvain Paris, Suha Kwak, Jaesik Park, Eli Shechtman, Jun-Yan Zhu, Taesung Park We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference, while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in diffusion model's latent space, utilizing an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss to build an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models - DMD, SDXL-Turbo, and SDXL-Lightning - on the zero-shot COCO benchmark. This paper introduces Diffusion2GAN, a method for distilling complex multi-step diffusion models into single-step conditional GANs, accelerating inference while preserving image quality by interpreting the process as a paired image-to-image translation task. This paper is important because it addresses the slow inference speed of diffusion models, a major limitation hindering their real-time application in areas like text-to-image synthesis, 3D modeling, and video generation. By enabling one-step generation without significant quality loss, it paves the way for more practical and interactive applications of these powerful models. The authors formulate diffusion distillation as a paired image-to-image translation problem, utilizing noise-to-image pairs from the diffusion model's ODE trajectory. They introduce E-LatentLPIPS, an efficient perceptual loss operating directly in the diffusion model's latent space, for effective regression. A multi-scale conditional discriminator with text alignment loss is also employed for enhanced performance. Diffusion2GAN outperforms state-of-the-art one-step diffusion distillation models (DMD, SDXL-Turbo, SDXL-Lightning) on zero-shot COCO benchmarks. E-LatentLPIPS demonstrates superior efficiency compared to traditional LPIPS, enabling larger batch sizes. The method's effectiveness is shown by distilling both Stable Diffusion 1.5 and the larger SDXL model, achieving impressive FID and CLIP scores. The paper acknowledges limitations in handling varying classifier-free guidance scales and the performance dependency on the teacher model. Future work could explore guided distillation techniques for CFG flexibility and leveraging real image-text pairs for surpassing teacher model limitations. Additionally, further investigation is needed to address the diversity drop observed when scaling up models. diffusion_model, gan, distillation, image_generation, text-to-image, perceptual_loss, latent_space, one-step_generation, inference_speed
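An illustrative sketch of the paired regression step described above, with all components (`student`, `latent_lpips`, `sample_aug`) treated as placeholders; the full method additionally uses a multi-scale conditional discriminator and a text-alignment loss, which are omitted here.

```python
def diffusion2gan_regression(student, latent_lpips, sample_aug, z, c, teacher_latent):
    """Sketch of the paired distillation regression: the one-step student maps
    (noise z, condition c) directly to a latent and is regressed onto the
    teacher's ODE endpoint with a perceptual distance computed in latent space,
    under a shared random augmentation (the ensembling behind E-LatentLPIPS)."""
    pred = student(z, c)                         # one-step generation in the teacher's latent space
    aug = sample_aug()                           # one random differentiable augmentation, applied to both inputs
    loss = latent_lpips(aug(pred), aug(teacher_latent)).mean()
    loss.backward()
    return loss.item()
```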
2310.17513 The Expressive Power of Low-Rank Adaptation Yuchen Zeng, Kangwook Lee Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\overline{f}$ if LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of }\overline{f}}{\text{depth of }f}$. We also quantify the approximation error when LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters. This paper provides the first theoretical analysis of the expressive power of Low-Rank Adaptation (LoRA) for adapting pre-trained Fully Connected Neural Networks (FNN) and Transformer Networks (TFN). It identifies the necessary LoRA-rank for exactly adapting a frozen model to match a target model and quantifies the approximation error when the LoRA-rank is lower than the required threshold. This paper is important because it provides the first known theoretical results on the expressive power of LoRA, a widely used and successful fine-tuning method. The findings contribute to understanding why LoRA is effective and offer insights for hyperparameter tuning and algorithm development. The authors used a theoretical approach, starting with linear model approximation as a simplified scenario and extending the results to FNN and TFN with ReLU activation and softmax. They identified the required LoRA-rank by proving the existence of low-rank adapters that enable the adapted model to precisely match or approximate the target model under certain assumptions. The theoretical findings are validated by experiments on both synthetic and real datasets. Key findings include: (1) LoRA can adapt any FNN to exactly represent any smaller target FNN if the LoRA-rank meets a certain threshold. (2) For TFNs, any model can be adapted to a target model of the same size with a rank equal to half the embedding size. (3) In both linear and FNN settings, the total number of parameters needed to achieve an exact approximation is constant regardless of the LoRA-rank assignment across layers. (4) LoRA can adapt randomly generated models to match the target model with fewer parameters than final layer tuning. Limitations include the potential suboptimality of the constructed LoRA adapters, the lack of approximation error quantification for TFNs when the rank is lower than required, and the simplification of TFN architecture. Future work includes quantifying approximation errors for TFNs with insufficient ranks, refining LoRA adapter update algorithms, and studying LoRA's expressive power under more general TFN architecture settings. lora, fine-tuning, fnn, tfn, analysis, expressive_power, approximation_error
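For concreteness, the FNN rank condition stated above can be written out with an illustrative instantiation; the numbers below are hypothetical, chosen only to show how the bound is evaluated.

```latex
\[
\text{LoRA-rank} \;\ge\; \text{width}(f)\times\frac{\text{depth}(\overline{f})}{\text{depth}(f)},
\qquad\text{e.g. } \text{width}(f)=64,\ \text{depth}(f)=8,\ \text{depth}(\overline{f})=2
\;\Rightarrow\; \text{rank} \ge 64\times\tfrac{2}{8}=16 .
\]
```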
2311.17042 Adversarial Diffusion Distillation Axel Sauer, Dominik Lorenz, Andreas Blattmann, Robin Rombach We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1-4 steps while maintaining high image quality. We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal in combination with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps. Our analyses show that our model clearly outperforms existing few-step methods (GANs, Latent Consistency Models) in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first method to unlock single-step, real-time image synthesis with foundation models. Code and weights available under https://github.com/Stability-AI/generative-models and https://huggingface.co/stabilityai/ . This paper introduces Adversarial Diffusion Distillation (ADD), a novel approach for training diffusion models that generates high-quality images in just 1-4 sampling steps by combining adversarial training with score distillation from a pre-trained diffusion model. This paper is important because it addresses the limitations of current diffusion models, particularly their slow inference speed due to the iterative sampling process, and offers a method for achieving real-time, high-quality image synthesis using foundation models. The authors train a student diffusion model using a hybrid loss function consisting of two components: an adversarial loss that forces the model to generate realistic images and a score distillation loss that leverages the knowledge of a pre-trained teacher diffusion model. The model is trained to generate images from noisy inputs at various timesteps, using the same diffusion coefficients as the student model. ADD outperforms existing few-step methods like Latent Consistency Models (LCMs) and GANs in single-step image synthesis. Notably, with four sampling steps, ADD-XL surpasses the performance of its teacher model, SDXL-Base, demonstrating its capability to generate high-fidelity images efficiently. The authors acknowledge the potential for exploring different distillation weighting functions and scheduling strategies for further performance improvement. Future work could also involve investigating the application of ADD to other domains such as video and 3D generation. diffusion_model, gan, distillation, image_generation, real-time, adversarial_training, score_distillation
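A compressed sketch of the student-side objective, assuming placeholder `student`, `teacher`, and `disc` modules and illustrative noise levels; the exact loss weightings, timestep schedules, and discriminator design follow the paper, not this sketch.

```python
import torch

def add_student_loss(student, teacher, disc, x_real, c, sigma_s, sigma_t, lam=2.5):
    """Sketch of the ADD student objective: an adversarial term (the
    discriminator should judge the student's sample as real) plus a score
    distillation term (the frozen teacher's denoised estimate of the re-noised
    student sample is used as a regression target)."""
    x_s = student(x_real + sigma_s * torch.randn_like(x_real), sigma_s, c)   # few-step student sample
    adv = -disc(x_s, c).mean()                                               # generator adversarial loss (sketch)
    x_noised = x_s + sigma_t * torch.randn_like(x_s)                         # re-noise to a teacher noise level
    with torch.no_grad():
        target = teacher(x_noised, sigma_t, c)                               # teacher's denoised estimate (no grad)
    distill = torch.nn.functional.mse_loss(x_s, target)
    return adv + lam * distill
```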
2312.04837 Localized Symbolic Knowledge Distillation for Visual Commonsense Models Jae Sung Park, Jack Hessel, Khyathi Raghavi Chandu, Paul Pu Liang, Ximing Lu, Peter West, Youngjae Yu, Qiuyuan Huang, Jianfeng Gao, Ali Farhadi, Yejin Choi Instruction following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also, for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression to an LLM. This paper introduces Localized Symbolic Knowledge Distillation (LSKD), a method for generating a localized visual commonsense corpus by prompting a large language model (LLM) with global and local image descriptions. This corpus is then used to train vision-language models that can accept region references as input, enabling more precise and context-aware reasoning within images. This paper is important because it addresses the limitations of existing vision-language models in performing localized reasoning within images. By enabling users to specify regions of interest, it paves the way for more intuitive and precise multimodal interactions. Furthermore, the paper demonstrates that machine-generated, localized visual commonsense corpora can be as effective as human-annotated datasets, opening new avenues for scalable and cost-effective model training. The authors propose a multi-stage approach: 1) Image-to-text verbalization of global image content, local region descriptions, and dynamic question-answer pairs. 2) Prompting an LLM (ChatGPT) to generate localized commonsense knowledge in a question-answer-rationale format. 3) Training a supervised critic model to filter out erroneous or low-quality generated instances. 4) Fine-tuning vision-language models (e.g., BLIP-2) on the filtered corpus for both discriminative and generative localized visual reasoning tasks. Key findings include: 1) Training on the LSKD corpus significantly improves the performance of vision-language models on localized visual reasoning benchmarks, outperforming baselines and even surpassing models trained on human-annotated data in some cases. 2) A supervised critic model effectively filters out erroneous instances, leading to improved downstream task performance. 3) Generative models fine-tuned with LSKD show promising results in localized question-answering, demonstrating the potential for more interactive and human-like multimodal communication. 
The authors acknowledge limitations such as the potential for verbalizer errors and the coverage of question types in the generated corpus. Future work could focus on developing more robust verbalization techniques, expanding the diversity of question types, and exploring more sophisticated critic models to further enhance the quality and coverage of the generated knowledge. llm, vision_language, knowledge_distillation, visual_commonsense, region_grounding, critic_model
2403.13043 When Do We Not Need Larger Vision Models? Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations. In this work, we discuss the point beyond which larger vision models are not necessary. First, we demonstrate the power of Scaling on Scales (S$^2$), whereby a pre-trained and frozen smaller vision model (e.g., ViT-B or ViT-L), run over multiple image scales, can outperform larger models (e.g., ViT-H or ViT-G) on classification, segmentation, depth estimation, Multimodal LLM (MLLM) benchmarks, and robotic manipulation. Notably, S$^2$ achieves state-of-the-art performance in detailed understanding of MLLM on the V* benchmark, surpassing models such as GPT-4V. We examine the conditions under which S$^2$ is a preferred scaling approach compared to scaling on model size. While larger models have the advantage of better generalization on hard examples, we show that features of larger vision models can be well approximated by those of multi-scale smaller models. This suggests most, if not all, of the representations learned by current large pre-trained models can also be obtained from multi-scale smaller models. Our results show that a multi-scale smaller model has comparable learning capacity to a larger model, and pre-training smaller models with S$^2$ can match or even exceed the advantage of larger models. We release a Python package that can apply S$^2$ on any vision model with one line of code: https://github.com/bfshi/scaling_on_scales. This paper explores the concept of "Scaling on Scales" (S^2) as a competitive alternative to increasing model size for enhancing visual representation in vision models, demonstrating that smaller models, when applied to multiple image scales, can outperform larger models in tasks like classification, segmentation, and depth estimation. This paper challenges the prevailing assumption that larger models are always necessary for better visual understanding, proposing a more efficient scaling method that achieves comparable or superior performance with fewer parameters and similar computational cost, which has significant implications for research directions and resource allocation. The authors introduce "S^2-Wrapper," a parameter-free mechanism extending pre-trained models to multi-scale feature extraction by splitting larger images into smaller sub-images and processing them independently before merging, and then conduct extensive experiments comparing S^2 with model size scaling across various tasks and datasets, including ImageNet, ADE20k, NYUv2, and robotic manipulation. The key finding is that smaller models with S^2 scaling often match or surpass larger models in performance across various tasks, particularly excelling in dense prediction tasks such as segmentation and depth estimation, and achieving state-of-the-art performance in multimodal LLM visual detail understanding by scaling image resolution to 1008^2. Limitations include the weaker generalization of smaller models pre-trained on a single scale compared to larger models on hard examples, and future work points towards exploring scale-selective processing for efficiency and enabling parallel processing of a single image for latency-critical scenarios. llm, multimodal, vision_transformer, scaling, segmentation, depth_estimation, robotics, analysis
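A minimal sketch of the multi-scale idea behind S^2, assuming a frozen `backbone` that returns a spatial feature map of shape (B, C, h, w) for a fixed input size; the released package linked above differs in details (for instance, it operates on ViT token features rather than a generic feature map).

```python
import torch
import torch.nn.functional as F

def s2_features(backbone, image, scales=(1, 2), base=224):
    """S^2 sketch: for each scale s, upsample the image, split it into s*s
    sub-images of the base size, run the frozen backbone on every sub-image,
    stitch the sub-image features back together, pool to the base grid, and
    concatenate features across scales along the channel dimension."""
    feats = []
    for s in scales:
        img_s = F.interpolate(image, size=(base * s, base * s), mode="bilinear", align_corners=False)
        tiles = img_s.unfold(2, base, base).unfold(3, base, base)            # B, 3, s, s, base, base
        tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, image.shape[1], base, base)
        f = backbone(tiles)                                                  # (B*s*s, C, h, w), assumed
        C, h, w = f.shape[1:]
        f = f.reshape(image.shape[0], s, s, C, h, w).permute(0, 3, 1, 4, 2, 5)
        f = f.reshape(image.shape[0], C, s * h, s * w)                       # stitched feature map
        feats.append(F.adaptive_avg_pool2d(f, (h, w)))                       # pool back to the base grid
    return torch.cat(feats, dim=1)                                           # channel-wise concat across scales
```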
2308.07686 Boosting Multi-modal Model Performance with Adaptive Gradient Modulation Hong Li, Xingyu Li, Pengbo Hu, Yinuo Lei, Chunxiao Li, Yi Zhou While the field of multi-modal learning keeps growing fast, the deficiency of the standard joint training paradigm has become clear through recent studies. They attribute the sub-optimal performance of the jointly trained model to the modality competition phenomenon. Existing works attempt to improve the jointly trained model by modulating the training process. Despite their effectiveness, those methods can only apply to late fusion models. More importantly, the mechanism of the modality competition remains unexplored. In this paper, we first propose an adaptive gradient modulation method that can boost the performance of multi-modal models with various fusion strategies. Extensive experiments show that our method surpasses all existing modulation methods. Furthermore, to have a quantitative understanding of the modality competition and the mechanism behind the effectiveness of our modulation method, we introduce a novel metric to measure the competition strength. This metric is built on the mono-modal concept, a function that is designed to represent the competition-less state of a modality. Through systematic investigation, our results confirm the intuition that the modulation encourages the model to rely on the more informative modality. In addition, we find that the jointly trained model typically has a preferred modality on which the competition is weaker than other modalities. However, this preferred modality need not dominate others. Our code will be available at https://github.com/lihong2303/AGM_ICCV2023. This paper proposes Adaptive Gradient Modulation (AGM), a novel method for enhancing the performance of multi-modal learning models by adaptively controlling the gradient flow during training to mitigate modality competition. This work is important because it addresses the sub-optimal performance of standard joint training in multi-modal learning, particularly the issue of modality competition where a dominant modality hinders the learning of other modalities. It provides a novel solution (AGM) applicable to various fusion strategies and offers insights into the dynamics of modality competition. The authors develop AGM, which utilizes Shapley value-based attribution to isolate mono-modal responses and adaptively modulates the gradients of individual modalities during back-propagation. They introduce the concept of "mono-modal concept" to represent the ideal, competition-less state of a modality and use it to quantify the competition strength. Experiments are conducted on five multi-modal datasets (AV-MNIST, CREMA-D, UR-Funny, AVE, CMU-MOSEI) with varying fusion strategies, modalities, and network architectures to evaluate AGM's effectiveness and analyze modality competition. The key findings demonstrate that AGM consistently outperforms existing modulation methods and significantly improves multi-modal models' accuracy across different datasets and architectures. The analysis reveals that AGM encourages models to leverage more informative modalities and mitigates the model's inherent bias towards specific modalities during training. The paper also establishes that modality competition is prevalent in multi-modal models, often with a "preferred modality" that the model tends to exploit. 
The strength of modality competition is found to be largely independent of the fusion strategy and modality type but appears to be influenced by the specific task and data characteristics. The paper acknowledges the need for further investigation into the relationship between modality competition strength, modality information content, and data characteristics. Future work could explore more sophisticated methods for defining and utilizing the "mono-modal concept" and investigate the role of higher-order interactions among modalities in shaping competition dynamics. analysis, multi-modal_learning, modality_competition, gradient_modulation, shapley_value, fusion_strategies
2308.08089 DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, Nan Duan Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models' capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, We propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories in different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation. The homepage link is \url{https://www.microsoft.com/en-us/research/project/dragnuwa/} DragNUWA is an open-domain, diffusion-based video generation model that introduces fine-grained control over video content using text, image, and trajectory inputs, focusing on addressing limitations in trajectory control for open-domain scenarios. This paper is important as it tackles two key limitations in existing controllable video generation models: lack of fine-grained control and limited ability to handle complex trajectories in open-domain settings. DragNUWA's innovative approach, including Trajectory Sampler, Multiscale Fusion, and Adaptive Training, allows for more comprehensive and user-friendly control over video generation, opening new avenues for creative applications. DragNUWA leverages a diffusion-based model with a multi-stage training process. First, it uses a Trajectory Sampler to extract diverse trajectories from open-domain videos. Then, a Multiscale Fusion module integrates text, image, and trajectory data at different resolutions within the UNet architecture. Finally, Adaptive Training progressively adapts the model from dense optical flow conditions to user-defined sparse trajectories, ensuring stability and consistency in video generation. DragNUWA demonstrates superior performance in fine-grained video generation. It can effectively control complex object trajectories, including curved paths and varying motion amplitudes, as well as handle camera movements like zooming in and out. The model highlights the importance of combining text, image, and trajectory inputs for achieving comprehensive control over semantic, spatial, and temporal aspects of video content. The paper does not explicitly mention limitations but implies that incorporating video as a condition is beyond the scope of this research. Future work could explore the integration of video conditions for potential advancements in style transfer. 
Additionally, the paper primarily focuses on visual fidelity and controllability; investigating and improving the model's ability to generate temporally consistent and logically sound narratives could be a valuable direction for future research. diffusion_model, video, motion, controllable_generation, trajectory, open-domain
2312.10835 Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models Nikita Starodubcev, Artem Fedorov, Artem Babenko, Dmitry Baranchuk Knowledge distillation methods have recently shown to be a promising direction to speedup the synthesis of large-scale diffusion models by requiring only a few inference steps. While several powerful distillation methods were recently proposed, the overall quality of student samples is typically lower compared to the teacher ones, which hinders their practical usage. In this work, we investigate the relative quality of samples produced by the teacher text-to-image diffusion model and its distilled student version. As our main empirical finding, we discover that a noticeable portion of student samples exhibit superior fidelity compared to the teacher ones, despite the "approximate" nature of the student. Based on this finding, we propose an adaptive collaboration between student and teacher diffusion models for effective text-to-image synthesis. Specifically, the distilled model produces the initial sample, and then an oracle decides whether it needs further improvements with a slow teacher model. Extensive experiments demonstrate that the designed pipeline surpasses state-of-the-art text-to-image alternatives for various inference budgets in terms of human preference. Furthermore, the proposed approach can be naturally used in popular applications such as text-guided image editing and controllable generation. This paper explores a novel approach to text-to-image synthesis using an adaptive collaboration framework between a distilled student diffusion model and a teacher diffusion model. The paper addresses the limitations of existing distillation methods for diffusion models, which often compromise image quality while achieving faster inference. The proposed approach aims to combine the efficiency of distilled models with the high fidelity of teacher models, potentially leading to a new paradigm in text-to-image generation. The authors first analyze the performance of distilled text-to-image models and observe that a significant portion of the samples generated by students can be superior to the teacher. Based on this, they propose an adaptive pipeline where the student model generates an initial sample. An oracle, implemented using an image quality estimator (ImageReward), then decides whether to refine this sample further using the teacher model. This decision is made based on a learned threshold. The refinement process can be either a regeneration of the sample from scratch using the teacher or a refinement of the student's output. The proposed adaptive collaboration framework outperforms existing text-to-image baselines in terms of both human preference and automated metrics (FID, CLIP score, ImageReward) under various inference budgets. The method achieves a 2.5x to 5x speedup compared to standard diffusion models while maintaining or even surpassing their quality. Furthermore, the approach is successfully applied to text-guided image editing and controllable generation tasks, demonstrating its versatility and potential for broader applications. The authors acknowledge the limitations of current automated image quality estimators as a potential bottleneck for their approach. Future work could focus on developing more accurate estimators that better correlate with human preferences. 
Additionally, investigating the applicability of other fast text-to-image generation methods besides distillation, such as GANs, within their adaptive framework is suggested. diffusion_model, gan, analysis, image_generation, knowledge_distillation, text-to-image
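Since the pipeline above reduces to a simple routing rule, a toy sketch may help; `student`, `teacher`, and `scorer` are placeholder callables and the threshold `tau` is illustrative (the actual oracle is a tuned ImageReward-based estimator).

```python
def adaptive_generate(student, teacher, scorer, prompt, tau=0.4, refine=True):
    """Sketch of adaptive teacher-student collaboration: the few-step student
    proposes a sample, an automatic quality estimator acts as the oracle, and
    only samples scoring below `tau` are handed to the slow teacher, either for
    refinement of the student output or for regeneration from scratch."""
    img = student(prompt)                        # cheap few-step (distilled) sample
    if scorer(prompt, img) >= tau:
        return img                               # oracle accepts the student sample
    if refine:
        return teacher(prompt, init_image=img)   # teacher improves the student output
    return teacher(prompt)                       # or regenerates from scratch
```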
2308.16582 Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, Hang Xu Stable diffusion, a generative model used in text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue primarily stems from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, direct training on images of unlimited sizes is unfeasible, as it would require an immense number of text-image pairs and entail substantial computational expenses. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size, while minimizing the need for high-memory GPU resources. Specifically, the initial stage, dubbed Any Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a restricted range of ratios to optimize the text-conditional diffusion model, thereby improving its ability to adjust composition to accommodate diverse image sizes. To support the creation of images at any desired size, we further introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the subsequent stage. This method allows for the rapid enlargement of the ASD output to any high-resolution size, avoiding seaming artifacts or memory overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2x compared to the traditional tiled algorithm. This paper introduces Any-Size-Diffusion (ASD), a two-stage pipeline designed to generate well-composed images of arbitrary sizes from text prompts, addressing the resolution-induced composition problems in existing text-to-image synthesis models. This paper is important because it tackles the limitation of existing text-to-image models like Stable Diffusion, which often struggle to maintain good composition when generating images at different resolutions. The proposed ASD model allows for flexible image size generation while preserving compositional quality, significantly enhancing the capabilities of text-to-image synthesis. The ASD pipeline works in two stages: 1) Any Ratio Adaptability Diffusion (ARAD) is trained on multi-aspect ratio images to generate an image based on text prompt and size, minimizing composition issues. 2) Fast Seamless Tiled Diffusion (FSTD) enlarges the ARAD output to any desired size using a novel implicit overlap technique during tiled sampling, ensuring both speed and seamless image magnification. ASD demonstrates superior performance in generating well-composed images of arbitrary sizes, confirmed through quantitative and qualitative evaluation. Experiments show ASD achieves a 33.49 reduction in FID score compared to the baseline Stable Diffusion model and generates images up to 9 times higher resolution on the same hardware. The implicit overlap in FSTD effectively addresses seaming artifacts common in tiled diffusion methods, achieving high-fidelity image magnification while maintaining a speed comparable to non-overlapping tiling. The paper acknowledges a potential limitation in the computational cost associated with increasing the number of tiles in FSTD for higher resolutions. Future work could explore optimization strategies to mitigate this, further enhancing the model's efficiency. 
Additionally, the authors suggest exploring the application of ASD in other domains such as video generation and 3D object synthesis. diffusion_model, text-to-image, image_synthesis, super-resolution, compositionality, tiled_diffusion
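One way to picture the "implicit overlap" trick in FSTD is a per-step random shift of the tile grid, so tile borders never fall in the same place twice and seams average out without explicitly overlapping tiles. The sketch below is an interpretation with placeholder components, not the authors' implementation.

```python
import torch

def tiled_denoise_step(denoise_tile, x, t, tile=64):
    """Tiled sampling sketch with implicit overlap: shift the latent by a random
    offset, denoise each tile independently, then shift back. Repeating this
    with a fresh offset at every step spreads tile boundaries across locations."""
    B, C, H, W = x.shape
    dy, dx = torch.randint(0, tile, (2,)).tolist()       # random grid offset for this step
    x = torch.roll(x, shifts=(dy, dx), dims=(2, 3))
    out = torch.zeros_like(x)
    for i in range(0, H, tile):
        for j in range(0, W, tile):
            out[:, :, i:i + tile, j:j + tile] = denoise_tile(x[:, :, i:i + tile, j:j + tile], t)
    return torch.roll(out, shifts=(-dy, -dx), dims=(2, 3))
```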
2402.12354 LoRA+: Efficient Low Rank Adaptation of Large Models Soufiane Hayou, Nikhil Ghosh, Bin Yu In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for A and B does not allow efficient feature learning. We then show that this suboptimality of LoRA can be corrected simply by setting different learning rates for the LoRA adapter matrices A and B with a well-chosen ratio. We call this proposed algorithm LoRA$+$. In our extensive experiments, LoRA$+$ improves performance (1-2 $\%$ improvements) and finetuning speed (up to $\sim$ 2X SpeedUp), at the same computational cost as LoRA. This paper investigates the efficiency of Low Rank Adaptation (LoRA) for finetuning large language models and identifies suboptimal feature learning when using the same learning rate for adapter matrices A and B, especially in models with large embedding dimensions. The paper is important because it provides theoretical insights into the optimal setting of learning rates for LoRA, a widely used technique for efficient finetuning of large language models, and proposes a simple yet effective improvement called LoRA+. The authors utilize scaling arguments, analyzing the behavior of LoRA in the infinite-width limit. They study a simplified linear model and then extend their analysis to general neural architectures with LoRA layers, demonstrating the inefficiency of using equal learning rates for A and B and deriving optimal scaling rules for these learning rates. The key finding is that setting different learning rates for the LoRA adapter matrices A and B, specifically η_A = Θ(n^-1) and η_B = Θ(1), leads to efficient feature learning in the infinite-width limit. Empirically, they show that LoRA+ with a learning rate ratio of η_B/η_A ≈ 2^4 consistently improves finetuning speed and performance on various tasks and language models, including GPT-2, RoBERTa, and LLama-7b. The paper acknowledges limitations in precisely determining the optimal learning rate ratio for different tasks and models, suggesting that the ratio is task and model dependent. Future work could involve a more refined analysis to estimate the optimal ratio based on task and model characteristics, potentially leading to further performance improvements. diffusion_model, llm, analysis, finetuning, lora, optimization
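In practice the recipe reduces to two optimizer parameter groups. A sketch, assuming LoRA parameters are identifiable by name (as in common PEFT-style implementations); the ratio value is the ballpark reported in the paper, not a universal constant, and any other trainable parameters simply default to the A-group here.

```python
import torch

def lora_plus_optimizer(model, lr=2e-4, ratio=16, weight_decay=0.0):
    """LoRA+ sketch: train the LoRA 'B' matrices with a learning rate `ratio`
    times larger than the 'A' matrices by splitting them into two groups."""
    group_a, group_b = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (group_b if "lora_B" in name else group_a).append(p)
    return torch.optim.AdamW(
        [{"params": group_a, "lr": lr},            # A matrices (and any other trainable params)
         {"params": group_b, "lr": lr * ratio}],   # B matrices at a larger learning rate
        weight_decay=weight_decay,
    )
```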
2311.10093 The Chosen One: Consistent Characters in Text-to-Image Diffusion Models Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, Dani Lischinski Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, these models struggle with generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach. Project page is available at https://omriavrahami.com/the-chosen-one The paper proposes a fully automated method for generating consistent characters in different contexts using text-to-image diffusion models, taking only a text prompt as input. This paper addresses a crucial limitation in current text-to-image models: the inability to generate consistent characters across various scenes, which is important for storytelling, game development, and other creative applications. The proposed method offers a fully automated solution, unlike existing manual or limited approaches. The method iteratively refines the representation of a character. It generates a gallery of images from a text prompt, embeds them in a feature space (using DINOv2), clusters the embeddings, and chooses the most cohesive cluster. This cluster is used to personalize a text-to-image model (SDXL) via textual inversion and LoRA, yielding a refined character representation. The process is repeated until convergence, ensuring consistent character generation in diverse contexts. The method effectively balances prompt adherence and identity consistency compared to baselines like Textual Inversion, LoRA DreamBooth, ELITE, BLIP-diffusion, and IP-adapter. Quantitative analysis and a user study confirm its effectiveness in generating diverse depictions of consistent characters. The authors acknowledge limitations such as occasional inconsistencies in identity, challenges with consistent supporting characters, potential for spurious attributes, high computational cost, and tendency to generate simplistic scenes. They suggest future work on reducing these limitations and exploring broader applications like story generation and interactive character design. diffusion_model, consistent_character, personalization, text-to-image, clustering, analysis, user_study, sdxl, dinov2
2402.11411 Aligning Modalities in Vision Large Language Models via Preference Fine-tuning Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao Instruction-following Vision Large Language Models (VLLMs) have achieved significant progress recently on a variety of tasks. These approaches merge strong pre-trained vision models and large language models (LLMs). Since these components are trained separately, the learned representations need to be aligned with joint training on additional image-language pairs. This procedure is not perfect and can cause the model to hallucinate - provide answers that do not accurately reflect the image, even when the core LLM is highly factual and the vision backbone has sufficiently complete representations. In this work, we frame the hallucination problem as an alignment issue, tackle it with preference tuning. Specifically, we propose POVID to generate feedback data with AI models. We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data. First, we prompt GPT-4V to inject plausible hallucinations into the correct answer. Second, we distort the image to trigger the inherent hallucination behavior of the VLLM. This is an automated approach, which does not rely on human data generation or require a perfect expert, which makes it easily scalable. Finally, both of these generation strategies are integrated into an RLHF pipeline via Direct Preference Optimization. In experiments across broad benchmarks, we show that we can not only reduce hallucinations, but improve model performance across standard benchmarks, outperforming prior approaches. Our data and code are available at https://github.com/YiyangZhou/POVID. This paper introduces POVID, a novel approach for aligning image and text modalities in Vision Large Language Models (VLLMs) to mitigate hallucination issues using AI-generated dispreferences for preference tuning. This paper addresses the significant problem of hallucinations in VLLMs, where the model generates text that doesn't accurately reflect the image content. This is crucial for deploying VLLMs in real-world applications where accuracy is paramount. The authors propose POVID, a two-stage approach. First, they utilize GPT-4V to create plausible hallucinations in ground-truth image captions and reasoning tasks, generating dispreferred responses. Second, they introduce noise into the input images during training to trigger inherent VLLM hallucination patterns, further improving modality alignment using a modified DPO loss. POVID significantly outperforms previous VLLM preference tuning methods, achieving a 31.78% improvement on hallucination benchmarks and consistent gains on comprehensive VLLM benchmarks. It effectively reduces hallucinations and shows superior performance in image captioning and detailed description tasks. The paper doesn't explicitly mention limitations. Future work could explore different noise injection techniques, expand to other VLLM architectures, and investigate the generalization of POVID to other multimodal tasks beyond image captioning and reasoning. diffusion_model, llm, hallucination, alignment, vllm, image_captioning, reasoning, preference_tuning, dpo
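The preference-tuning backbone here is standard Direct Preference Optimization; a minimal sketch of the loss on one preference pair is shown below (POVID's image-distortion component and any implementation specifics are omitted).

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss on one pair: `logp_w` / `logp_l` are the policy's summed token
    log-probs of the preferred (ground-truth) and dispreferred (hallucinated)
    responses; `ref_*` are the same quantities under the frozen reference model."""
    ratio_w = logp_w - ref_logp_w          # log-ratio for the preferred response
    ratio_l = logp_l - ref_logp_l          # log-ratio for the dispreferred response
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
```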
2309.07986 Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models James Burgess, Kuan-Chieh Wang, Serena Yeung Text-to-image diffusion models understand spatial relationship between objects, but do they represent the true 3D structure of the world from only 2D supervision? We demonstrate that yes, 3D knowledge is encoded in 2D image diffusion models like Stable Diffusion, and we show that this structure can be exploited for 3D vision tasks. Our method, Viewpoint Neural Textual Inversion (ViewNeTI), controls the 3D viewpoint of objects in generated images from frozen diffusion models. We train a small neural mapper to take camera viewpoint parameters and predict text encoder latents; the latents then condition the diffusion generation process to produce images with the desired camera viewpoint. ViewNeTI naturally addresses Novel View Synthesis (NVS). By leveraging the frozen diffusion model as a prior, we can solve NVS with very few input views; we can even do single-view novel view synthesis. Our single-view NVS predictions have good semantic details and photorealism compared to prior methods. Our approach is well suited for modeling the uncertainty inherent in sparse 3D vision problems because it can efficiently generate diverse samples. Our view-control mechanism is general, and can even change the camera view in images generated by user-defined prompts. This paper introduces Viewpoint Neural Textual Inversion (ViewNeTI), a method for controlling the viewpoint of objects in images generated by text-to-image diffusion models, enabling novel view synthesis from as little as a single input view. This paper is important because it demonstrates that 2D diffusion models, despite being trained on unposed images, encode 3D structural knowledge that can be leveraged for 3D vision tasks like novel view synthesis, even with very limited 3D supervision. The authors train a small neural network, the view-mapper, to predict text encoder latents based on camera viewpoint parameters. These latents, along with object-specific latents, condition a frozen diffusion model (Stable Diffusion) to generate images from desired viewpoints. They explore single-scene training for viewpoint interpolation and multi-scene pretraining for generalization to novel scenes and single-view synthesis. ViewNeTI achieves impressive results for novel view synthesis, especially in the challenging single-view setting. It generates photorealistic images with plausible semantics, outperforming baselines in terms of visual quality and certain metrics like LPIPS. The method also demonstrates potential for viewpoint control in text-to-image generation. The paper acknowledges limitations in object localization, which affects PSNR scores, and struggles with generating precise object details. Future work could address these limitations, explore faster inference for object token optimization, and investigate applying the framework to other 3D tasks like relighting and 2D-to-3D lifting. diffusion_model, novel_view_synthesis, textual_inversion, 3d, single-view, viewpoint_control, stable diffusion
2404.19756 KAN: Kolmogorov-Arnold Networks Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, Max Tegmark Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs. This paper introduces Kolmogorov-Arnold Networks (KANs), a novel neural network architecture inspired by the Kolmogorov-Arnold representation theorem, as a promising alternative to Multi-Layer Perceptrons (MLPs) for function approximation, featuring learnable activation functions on edges. The paper is important because it challenges the dominance of MLPs in deep learning by presenting KANs as a more accurate and interpretable alternative, especially in scientific domains. KANs exhibit faster neural scaling laws, better handle the curse of dimensionality for functions with compositional structure, and offer improved interpretability, potentially making them valuable for AI-driven scientific discovery. The authors generalize the Kolmogorov-Arnold representation theorem to arbitrary network depths and widths. They parameterize each weight in the network as a learnable 1D spline function, allowing for fine-grained control over function approximation. The paper includes extensive experiments on toy datasets, special functions, Feynman equations, partial differential equations, and real-world scientific datasets in knot theory and condensed matter physics to demonstrate KANs' advantages in accuracy and interpretability. The authors also propose simplification techniques like sparsity regularization and pruning to enhance interpretability. KANs consistently outperform MLPs in terms of accuracy and parameter efficiency across various tasks, including function fitting, PDE solving, and symbolic regression. Their test loss scales favorably with the number of parameters, approaching the theoretically predicted scaling exponent. KANs demonstrate an ability to learn complex functions, including special functions and phase transition boundaries. They can be simplified and visualized to reveal underlying compositional structures and enable symbolic regression with human interaction. In applications to scientific datasets, KANs rediscover known mathematical relations in knot theory and uncover mobility edges in condensed matter physics, highlighting their potential for AI-driven scientific discovery. 
The authors acknowledge that the mathematical understanding of deeper KANs is limited and propose a generalized Kolmogorov-Arnold theorem as future work. Algorithmically, they identify potential improvements in accuracy, efficiency, and training strategies, including adaptive grids and hybrid KAN-MLP architectures. They also suggest expanding KAN applications to other scientific domains and integrating them into existing architectures like transformers. A key limitation is the current slow training speed of KANs compared to MLPs. kan, analysis, interpretability, neural_scaling_law, pde, scientific_discovery, symbolic_regression
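A toy sketch of the KAN entry's core idea, learnable univariate functions on edges: here each edge function is a learnable combination of fixed Gaussian bumps plus a SiLU base term rather than the B-splines used in the paper, which keeps the example short while preserving the "functions on edges, sum at nodes" structure. The fitted target is one of the paper's toy functions.

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """Simplified KAN-style layer: one learnable 1D function per edge (RBF basis, not B-splines)."""
    def __init__(self, in_dim: int, out_dim: int, num_basis: int = 8, x_range=(-2.0, 2.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(*x_range, num_basis))   # (K,)
        self.width = (x_range[1] - x_range[0]) / num_basis
        self.coef = nn.Parameter(torch.zeros(out_dim, in_dim, num_basis))      # spline-like coefficients
        self.base = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)           # residual base function

    def forward(self, x: torch.Tensor) -> torch.Tensor:                        # x: (B, in_dim)
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2) # (B, in_dim, K)
        edge = torch.einsum("bik,oik->boi", phi, self.coef)                    # each edge's own function
        edge = edge + nn.functional.silu(x).unsqueeze(1) * self.base
        return edge.sum(dim=-1)                                                # sum over incoming edges


# Two-layer KAN fitting f(x, y) = exp(sin(pi*x) + y^2), one of the paper's toy examples.
model = nn.Sequential(KANLayer(2, 5), KANLayer(5, 1))
x = torch.rand(256, 2) * 2 - 1
y = torch.exp(torch.sin(torch.pi * x[:, :1]) + x[:, 1:] ** 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
print(f"final MSE: {loss.item():.4f}")
```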
2401.08740 SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, Saining Xie We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which allows for connecting two distributions in a more flexible way than standard diffusion models, makes possible a modular study of various design choices impacting generative models built on dynamical transport: using discrete vs. continuous time learning, deciding the objective for the model to learn, choosing the interpolant connecting the distributions, and deploying a deterministic or stochastic sampler. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 benchmark using the exact same backbone, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves an FID-50K score of 2.06. This paper introduces Scalable Interpolant Transformers (SiT), a class of generative models based on Diffusion Transformers (DiT) that leverage stochastic interpolants to achieve improved performance in image generation. This work is important because it provides a detailed analysis of the design choices involved in building generative models based on dynamical transport, potentially leading to more efficient and higher-performing models. Specifically, it demonstrates a consistent performance gain over DiT by carefully selecting the interpolant connecting data and noise distributions, and by choosing to learn the velocity field of the interpolating process instead of the score. The authors start with the DDPM framework and systematically analyze the effects of different design choices on the ImageNet 256x256 benchmark. They experiment with discrete vs. continuous time learning, predicting velocity vs. score, using different interpolants like linear and generalized variance-preserving (GVP), and employing deterministic (Heun) vs. stochastic (Euler-Maruyama) samplers with tunable diffusion coefficients. SiT consistently outperforms DiT in FID scores across all model sizes, demonstrating the effectiveness of using stochastic interpolants and learning the velocity field. The paper also finds that SDE-based sampling generally leads to better performance than ODE-based sampling, and that the optimal diffusion coefficient for SDE sampling depends on the choice of interpolant and model. Using classifier-free guidance further enhances SiT's performance, achieving a FID-50K score of 2.06, surpassing DiT in all comparable settings. The authors acknowledge that the performance of different samplers might vary under different computational budgets. They plan to explore the application of SiT to other downstream tasks, such as video generation and image editing, in future work. Additionally, they plan to investigate potential performance improvements by combining SiT with other advanced sampling techniques and architectural modifications. diffusion_model, gan, interpolant, analysis, image_generation, transformer, sde, ode
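A sketch of the SiT-style training objective discussed above, assuming a linear interpolant between data and noise and velocity prediction as the learning target; the backbone is stubbed by a tiny MLP rather than a DiT, and the paper's sampler and diffusion-coefficient choices are not covered.

```python
import torch
import torch.nn as nn

def velocity_training_step(model: nn.Module, x0: torch.Tensor) -> torch.Tensor:
    """One step of velocity matching with the linear interpolant x_t = (1-t) x0 + t eps."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)                 # continuous time in (0, 1)
    eps = torch.randn_like(x0)
    t_ = t.view(b, *([1] * (x0.dim() - 1)))
    x_t = (1.0 - t_) * x0 + t_ * eps                    # interpolant between data and noise
    v_target = eps - x0                                 # d/dt x_t, the velocity target
    v_pred = model(x_t, t)
    return ((v_pred - v_target) ** 2).mean()


class TinyVelocityNet(nn.Module):
    """Stand-in for the DiT backbone, conditioned on (x_t, t)."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x, t):
        return self.net(torch.cat([x, t.unsqueeze(-1)], dim=-1))


model = TinyVelocityNet()
loss = velocity_training_step(model, torch.randn(16, 8))
loss.backward()
print(loss.item())
```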
2310.05914 NEFTune: Noisy Embeddings Improve Instruction Finetuning Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune. This paper introduces NEFTune, a simple yet effective technique for improving instruction fine-tuning of large language models (LLMs) by adding noise to embedding vectors during training. This paper is important because it presents a novel approach to enhance the performance of instruction-tuned LLMs, addressing the critical need for efficient use of limited instruction datasets in LLM training. The authors employ NEFTune, which involves adding scaled uniform noise to the embedding vectors during the forward pass of fine-tuning. They evaluate NEFTune's impact on various LLM architectures, including LLaMA-1, LLaMA-2, and OPT, using different instruction-tuning datasets like Alpaca, Evol-Instruct, ShareGPT, and OpenPlatypus. The evaluation leverages AlpacaEval and OpenLLM Leaderboard tasks to assess the conversational quality and factual accuracy of the models. NEFTune significantly improves the performance of LLMs across different model sizes and datasets, leading to more fluent and informative responses. Notably, it exhibits an average improvement of 15% in AlpacaEval Win Rate. Additionally, the authors find that NEFTune helps mitigate overfitting to the instruction datasets, allowing the models to generalize better and generate more human-like responses. The authors acknowledge limitations such as reliance on AlpacaEval and limited computational resources for evaluating larger models. Future work includes exploring the impact of NEFTune on model safety and reliability, investigating its effectiveness with larger model variants (e.g., 70B parameters) across multiple datasets, and gaining a deeper understanding of the underlying mechanisms by which NEFTune improves performance. diffusion_model, llm, analysis, instruction_finetuning, overfitting, regularization, embedding, conversational_ai
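The NEFTune augmentation itself is a few lines; the sketch below follows the scaling rule described in the paper (uniform noise scaled by alpha / sqrt(seq_len * dim), applied to token embeddings only during training), with padding-mask handling omitted for brevity.

```python
import torch

def neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0, training: bool = True) -> torch.Tensor:
    """Add NEFTune-style uniform noise to token embeddings of shape (batch, seq_len, dim)."""
    if not training:
        return embeddings                                # leave inference untouched
    seq_len, dim = embeddings.shape[1], embeddings.shape[2]
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1.0, 1.0) * scale
    return embeddings + noise


emb = torch.randn(2, 128, 4096)                          # e.g. LLaMA-2-7B hidden size
noisy = neftune_noise(emb, alpha=5.0)
print((noisy - emb).abs().max().item())                  # bounded by alpha / sqrt(seq_len * dim)
```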
2402.16842 Asymmetry in Low-Rank Adapters of Foundation Models Jiacheng Zhu, Kristjan Greenewald, Kimia Nadjahi, Haitz Sáez de Ocáriz Borde, Rickard Brüel Gabrielsson, Leshem Choshen, Marzyeh Ghassemi, Mikhail Yurochkin, Justin Solomon Parameter-efficient fine-tuning optimizes large, pre-trained foundation models by updating a subset of parameters; in this class, Low-Rank Adaptation (LoRA) is particularly effective. Inspired by an effort to investigate the different roles of LoRA matrices during fine-tuning, this paper characterizes and leverages unexpected asymmetry in the importance of low-rank adapter matrices. Specifically, when updating the parameter matrices of a neural network by adding a product $BA$, we observe that the $B$ and $A$ matrices have distinct functions: $A$ extracts features from the input, while $B$ uses these features to create the desired output. Based on this observation, we demonstrate that fine-tuning $B$ is inherently more effective than fine-tuning $A$, and that a random untrained $A$ should perform nearly as well as a fine-tuned one. Using an information-theoretic lens, we also bound the generalization of low-rank adapters, showing that the parameter savings of exclusively training $B$ improves the bound. We support our conclusions with experiments on RoBERTa, BART-Large, LLaMA-2, and ViTs. This paper investigates the asymmetry in the roles of adapter matrices in Low-Rank Adaptation (LoRA) for fine-tuning large language models, finding that the matrix projecting input features to a lower dimension (A) plays a less crucial role than the matrix mapping these features to the output (B). The paper is important because it provides theoretical and empirical evidence for simplifying and improving the efficiency of LoRA fine-tuning, suggesting that using a fixed, randomly initialized A matrix while solely tuning B can lead to comparable or better performance with reduced parameter usage and improved generalization. The authors analyze the asymmetry in LoRA through theoretical analysis of linear regression and nonlinear loss functions, along with empirical evaluations across diverse tasks, including natural language understanding (GLUE, MMLU), generation (XSum, CNN/DailyMail), and image classification (DomainBed) using RoBERTa, BART-Large, LLaMA-2, and Vision Transformer (ViT) models. The key results demonstrate that: (1) Tuning only the B matrix in LoRA generally outperforms tuning only A, confirming its greater importance. (2) Using a random orthogonal matrix for A while tuning B can achieve comparable or even superior performance to standard LoRA, especially when the rank of B is increased to match the parameter count, suggesting this approach improves parameter efficiency and generalization. (3) The asymmetry and benefits of tuning only B are observed across different models (RoBERTa, BART-Large, LLaMA-2, ViT) and tasks, including language understanding, generation, and image classification, indicating its broad applicability. The paper acknowledges limitations in the theoretical analysis, which primarily focuses on linear models and single-layer networks, and suggests extending the analysis to more complex and realistic network architectures as future work. Further exploration of the relationship between the random initialization of A and input data distribution is also proposed. llm, lora, peft, fine-tuning, analysis, generalization, parameter_efficiency, text_generation, text_classification, image_classification
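A minimal sketch of the recipe the asymmetry paper motivates: keep a random, (near-)orthogonal A frozen and train only B. Module names and the scaling convention are generic LoRA-style choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LoRAOnlyB(nn.Module):
    """LoRA adapter with a frozen random orthogonal A; only B is trained."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        a = torch.empty(r, base.in_features)
        nn.init.orthogonal_(a)                                      # frozen, untrained A
        self.register_buffer("A", a)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # only B is learned
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.t()) @ self.B.t()


layer = LoRAOnlyB(nn.Linear(768, 768), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12288 = 768 * 16, half the parameters of a rank-16 LoRA that trains A and B
```

As the paper notes, the rank can be raised to match the parameter count of a standard LoRA that trains both matrices, often improving the efficiency/performance trade-off.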
2404.00384 TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, Kyungsu Kim We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a singular tag (word) while neglecting other pertinent tags, stemming from CLIP's text embeddings that prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to biased tag relevancy. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. TTD first extracts image-relevant tags from text based on their similarity to the nearest pixels then employs a self-distillation strategy to align combined masks with the text-derived mask. This approach ensures the unbiased image-text alignment of the CLIP-based models using only image-text pairs without necessitating additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code is available at https://github.com/shjo-april/TTD. This paper identifies and addresses the "single tag bias" in CLIP-based models, where the models disproportionately focus on a single tag in image-text relationships, and proposes a novel fine-tuning method called Text-Tag Self-Distillation (TTD) to mitigate this bias. This paper is important because it addresses a critical limitation in CLIP-based models that hinders their performance in multi-tag classification and segmentation tasks. By mitigating the single tag bias, the paper paves the way for improved image-text alignment and opens up possibilities for more accurate and robust open-vocabulary applications. The authors propose a two-step approach: 1) **Tag Selection by Pixel-Tag Scoring:** Instead of relying on global image embeddings prone to bias, they compute similarity scores between each tag and its most correlated pixel, enabling more accurate identification of image-relevant tags. 2) **Text-Tag Self-Distillation:** They generate an ideal image-text similarity map reflecting all relevant tags and use it to guide the model to learn from all relevant tags during fine-tuning, thus mitigating the single tag bias. The proposed method demonstrates significant improvements in both multi-tag classification and segmentation tasks. It outperforms existing methods relying on external NLP models for tag selection and achieves superior results in capturing the relationship between images and multi-object text descriptions. The method also shows promising results in open-vocabulary semantic segmentation on various benchmarks, including Pascal VOC, COCO, and ADE20k. The authors acknowledge limitations in their current tagging method, which relies on single text inputs per image, potentially limiting the amount of positive/negative tag information utilized during training. As future work, they suggest exploring the integration of multiple text inputs per image to enrich the learning process. Additionally, they plan to investigate the underlying causes of single tag bias, such as model overfitting or training data characteristics, to further enhance the model's performance. diffusion_model, clip, analysis, segmentation, open-vocabulary, image-text alignment, self-distillation, bias
2403.05135 ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, Gang Yu Diffusion models have demonstrated remarkable performance in the domain of text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts, encompassing multiple objects, detailed attributes, complex relationships, long-text alignment, etc. In this paper, we introduce an Efficient Large Language Model Adapter, termed ELLA, which equips text-to-image diffusion models with powerful Large Language Models (LLM) to enhance text alignment without training of either U-Net or LLM. To seamlessly bridge two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from LLM. Our approach adapts semantic features at different stages of the denoising process, assisting diffusion models in interpreting lengthy and intricate prompts over sampling timesteps. Additionally, ELLA can be readily incorporated with community models and tools to improve their prompt-following capabilities. To assess text-to-image models in dense prompt following, we introduce Dense Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K dense prompts. Extensive experiments demonstrate the superiority of ELLA in dense prompt following compared to state-of-the-art methods, particularly in multiple object compositions involving diverse attributes and relationships. This paper introduces ELLA, a lightweight adapter that equips text-to-image diffusion models with Large Language Models (LLMs) to enhance text alignment without retraining either model. It achieves this by using a Timestep-Aware Semantic Connector (TSC) that dynamically extracts timestep-dependent conditions from the LLM to guide the diffusion process. This paper addresses the limitation of existing text-to-image diffusion models that struggle with comprehending and following long, dense prompts containing multiple objects, attributes, and relationships. ELLA provides an efficient and effective solution by leveraging the power of LLMs while remaining compatible with existing diffusion models and tools. The authors propose a novel architecture, ELLA, that connects a frozen pre-trained LLM (e.g., T5-XL, LLaMA-2) with a frozen pre-trained diffusion model (e.g., Stable Diffusion). The key component, TSC, takes text features from the LLM and the current timestep embedding as input, dynamically extracting semantic information relevant to different stages of the denoising process. To train TSC, the authors constructed a large dataset of image-text pairs with dense captions generated by MLLMs. They also introduce a new benchmark, Dense Prompt Graph Benchmark (DPG-Bench), to evaluate models' ability to follow dense prompts. ELLA significantly improves the performance of existing diffusion models in following complex prompts. It outperforms state-of-the-art models on DPG-Bench and shows better text-image alignment than SDXL and PixArt-α in user studies while maintaining comparable aesthetic quality. ELLA's lightweight design allows for easy integration with community models and downstream tools like LoRA and ControlNet, enhancing their prompt-following capabilities. 
Ablation studies validate the choice of LLM, the TSC design, and the importance of incorporating timestep information. The authors acknowledge limitations in their training captions, which are synthesized by an MLLM and may be unreliable for shape and spatial relationships; they plan to address this by integrating MLLMs with diffusion models to use interleaved image-text input. Another limitation is the potential constraint on the aesthetic quality of generated images due to the frozen U-Net. Future work will focus on image editing capabilities and improving aesthetic quality. diffusion_model, llm, text-to-image, semantic_alignment, dense_prompt, timestep-aware, benchmark, analysis
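A rough sketch of a TSC-like connector for the ELLA entry: learnable queries are shifted by a timestep embedding and cross-attend to frozen-LLM token features to produce the U-Net's conditioning. Dimensions (2048 for a T5-XL-like encoder, 768 for SD 1.x cross-attention, a 256-dim timestep embedding) and the module layout are assumptions; ELLA's actual TSC differs in detail.

```python
import torch
import torch.nn as nn

class TimestepAwareConnector(nn.Module):
    """Timestep-aware resampler sketch: queries + timestep shift, cross-attention over LLM tokens."""
    def __init__(self, llm_dim=2048, cond_dim=768, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, cond_dim) * 0.02)
        self.time_mlp = nn.Sequential(nn.Linear(256, cond_dim), nn.SiLU(), nn.Linear(cond_dim, cond_dim))
        self.kv_proj = nn.Linear(llm_dim, cond_dim)
        self.attn = nn.MultiheadAttention(cond_dim, num_heads, batch_first=True)
        self.out = nn.Linear(cond_dim, cond_dim)

    def forward(self, llm_tokens: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # llm_tokens: (B, L, llm_dim); t_emb: (B, 256) timestep embedding
        q = self.queries.unsqueeze(0) + self.time_mlp(t_emb).unsqueeze(1)   # timestep-aware queries
        kv = self.kv_proj(llm_tokens)
        ctx, _ = self.attn(q, kv, kv)
        return self.out(ctx)                                                # (B, num_queries, cond_dim)


tsc = TimestepAwareConnector()
cond = tsc(torch.randn(2, 128, 2048), torch.randn(2, 256))
print(cond.shape)  # torch.Size([2, 64, 768]) -> used as the U-Net's cross-attention context
```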
2310.13730 Localizing and Editing Knowledge in Text-to-Image Generative Models Samyadeep Basu, Nanxuan Zhao, Vlad Morariu, Soheil Feizi, Varun Manjunatha Text-to-Image Diffusion Models such as Stable-Diffusion and Imagen have achieved unprecedented quality of photorealism with state-of-the-art FID scores on MS-COCO and other generation benchmarks. Given a caption, image generation requires fine-grained knowledge about attributes such as object structure, style, and viewpoint amongst others. Where does this information reside in text-to-image generative models? In our paper, we tackle this question and understand how knowledge corresponding to distinct visual attributes is stored in large-scale text-to-image diffusion models. We adapt Causal Mediation Analysis for text-to-image models and trace knowledge about distinct visual attributes to various (causal) components in the (i) UNet and (ii) text-encoder of the diffusion model. In particular, we show that unlike generative large-language models, knowledge about different attributes is not localized in isolated components, but is instead distributed amongst a set of components in the conditional UNet. These sets of components are often distinct for different visual attributes. Remarkably, we find that the CLIP text-encoder in public text-to-image models such as Stable-Diffusion contains only one causal state across different visual attributes, and this is the first self-attention layer corresponding to the last subject token of the attribute in the caption. This is in stark contrast to the causal states in other language models which are often the mid-MLP layers. Based on this observation of only one causal state in the text-encoder, we introduce a fast, data-free model editing method Diff-QuickFix which can effectively edit concepts in text-to-image models. DiffQuickFix can edit (ablate) concepts in under a second with a closed-form update, providing a significant 1000x speedup and comparable editing performance to existing fine-tuning based editing methods. This paper investigates how knowledge about different visual attributes is stored in large-scale text-to-image diffusion models, specifically focusing on Stable Diffusion. Understanding knowledge storage in text-to-image models is crucial for interpreting their decision-making and enabling targeted model editing without expensive retraining. The authors adapt Causal Mediation Analysis to trace knowledge corresponding to visual attributes like objects, style, color, and action within the UNet and text-encoder components of Stable Diffusion. They identify causal components by corrupting specific attribute information in captions and observing the impact of restoring activations from a clean model. The study reveals that knowledge in the UNet is distributed across various components with different efficacy for different attributes, unlike the localized storage observed in large language models. Remarkably, the CLIP text-encoder exhibits a single causal state: the first self-attention layer corresponding to the last subject token of the attribute in the caption. This finding led to the development of Diff-QuickFix, a fast, data-free model editing method that leverages this localized causal state for efficient concept editing. The paper primarily focuses on Stable Diffusion, leaving analysis on other models for future work. 
Additionally, exploring individual layer components, such as neurons, in greater depth and investigating robustness to adversarial attacks are identified as potential research avenues. The authors also acknowledge the need to address the generalization of edits to neighboring concepts, as observed in the Eiffel Tower ablation example where edits did not fully propagate to related scenery. diffusion_model, analysis, interpretability, text-to-image, stable-diffusion, causal_mediation_analysis, model_editing
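To make the "closed-form update" concrete, here is a generic ridge-regularized least-squares weight edit of the kind Diff-QuickFix applies to the localized text-encoder layer; the paper's exact objective, regularization, and choice of keys/values differ, so treat this as a sketch of the mechanism only.

```python
import torch

def closed_form_edit(W_old: torch.Tensor, K_edit: torch.Tensor, V_target: torch.Tensor,
                     lam: float = 0.1) -> torch.Tensor:
    """Solve min_W ||W K - V||^2 + lam ||W - W_old||^2 in closed form:
       W = (V K^T + lam W_old)(K K^T + lam I)^{-1}.
    W_old: (d_out, d_in); K_edit: (d_in, n) concept keys; V_target: (d_out, n) desired outputs."""
    d_in = W_old.shape[1]
    lhs = V_target @ K_edit.t() + lam * W_old
    rhs = K_edit @ K_edit.t() + lam * torch.eye(d_in)
    return lhs @ torch.linalg.inv(rhs)


# Toy example: remap one "concept" key (e.g. the last subject token) to a target value.
torch.manual_seed(0)
W = torch.randn(768, 768) / 768 ** 0.5
k = torch.randn(768, 1)          # key of the concept to edit
v_new = torch.randn(768, 1)      # desired output for that key
W_edited = closed_form_edit(W, k, v_new, lam=1e-2)
print(torch.norm(W_edited @ k - v_new).item(), torch.norm(W @ k - v_new).item())
```

Because the update is a single matrix solve rather than fine-tuning, it runs in well under a second, consistent with the speedup reported in the entry.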
2308.07673 A Review of Adversarial Attacks in Computer Vision Yutong Zhang, Yao Li, Yin Li, Zhichang Guo Deep neural networks have been widely used in various downstream tasks, especially those safety-critical scenario such as autonomous driving, but deep networks are often threatened by adversarial samples. Such adversarial attacks can be invisible to human eyes, but can lead to DNN misclassification, and often exhibits transferability between deep learning and machine learning models and real-world achievability. Adversarial attacks can be divided into white-box attacks, for which the attacker knows the parameters and gradient of the model, and black-box attacks, for the latter, the attacker can only obtain the input and output of the model. In terms of the attacker's purpose, it can be divided into targeted attacks and non-targeted attacks, which means that the attacker wants the model to misclassify the original sample into the specified class, which is more practical, while the non-targeted attack just needs to make the model misclassify the sample. The black box setting is a scenario we will encounter in practice. This paper presents a comprehensive review of adversarial attacks in computer vision, focusing on their application in image classification, object detection, and semantic segmentation. This review is important because it highlights the vulnerability of deep learning models to adversarial attacks, especially in safety-critical applications like autonomous driving where robustness is paramount. It provides insights into various attack methods and their impact on different computer vision tasks, aiding researchers in developing more robust models and defense mechanisms. The authors conduct a literature review, categorizing attack methods based on various factors such as the attacker's knowledge (white-box vs. black-box), attack goals (targeted vs. non-targeted), query efficiency, and perturbation generation techniques. They analyze each category, discuss seminal works, and explain the principles behind them. Furthermore, they delve into the application of these attack methods in object detection and semantic segmentation, highlighting specific challenges and advancements in these domains. The paper reveals that deep neural networks, even those achieving high accuracy, are surprisingly susceptible to adversarial attacks. Key findings include the effectiveness of both white-box and black-box attacks, the existence of transferable adversarial examples that can fool multiple models, and the feasibility of universal adversarial perturbations effective across a wide range of inputs. Moreover, the paper emphasizes the increased vulnerability of object detection and semantic segmentation models due to their reliance on both classification and localization or pixel-level prediction. The paper acknowledges the ongoing arms race between attackers and defenders, indicating that existing defense mechanisms are often bypassed by new attack strategies. It suggests future work should focus on developing more robust models, possibly incorporating insights from the human visual system, and exploring certified defenses with provable robustness guarantees. Additionally, the paper encourages research on attacks and defenses in more complex real-world scenarios, moving beyond simplified assumptions. 
adversarial_attack, computer_vision, image_classification, object_detection, semantic_segmentation, white-box_attack, black-box_attack, transfer_attack, universal_adversarial_perturbation, analysis, literature_review
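As a concrete instance of the white-box, non-targeted attacks the review categorizes, here is the canonical Fast Gradient Sign Method (FGSM) on a toy classifier; the model and epsilon are placeholders.

```python
import torch
import torch.nn as nn

def fgsm_attack(model: nn.Module, x: torch.Tensor, y: torch.Tensor, eps: float = 8 / 255) -> torch.Tensor:
    """FGSM: perturb the input by eps in the direction of the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()


# Toy classifier on 32x32 RGB inputs (stand-in for an image model).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = fgsm_attack(model, x, y)
print((x_adv - x).abs().max().item())  # perturbation is bounded by eps
```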
2308.07665 Inversion-by-Inversion: Exemplar-based Sketch-to-Photo Synthesis via Stochastic Differential Equations without Training Ximing Xing, Chuang Wang, Haitao Zhou, Zhihao Hu, Chongxuan Li, Dong Xu, Qian Yu Exemplar-based sketch-to-photo synthesis allows users to generate photo-realistic images based on sketches. Recently, diffusion-based methods have achieved impressive performance on image generation tasks, enabling highly-flexible control through text-driven generation or energy functions. However, generating photo-realistic images with color and texture from sketch images remains challenging for diffusion models. Sketches typically consist of only a few strokes, with most regions left blank, making it difficult for diffusion-based methods to produce photo-realistic images. In this work, we propose a two-stage method named ``Inversion-by-Inversion" for exemplar-based sketch-to-photo synthesis. This approach includes shape-enhancing inversion and full-control inversion. During the shape-enhancing inversion process, an uncolored photo is generated with the guidance of a shape-energy function. This step is essential to ensure control over the shape of the generated photo. In the full-control inversion process, we propose an appearance-energy function to control the color and texture of the final generated photo.Importantly, our Inversion-by-Inversion pipeline is training-free and can accept different types of exemplars for color and texture control. We conducted extensive experiments to evaluate our proposed method, and the results demonstrate its effectiveness. The code and project can be found at https://ximinng.github.io/inversion-by-inversion-project/. This paper introduces Inversion-by-Inversion, a novel two-stage method for exemplar-based sketch-to-photo synthesis using stochastic differential equations (SDE) without training, allowing users to generate photo-realistic images guided by both a sketch and an exemplar image. This paper is important as it addresses the challenge of generating photo-realistic images from sketches, which are inherently sparse, using pre-trained diffusion models. The proposed method effectively combines shape control from sketches with appearance control from exemplar images, advancing the field of sketch-to-photo synthesis. The authors propose a two-stage approach: 1) Shape-enhancing inversion: An uncolored photo is generated from the input sketch using a shape-energy function to guide the SDE inversion process, emphasizing shape preservation. 2) Full-control inversion: Using the uncolored photo and an exemplar image, the final photo is generated using both shape-energy and appearance-energy functions to guide the SDE inversion process, adding color and texture from the exemplar while retaining the sketch's shape. The paper shows that Inversion-by-Inversion outperforms existing SDE-based image translation methods in terms of FID score and shape fidelity, demonstrating its ability to generate more realistic and shape-consistent images. The method effectively uses various exemplars, including photos, stroke images, segmentation maps, and style images, showcasing its versatility. The ablation study confirms the importance of both the shape-enhancing step and the energy functions for achieving high-quality results. The authors acknowledge that future work could explore alternative shape-energy functions and appearance-energy functions to further enhance the performance. 
Additionally, investigating the generalization ability of the method to handle more complex scenes and diverse sketch styles is a promising direction. diffusion_model, sde, sketch-to-photo, exemplar-based, image_synthesis, shape_control, appearance_control, energy_function
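A loose sketch of the energy-guided reverse step underlying the Inversion-by-Inversion entry: take the ordinary denoising/SDE update, then nudge the sample down the gradient of the active energy functions (shape energy in stage one; shape plus appearance energies in stage two). The `denoise_step` and energy callables below are placeholders, not the paper's exact functions.

```python
import torch

def energy_guided_step(x: torch.Tensor, denoise_step, energy_fns, guidance_weights):
    """One guided reverse-diffusion step: standard update followed by an energy-gradient nudge."""
    x = denoise_step(x)                                   # ordinary reverse-SDE / denoising update
    x = x.detach().requires_grad_(True)
    total_energy = sum(w * fn(x) for fn, w in zip(energy_fns, guidance_weights))
    grad, = torch.autograd.grad(total_energy, x)
    return (x - grad).detach()                            # step toward lower energy


# Toy usage: "shape" energy pulls x toward a sketch-like target, "appearance" toward an exemplar.
sketch_target, exemplar = torch.zeros(1, 3, 64, 64), torch.ones(1, 3, 64, 64)
shape_energy = lambda x: ((x.mean(1, keepdim=True) - sketch_target.mean(1, keepdim=True)) ** 2).sum()
appearance_energy = lambda x: ((x.mean(dim=(2, 3)) - exemplar.mean(dim=(2, 3))) ** 2).sum()
x = torch.randn(1, 3, 64, 64)
x = energy_guided_step(x, denoise_step=lambda z: 0.9 * z,
                       energy_fns=[shape_energy, appearance_energy], guidance_weights=[0.1, 0.1])
print(x.shape)
```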
2404.02747 Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models Wentian Zhang, Haozhe Liu, Jinheng Xie, Francesco Faccio, Mike Zheng Shou, Jürgen Schmidhuber This study explores the role of cross-attention during inference in text-conditional diffusion models. We find that cross-attention outputs converge to a fixed point after few inference steps. Accordingly, the time point of convergence naturally divides the entire inference process into two stages: an initial semantics-planning stage, during which, the model relies on cross-attention to plan text-oriented visual semantics, and a subsequent fidelity-improving stage, during which the model tries to generate images from previously planned semantics. Surprisingly, ignoring text conditions in the fidelity-improving stage not only reduces computation complexity, but also maintains model performance. This yields a simple and training-free method called TGATE for efficient generation, which caches the cross-attention output once it converges and keeps it fixed during the remaining inference steps. Our empirical study on the MS-COCO validation set confirms its effectiveness. The source code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE. This paper investigates the role of cross-attention in text-to-image diffusion models during inference and finds that cross-attention maps converge quickly, becoming redundant in later inference steps. This paper is important because it challenges the assumption that cross-attention is crucial for every inference step in text-to-image diffusion models, offering a potential path to significantly reduce computational cost without sacrificing image quality. The authors analyze the role of cross-attention by replacing text embeddings with null embeddings at various stages of the inference process. They then quantitatively evaluate the impact of this replacement on image generation quality using FID scores on the MS-COCO dataset. They also visualize the generated images at different inference steps to understand the dynamics of cross-attention. The key finding is that cross-attention outputs converge to a fixed point early in the inference process. The authors leverage this finding to develop TGATE, a training-free method that caches and reuses cross-attention outputs from early inference steps, leading to reduced computational cost (up to 50% reduction in latency) and even slight improvements in FID scores compared to baseline models. Notably, TGATE is effective across various model architectures, including both convolutional and transformer-based diffusion models. The authors acknowledge that while TGATE brings quantitative improvements in FID scores, the visual differences in generated images might be subtle for users. As for future work, the authors suggest exploring the impact of scaling token length and image resolution on the efficiency gains provided by TGATE, hinting at its potential benefits for the emerging trend of larger input sizes in diffusion models. diffusion_model, cross-attention, inference, efficiency, analysis, text-to-image
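A wrapper-style sketch of the TGATE caching idea: before a chosen gate step the cross-attention output is computed and cached; afterwards the cached output is reused instead of recomputing attention against the text embeddings. The real TGATE hooks into the diffusers attention processors; `nn.MultiheadAttention` here is just a stand-in.

```python
import torch
import torch.nn as nn

class CachedCrossAttention(nn.Module):
    """Cache cross-attention output after a gate step and reuse it for remaining steps (sketch)."""
    def __init__(self, attn: nn.MultiheadAttention, gate_step: int = 10):
        super().__init__()
        self.attn = attn
        self.gate_step = gate_step
        self.cache = None

    def forward(self, hidden: torch.Tensor, text_ctx: torch.Tensor, step: int) -> torch.Tensor:
        if step < self.gate_step or self.cache is None:
            out, _ = self.attn(hidden, text_ctx, text_ctx)   # semantics-planning stage
            self.cache = out.detach()
            return out
        return self.cache                                    # fidelity-improving stage: reuse


attn = CachedCrossAttention(nn.MultiheadAttention(64, 4, batch_first=True), gate_step=10)
hidden, text_ctx = torch.randn(1, 256, 64), torch.randn(1, 77, 64)
for step in range(25):                                       # 25 inference steps
    out = attn(hidden, text_ctx, step)
print(out.shape)
```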
2403.18551 Attention Calibration for Disentangled Text-to-Image Personalization Yanbing Zhang, Mengping Yang, Qin Zhou, Zhe Wang Recent thrilling progress in large-scale text-to-image (T2I) models has unlocked unprecedented synthesis quality of AI-generated content (AIGC) including image generation, 3D and video composition. Further, personalized techniques enable appealing customized production of a novel concept given only several images as reference. However, an intriguing problem persists: Is it possible to capture multiple, novel concepts from one single reference image? In this paper, we identify that existing approaches fail to preserve visual consistency with the reference image and eliminate cross-influence from concepts. To alleviate this, we propose an attention calibration mechanism to improve the concept-level understanding of the T2I model. Specifically, we first introduce new learnable modifiers bound with classes to capture attributes of multiple concepts. Then, the classes are separated and strengthened following the activation of the cross-attention operation, ensuring comprehensive and self-contained concepts. Additionally, we suppress the attention activation of different classes to mitigate mutual influence among concepts. Together, our proposed method, dubbed DisenDiff, can learn disentangled multiple concepts from one single image and produce novel customized images with learned concepts. We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations. More importantly, our proposed techniques are compatible with LoRA and inpainting pipelines, enabling more interactive experiences. This paper introduces DisenDiff, a personalized text-to-image generation model that can learn multiple concepts from a single image and generate novel images with those concepts in different contexts. The paper addresses a key limitation in existing personalized text-to-image models, which struggle to capture multiple distinct concepts from a single reference image. This is important because it allows for more flexible and creative image generation from a limited amount of input data. The authors propose an attention calibration mechanism for a text-to-image diffusion model. This involves introducing new learnable modifiers bound to classes to capture distinct concepts and then applying constraints within the cross-attention mechanism to ensure accurate and disentangled representation of each concept. DisenDiff outperforms state-of-the-art methods in both qualitative and quantitative evaluations, demonstrating superior image fidelity and concept disentanglement. The authors also showcase its flexibility in applications like personalized concept inpainting and integration with LoRA for enhanced texture details. The authors acknowledge limitations in disentangling fine-grained categories within the same class (e.g., dog breeds) and handling images with more than three concepts. Future work could explore algorithms tailored to these scenarios and address the limitations of existing text-to-image models when dealing with a higher number of concepts. diffusion_model, image_generation, personalization, attention_mechanism, disentanglement, text-to-image, inpainting, lora
2403.14572 Implicit Style-Content Separation using B-LoRA Yarden Frenkel, Yael Vinker, Ariel Shamir, Daniel Cohen-Or Image stylization involves manipulating the visual appearance and texture (style) of an image while preserving its underlying objects, structures, and concepts (content). The separation of style and content is essential for manipulating the image's style independently from its content, ensuring a harmonious and visually pleasing result. Achieving this separation requires a deep understanding of both the visual and semantic characteristics of images, often necessitating the training of specialized models or employing heavy optimization. In this paper, we introduce B-LoRA, a method that leverages LoRA (Low-Rank Adaptation) to implicitly separate the style and content components of a single image, facilitating various image stylization tasks. By analyzing the architecture of SDXL combined with LoRA, we find that jointly learning the LoRA weights of two specific blocks (referred to as B-LoRAs) achieves style-content separation that cannot be achieved by training each B-LoRA independently. Consolidating the training into only two blocks and separating style and content allows for significantly improving style manipulation and overcoming overfitting issues often associated with model fine-tuning. Once trained, the two B-LoRAs can be used as independent components to allow various image stylization tasks, including image style transfer, text-based image stylization, consistent style generation, and style-content mixing. This paper presents B-LoRA, a novel method for implicit style-content separation in single images using Low-Rank Adaptation (LoRA) applied to specific transformer blocks in Stable Diffusion XL, enabling various image stylization tasks like style transfer, text-guided stylization, and consistent style generation. B-LoRA addresses the limitations of existing image stylization techniques, including overfitting issues associated with model fine-tuning and the need for separate models for style and content. By achieving style-content separation within a single image using a lightweight adapter, it offers flexibility, efficiency, and robust stylization capabilities. The authors analyzed SDXL's architecture to identify specific transformer blocks responsible for content and style. They then trained LoRA on these blocks (B-LoRAs) using a single input image and a general text prompt, resulting in an implicit style-content decomposition. The trained B-LoRAs can then be applied to various style manipulation tasks without additional training. B-LoRA effectively disentangles style and content, enabling high-quality image style transfer, text-guided style manipulation, and consistent style generation even for challenging inputs like stylized images and complex scenes. Extensive qualitative and quantitative evaluations, including a user study, demonstrate its superiority over alternative approaches. The authors acknowledge limitations such as color separation affecting identity preservation, potential style leakage from background elements in style images, and challenges with complex scenes. They suggest future work focusing on finer style-content sub-component separation and extending B-LoRA for multi-object and multi-style combinations. diffusion_model, lora, image_stylization, style_transfer, text_guided_image_editing, analysis, sdxl
2404.03592 ReFT: Representation Finetuning for Language Models Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts Parameter-efficient fine-tuning (PEFT) methods seek to adapt large models via updates to a small number of weights. However, much prior interpretability work has shown that representations encode rich semantic information, suggesting that editing representations might be a more powerful alternative. Here, we pursue this hypothesis by developing a family of $\textbf{Representation Finetuning (ReFT)}$ methods. ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations. We define a strong instance of the ReFT family, Low-rank Linear Subspace ReFT (LoReFT). LoReFT is a drop-in replacement for existing PEFTs and learns interventions that are 10x-50x more parameter-efficient than prior state-of-the-art PEFTs. We showcase LoReFT on eight commonsense reasoning tasks, four arithmetic reasoning tasks, Alpaca-Eval v1.0, and GLUE. In all these evaluations, LoReFT delivers the best balance of efficiency and performance, and almost always outperforms state-of-the-art PEFTs. We release a generic ReFT training library publicly at https://github.com/stanfordnlp/pyreft. This paper introduces ReFT, a novel parameter-efficient fine-tuning method that modifies model representations through learned interventions, outperforming weight-based methods like LoRA in efficiency and achieving state-of-the-art performance on various NLP tasks. This paper is important because it challenges the prevailing focus on weight-based PEFTs, proposing a more efficient and interpretable approach by leveraging the rich semantic information encoded in model representations. This approach opens up new possibilities for controlling and understanding large language models. The authors develop ReFT, a method that learns low-rank interventions on model representations, inspired by causal abstraction and distributed interchange interventions. They evaluate ReFT on four diverse NLP benchmarks, including commonsense reasoning, arithmetic reasoning, instruction-following, and natural language understanding, comparing its performance and efficiency against existing PEFT methods like LoRA, Adapters, and Prefix-tuning. ReFT significantly outperforms previous PEFT methods on commonsense reasoning, instruction-following, and natural language understanding benchmarks, achieving state-of-the-art results while using 10-50 times fewer parameters than LoRA. It also demonstrates strong performance on arithmetic reasoning tasks, surpassing Prefix-tuning. Furthermore, the paper explores the memorization capabilities of ReFT, showing that a single low-rank intervention can store a surprisingly large amount of information, and provides evidence for the superposition of token identities in model representations. The authors acknowledge limitations in terms of model diversity, primarily exploring LLaMA-family models. Future work could investigate ReFT's effectiveness on other model families like Mistral or GPT. Further exploration of ReFT's design space, including automating the hyperparameter search and developing more effective interventions for specific tasks like arithmetic reasoning, is also suggested. Additionally, the authors highlight the need for more robust evaluation practices in PEFT research, advocating for benchmarks that prevent test-set hill-climbing and allow for fair comparisons. 
llm, peft, fine-tuning, analysis, interpretability
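The LoReFT intervention described in the ReFT entry has a compact form, phi(h) = h + R^T (W h + b - R h), where R has (near-)orthonormal rows spanning a low-rank subspace and only R, W, b are trained. A minimal sketch (the paper additionally constrains which layers and token positions receive the intervention):

```python
import torch
import torch.nn as nn

class LoReFTIntervention(nn.Module):
    """Low-rank linear subspace intervention on a hidden state: h + R^T (W h + b - R h)."""
    def __init__(self, hidden_dim: int, rank: int = 4):
        super().__init__()
        r = torch.empty(rank, hidden_dim)
        nn.init.orthogonal_(r)
        self.R = nn.Parameter(r)                       # (rank, d); kept near-orthonormal in practice
        self.proj = nn.Linear(hidden_dim, rank)        # W h + b

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        delta = self.proj(h) - h @ self.R.t()          # (.., rank)
        return h + delta @ self.R                      # edit only inside the subspace spanned by R


reft = LoReFTIntervention(hidden_dim=4096, rank=4)     # e.g. a LLaMA-2-7B hidden state
h = torch.randn(2, 4096)
print(reft(h).shape, sum(p.numel() for p in reft.parameters()))  # ~33K trainable parameters
```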
2401.00110 Diffusion Model with Perceptual Loss Shanchuan Lin, Xiao Yang Diffusion models trained with mean squared error loss tend to generate unrealistic samples. Current state-of-the-art models rely on classifier-free guidance to improve sample quality, yet its surprising effectiveness is not fully understood. In this paper, we show that the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance. As a result, we can directly incorporate perceptual loss in diffusion training to improve sample quality. Since the score matching objective used in diffusion training strongly resembles the denoising autoencoder objective used in unsupervised training of perceptual networks, the diffusion model itself is a perceptual network and can be used to generate meaningful perceptual loss. We propose a novel self-perceptual objective that results in diffusion models capable of generating more realistic samples. For conditional generation, our method only improves sample quality without entanglement with the conditional input and therefore does not sacrifice sample diversity. Our method can also improve sample quality for unconditional generation, which was not possible with classifier-free guidance before. This paper proposes a novel "self-perceptual" training objective for diffusion models that leverages the model itself as a perceptual network to improve the realism of generated images. This paper addresses the limitations of relying on classifier-free guidance for improving sample quality in diffusion models by introducing a method that enhances realism without sacrificing diversity, works for both conditional and unconditional generation, and is integrated directly into the training process. The authors propose a "self-perceptual" objective where a frozen copy of the diffusion model, trained with a standard MSE loss, acts as a perceptual network. During training, the online model generates an image, both images are passed through the perceptual network at a randomly sampled timestep, and the MSE loss between their hidden features is backpropagated to the online model. The self-perceptual objective demonstrably improves the realism of generated images, both qualitatively and quantitatively (FID, IS), compared to models trained solely with MSE loss, particularly in unconditional image generation. However, it doesn't yet surpass the performance of classifier-free guidance combined with MSE loss for text-to-image generation. The authors acknowledge that the self-perceptual objective currently doesn't outperform classifier-free guidance in text-to-image generation. Additionally, they identify grid-like artifacts in the generated images as an area for future investigation. Future work could focus on refining the perceptual loss mechanism, exploring alternative distance functions, and addressing the identified artifacts. diffusion_model, perceptual_loss, image_generation, unconditional_generation, classifier-free_guidance
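A sketch of the self-perceptual objective described above, assuming a tiny convolutional stand-in for the denoiser: a frozen copy of the (MSE-pretrained) model serves as the perceptual network, the online prediction and the ground truth are noised with the same noise and timestep, and their hidden features are matched with MSE. The feature choice, noise schedule, and weighting are simplifications of the paper's recipe.

```python
import copy
import torch
import torch.nn as nn

# Tiny stand-in for a denoiser backbone; the real method uses the diffusion U-Net itself.
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.SiLU(), nn.Conv2d(16, 16, 3, padding=1))
frozen = copy.deepcopy(backbone).requires_grad_(False)   # frozen copy acts as the perceptual network

def self_perceptual_loss(x0_pred: torch.Tensor, x0_true: torch.Tensor) -> torch.Tensor:
    """MSE between frozen-copy features of the prediction and of the ground truth (sketch)."""
    t = torch.rand(x0_pred.shape[0], 1, 1, 1)
    eps = torch.randn_like(x0_true)
    feats_pred = frozen((1 - t) * x0_pred + t * eps)      # gradients flow into x0_pred
    with torch.no_grad():
        feats_true = frozen((1 - t) * x0_true + t * eps)  # same noise and timestep
    return (feats_pred - feats_true).pow(2).mean()


x0_true = torch.rand(2, 3, 32, 32)
x0_pred = (x0_true + 0.1 * torch.randn_like(x0_true)).requires_grad_(True)
loss = self_perceptual_loss(x0_pred, x0_true)
loss.backward()
print(loss.item(), x0_pred.grad.abs().sum().item() > 0)
```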
2311.17035 Scalable Extraction of Training Data from (Production) Language Models Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization. This paper investigates "extractable memorization" in large language models, focusing on the ability of adversaries to extract training data from these models without prior knowledge of the training set. The paper highlights the significant privacy implications of training large language models, demonstrating that even aligned models like ChatGPT can leak substantial amounts of training data, including personally identifiable information (PII). This raises concerns about the security of training data and the effectiveness of current alignment techniques in preventing memorization. The authors develop a scalable methodology to detect memorization in large language models by matching model outputs against publicly available web-scale datasets using suffix arrays. For aligned models like ChatGPT, they introduce a novel "divergence" attack that prompts the model to deviate from its conversational style and emit training data at a much higher rate. They also employ a Good-Turing estimator to extrapolate total memorization based on the rate of unique memorized outputs. The authors find that all models, including open-source, semi-closed, and closed (API-based) models, exhibit extractable memorization. Larger and more capable models are more vulnerable to data extraction attacks. Notably, their divergence attack on ChatGPT reveals that it is significantly more susceptible to memorization than previously thought, leaking gigabytes of training data, including PII. They also find that certain words are more effective at eliciting memorized outputs during the divergence attack. The study demonstrates that current alignment techniques do not eliminate memorization and that discoverable memorization is a useful but not perfect proxy for extractable memorization. The authors acknowledge that their analysis may underestimate the true memorization rate due to limitations in the size and coverage of their auxiliary dataset. They also note that their attack on ChatGPT is specific to this model and may not generalize to other aligned chatbots. Future work could investigate the effectiveness of data deduplication techniques in mitigating memorization, explore the relationship between model capacity and memorization, and develop more generalizable attacks to assess the privacy of black-box RLHF-aligned models. llm, analysis, memorization, privacy, data_extraction, alignment, chatgpt, divergence_attack, suffix_array, good-turing_estimator, pii
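The entry mentions a Good-Turing estimator for extrapolating memorization; the textbook form estimates the probability that the next extracted sample is a previously unseen item as (number of items seen exactly once) / (total observations). The sketch below shows only that basic estimate, not the authors' full fitting procedure.

```python
from collections import Counter

def good_turing_unseen_mass(samples: list[str]) -> float:
    """Good-Turing estimate of unseen probability mass: singletons / total observations."""
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples)


# Toy run: extracted "memorized" strings, some seen repeatedly, some only once.
extracted = ["a@example.com", "lorem ipsum", "a@example.com", "guid-123", "guid-456"]
print(good_turing_unseen_mass(extracted))  # 0.6 -> substantial unseen memorization likely remains
```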
2404.04095 Dynamic Prompt Optimizing for Text-to-Image Generation Wenyi Mo, Tianyu Zhang, Yalong Bai, Bing Su, Ji-Rong Wen, Qing Yang Text-to-image generative models, specifically those based on diffusion models like Imagen and Stable Diffusion, have made substantial advancements. Recently, there has been a surge of interest in the delicate refinement of text prompts. Users assign weights or alter the injection time steps of certain words in the text prompts to improve the quality of generated images. However, the success of fine-control prompts depends on the accuracy of the text prompts and the careful selection of weights and time steps, which requires significant manual intervention. To address this, we introduce the \textbf{P}rompt \textbf{A}uto-\textbf{E}diting (PAE) method. Besides refining the original prompts for image generation, we further employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to the dynamic fine-control prompts. The reward function during training encourages the model to consider aesthetic score, semantic consistency, and user preferences. Experimental results demonstrate that our proposed method effectively improves the original prompts, generating visually more appealing images while maintaining semantic alignment. Code is available at https://github.com/Mowenyii/PAE. This paper introduces PAE, a novel two-stage framework employing reinforcement learning to automatically edit and refine text prompts for diffusion-based text-to-image synthesis, enhancing both image quality and alignment with user intent. This work addresses the challenge of manual prompt engineering in text-to-image generation. It enables fine-grained control over image generation by dynamically adjusting word importance and injection time steps in the diffusion process, leading to higher-quality images that better reflect user preferences. The authors propose a two-stage training process: 1) Fine-tuning a language model on a curated text-image dataset to refine initial prompts. 2) Using online reinforcement learning to optimize a policy model, which learns to add modifiers with specific effect ranges and weights to the refined prompts, guided by a reward function that considers aesthetic quality, semantic consistency, and user preference. PAE generates higher-quality images compared to using short prompts or prompts generated by other methods, evidenced by improved aesthetic scores, CLIP scores, and PickScores. The method demonstrates robust performance on both in-domain and out-of-domain datasets, highlighting its versatility and generalization ability. The learned policy model exhibits a preference for adding modifiers related to art trends, styles, and textures, leading to more visually appealing results without significantly altering the prompt's original meaning. The authors acknowledge limitations regarding potential for attribute leakage and missing objects, suggesting the incorporation of control attention maps into the action space for finer control over the generation process as future work. Further improvements could involve integrating additional reward considerations like high resolution and proportional composition to enhance image quality and realism. The paper also suggests exploring techniques to ensure consistent role generation building upon the model's capability to maintain identity consistency. 
diffusion_model, text-to-image, prompt_engineering, reinforcement_learning, aesthetic_quality, semantic_consistency, user_preference
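To illustrate the reward structure the PAE entry describes, the sketch below combines aesthetic, semantic-consistency, and human-preference scores into a single scalar reward. The upstream scorers (an aesthetic predictor, CLIP similarity, a PickScore-like preference model) and the weights are assumptions, not the paper's exact choices.

```python
import torch

def pae_style_reward(aesthetic_score: torch.Tensor, clip_similarity: torch.Tensor,
                     preference_score: torch.Tensor, w=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted combination of aesthetic quality, image-text consistency, and user preference."""
    return w[0] * aesthetic_score + w[1] * clip_similarity + w[2] * preference_score


# Toy usage with fake per-image scores for four candidate prompt edits.
r = pae_style_reward(torch.tensor([5.1, 5.6, 4.9, 6.0]),
                     torch.tensor([0.28, 0.31, 0.25, 0.30]),
                     torch.tensor([0.4, 0.7, 0.2, 0.8]),
                     w=(0.2, 5.0, 1.0))
print(r, r.argmax().item())  # the RL policy is rewarded for edits that raise this score
```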
2308.09991 AltDiffusion: A Multilingual Text-to-Image Diffusion Model Fulong Ye, Guang Liu, Xinya Wu, Ledell Wu Large Text-to-Image(T2I) diffusion models have shown a remarkable capability to produce photorealistic and diverse images based on text inputs. However, existing works only support limited language input, e.g., English, Chinese, and Japanese, leaving users beyond these languages underserved and blocking the global expansion of T2I models. Therefore, this paper presents AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages. Specifically, we first train a multilingual text encoder based on the knowledge distillation. Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage schema to enhance the multilingual capability, including concept alignment and quality improvement stage on a large-scale multilingual dataset. Furthermore, we introduce a new benchmark, which includes Multilingual-General-18(MG-18) and Multilingual-Cultural-18(MC-18) datasets, to evaluate the capabilities of T2I diffusion models for generating high-quality images and capturing culture-specific concepts in different languages. Experimental results on both MG-18 and MC-18 demonstrate that AltDiffusion outperforms current state-of-the-art T2I models, e.g., Stable Diffusion in multilingual understanding, especially with respect to culture-specific concepts, while still having comparable capability for generating high-quality images. All source code and checkpoints could be found in https://github.com/superhero-7/AltDiffuson. This paper introduces AltDiffusion, a novel multilingual text-to-image diffusion model capable of generating images from prompts in eighteen different languages. This paper is important because it addresses the language limitations of existing text-to-image models, making them accessible to a wider global audience and improving their ability to understand and generate images from prompts with culture-specific concepts. The authors first train a multilingual text encoder using knowledge distillation from a pre-trained English CLIP model. This encoder is then integrated into a pre-trained English diffusion model and fine-tuned using a two-stage training schema. The first stage aligns the text encoder and the diffusion model's embedding space, while the second stage focuses on improving the quality of generated images using a high-quality multilingual dataset and classifier-free guidance. AltDiffusion outperforms existing multilingual text-to-image models in terms of both image quality and multilingual understanding, especially on culture-specific concepts. It achieves comparable results to the English Stable Diffusion model on general prompts and exhibits better performance in understanding and generating images from prompts containing culture-specific concepts. The paper does not explicitly mention limitations, but future work could explore expanding the model to support more languages, improving the generation quality for certain languages, and further evaluating the model's capabilities in different downstream applications. diffusion_model, multilingual, text-to-image, culture-specific, knowledge_distillation
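A sketch of the first-stage knowledge-distillation objective in the AltDiffusion entry: on a parallel sentence pair, push the multilingual student encoder's embedding toward the frozen English CLIP teacher's embedding. The toy encoders, tokenization, and pooling below are placeholders for the actual XLM-R-style student and CLIP teacher.

```python
import torch
import torch.nn as nn

def distill_step(student: nn.Module, teacher: nn.Module,
                 src_ids: torch.Tensor, en_ids: torch.Tensor) -> torch.Tensor:
    """MSE distillation on a parallel (non-English, English) sentence pair (sketch)."""
    with torch.no_grad():
        target = teacher(en_ids)                  # frozen English teacher embedding
    return nn.functional.mse_loss(student(src_ids), target)


class ToyEncoder(nn.Module):
    """Embedding + mean-pool stand-in for a sentence encoder."""
    def __init__(self, vocab=1000, dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)

    def forward(self, ids):
        return self.emb(ids).mean(dim=1)


student, teacher = ToyEncoder(), ToyEncoder().requires_grad_(False)
loss = distill_step(student, teacher, torch.randint(0, 1000, (8, 32)), torch.randint(0, 1000, (8, 32)))
loss.backward()
print(loss.item())
```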
2403.16627 SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions Yuda Song, Zehao Sun, Xuanwu Yin Recent advancements in diffusion models have positioned them at the forefront of image generation. Despite their superior performance, diffusion models are not without drawbacks; they are characterized by complex architectures and substantial computational demands, resulting in significant latency due to their iterative sampling process. To mitigate these limitations, we introduce a dual approach involving model miniaturization and a reduction in sampling steps, aimed at significantly decreasing model latency. Our methodology leverages knowledge distillation to streamline the U-Net and image decoder architectures, and introduces an innovative one-step DM training technique that utilizes feature matching and score distillation. We present two models, SDXS-512 and SDXS-1024, achieving inference speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a single GPU, respectively. Moreover, our training approach offers promising applications in image-conditioned control, facilitating efficient image-to-image translation. This paper introduces SDXS, a novel approach to distill large-scale diffusion models for text-to-image generation into efficient models capable of real-time inference on GPUs, achieving speeds of up to 100 FPS for 512x512 images and 30 FPS for 1024x1024 images. This work is important as it addresses the limitations of traditional diffusion models, which suffer from slow inference speeds due to their multi-step sampling process, hindering their deployment on edge devices or applications requiring real-time performance. The authors employ a dual approach: 1) Model miniaturization: Knowledge distillation is used to compress the U-Net and image decoder architectures. 2) One-step training: A novel training technique combines feature matching and score distillation to reduce the sampling process to a single step. The resulting models, SDXS-512 and SDXS-1024, demonstrate significant speed improvements (30x and 60x faster than their base counterparts) while maintaining comparable image quality. Furthermore, the proposed method can be adapted for image-conditioned generation tasks using ControlNet, enabling applications like image-to-image translation. The authors acknowledge limitations in image diversity when using ControlNet for image-to-image translation. Future work will focus on improving diversity and exploring applications like inpainting and super-resolution, particularly on edge devices. diffusion_model, knowledge_distillation, one-step_training, real-time_inference, text-to-image, image-to-image, controlnet, latency_optimization
2311.17086 PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation Jian Ma, Chen Chen, Qingsong Xie, Haonan Lu Text-to-image diffusion models are well-known for their ability to generate realistic images based on textual prompts. However, the existing works have predominantly focused on English, lacking support for non-English text-to-image models. The most commonly used translation methods cannot solve the generation problem related to language culture, while training from scratch on a specific language dataset is prohibitively expensive. In this paper, we are inspired to propose a simple plug-and-play language transfer method based on knowledge distillation. All we need to do is train a lightweight MLP-like parameter-efficient adapter (PEA) with only 6M parameters under teacher knowledge distillation along with a small parallel data corpus. We are surprised to find that freezing the parameters of UNet can still achieve remarkable performance on the language-specific prompt evaluation set, demonstrating that PEA can stimulate the potential generation ability of the original UNet. Additionally, it closely approaches the performance of the English text-to-image model on a general prompt evaluation set. Furthermore, our adapter can be used as a plugin to achieve significant results in downstream tasks in cross-lingual text-to-image generation. Code will be available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion This paper introduces PEA-Diffusion, a novel method using a plug-and-play adapter and knowledge distillation to adapt English-based text-to-image diffusion models for non-English languages and culture-specific image generation. This paper is important because it addresses the limitations of current text-to-image models that primarily focus on English, making them accessible to non-English speakers and enabling the generation of culturally relevant images. The authors propose PEA-Diffusion, which uses a lightweight MLP adapter and knowledge distillation from a pre-trained English diffusion model (Stable Diffusion) to guide the learning of a non-English counterpart. They freeze the parameters of the original model, train the adapter with a small parallel corpus, and employ a hybrid training strategy that leverages both parallel and culture-specific image-text pairs. PEA-Diffusion achieves significant improvements over baseline methods like translation, AltDiffusion, and GlueGen, particularly in generating culturally relevant images. It demonstrates superior performance on CLIPScore for culture-specific prompts, retains strong performance on general prompts, and exhibits low training costs and plug-and-play capabilities with other downstream tasks like LoRA, ControlNet, and Inpainting. The paper acknowledges limitations in the performance of language-specific CLIP encoders, potentially hindering the model's generalizability. Additionally, the approach is limited by the capabilities of the base English model. Future work aims to address these limitations and explore further improvements in both general and culture-specific image generation. diffusion_model, language_transfer, knowledge_distillation, multilingual, text-to-image, culture-specific, adapter, parameter-efficient
2312.03766 Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment Brian Gordon, Yonatan Bitton, Yonatan Shafir, Roopal Garg, Xi Chen, Dani Lischinski, Daniel Cohen-Or, Idan Szpektor While existing image-text alignment models reach high quality binary assessments, they fall short of pinpointing the exact source of misalignment. In this paper, we present a method to provide detailed textual and visual explanation of detected misalignments between text-image pairs. We leverage large language models and visual grounding models to automatically construct a training set that holds plausible misaligned captions for a given image and corresponding textual explanations and visual indicators. We also publish a new human curated test set comprising ground-truth textual and visual misalignment annotations. Empirical results show that fine-tuning vision language models on our training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines both on the binary alignment classification and the explanation generation tasks. Our method code and human curated test set are available at: https://mismatch-quest.github.io/ This paper presents a method to explain the misalignment between text and images in image-text alignment models by leveraging LLMs and visual grounding models to generate plausible misaligned captions and their corresponding textual and visual explanations. This paper is important because it addresses the limitation of existing image-text alignment models which only provide a binary assessment of alignment and fail to pinpoint the source of misalignment. The proposed method enables detailed understanding of misalignment causes and facilitates the development of better image-text alignment models. The authors propose a method called Mismatch-Quest which first collects aligned image-text pairs from various datasets, then utilizes LLMs to generate misaligned captions along with their textual and visual explanations. To ensure quality, they validate the generated captions and feedback using entailment models and utilize a visual grounding model to annotate the misalignments with bounding boxes. The authors create a comprehensive training set named TV-Feedback with 3 million instances. They also introduce a human-annotated test set named Mismatch-Quest Benchmark with 2,008 instances. Fine-tuning PaLI vision language models on TV-Feedback outperforms other baselines on both binary alignment classification and explanation generation tasks, achieving over 10% improvement in alignment accuracy and 20% in textual feedback entailment. The authors identify limitations like failing to handle scenarios with no visual feedback expected and struggling with instances requiring identification of multiple misalignments. Future work includes enriching the training set with such scenarios to improve the model's ability to address diverse misalignment types. image-text alignment, llm, visual grounding, misalignment explanation, dataset, analysis, evaluation
2311.13600 ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs Viraj Shah, Nataniel Ruiz, Forrester Cole, Erika Lu, Svetlana Lazebnik, Yuanzhen Li, Varun Jampani Methods for finetuning generative models for concept-driven personalization generally achieve strong results for subject-driven or style-driven generation. Recently, low-rank adaptations (LoRA) have been proposed as a parameter-efficient way of achieving concept-driven personalization. While recent work explores the combination of separate LoRAs to achieve joint generation of learned styles and subjects, existing techniques do not reliably address the problem; they often compromise either subject fidelity or style fidelity. We propose ZipLoRA, a method to cheaply and effectively merge independently trained style and subject LoRAs in order to achieve generation of any user-provided subject in any user-provided style. Experiments on a wide range of subject and style combinations show that ZipLoRA can generate compelling results with meaningful improvements over baselines in subject and style fidelity while preserving the ability to recontextualize. Project page: https://ziplora.github.io This paper introduces ZipLoRA, a novel optimization-based method for merging independently trained style and content LoRAs (Low-Rank Adaptations) for text-to-image diffusion models. This allows for the generation of any user-provided subject in any user-provided style, enabling personalized and stylized image creation. This paper is important because it addresses a key limitation in existing text-to-image generation models: the ability to combine specific subjects with specific styles in a controllable and efficient manner. It achieves this by efficiently merging independently trained LoRAs, allowing for versatile and personalized image generation while preserving the subject's identity and desired style. The authors leverage two key insights: (1) sparsity of LoRA weight update matrices and (2) poor performance of directly merging highly aligned LoRA weights. They propose an optimization method that learns to merge style and content LoRAs by minimizing a loss function that encourages both style and subject fidelity while minimizing signal interference between the two LoRAs. ZipLoRA demonstrates superior performance compared to direct merging, joint training, and StyleDrop methods. It shows impressive results in generating stylized images while preserving subject fidelity and allows for control over the extent of stylization. The method also retains the ability to generate individual concepts (subject or style) accurately, demonstrating its versatility. User studies and quantitative metrics further highlight ZipLoRA's effectiveness in achieving personalized stylizations. The authors do not explicitly mention limitations. However, potential areas for future work could include exploring: (1) extension of ZipLoRA to handle multiple styles or subjects, (2) exploring alternative optimization strategies or regularization techniques for more robust merging, and (3) investigating the application of ZipLoRA to other diffusion-based generative tasks beyond image stylization. diffusion_model, lora, stylization, personalization, image_generation, text-to-image, sdxl, dreambooth, styledrop
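The following toy, single-layer sketch illustrates the kind of merge ZipLoRA optimizes: learnable per-column coefficients blend two LoRA weight updates while a cosine penalty discourages interference between them. In the paper the preservation terms are diffusion losses on the subject and style reference data and the optimization spans all LoRA layers; the matrices and activations below are random stand-ins.

```python
import torch

torch.manual_seed(0)
d_out, d_in, rank = 64, 64, 4

# Stand-ins for two independently trained LoRA weight updates (subject and style).
dW_subject = (torch.randn(d_out, rank) @ torch.randn(rank, d_in)) * 0.01
dW_style   = (torch.randn(d_out, rank) @ torch.randn(rank, d_in)) * 0.01

# ZipLoRA-style learnable merger coefficients, one scalar per weight column.
m_subject = torch.ones(d_in, requires_grad=True)
m_style   = torch.ones(d_in, requires_grad=True)
opt = torch.optim.Adam([m_subject, m_style], lr=1e-2)

x_subject = torch.randn(256, d_in)   # stand-in activations from subject data
x_style   = torch.randn(256, d_in)   # stand-in activations from style data

for step in range(300):
    merged = dW_subject * m_subject + dW_style * m_style       # column-wise blend

    # Preserve each LoRA's behaviour on its own data ...
    keep_subject = (x_subject @ (merged - dW_subject).T).pow(2).mean()
    keep_style   = (x_style @ (merged - dW_style).T).pow(2).mean()

    # ... while decorrelating the two scaled updates (interference penalty).
    cos = torch.nn.functional.cosine_similarity(
        dW_subject * m_subject, dW_style * m_style, dim=0)
    loss = keep_subject + keep_style + 0.1 * cos.abs().mean()

    opt.zero_grad(); loss.backward(); opt.step()

print(float(cos.abs().mean()))
```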
2402.01103 Compositional Generative Modeling: A Single Model is Not All You Need Yilun Du, Leslie Kaelbling Large monolithic generative models trained on massive amounts of data have become an increasingly dominant approach in AI research. In this paper, we argue that we should instead construct large generative systems by composing smaller generative models together. We show how such a compositional generative approach enables us to learn distributions in a more data-efficient manner, enabling generalization to parts of the data distribution unseen at training time. We further show how this enables us to program and construct new generative models for tasks completely unseen at training. Finally, we show that in many cases, we can discover separate compositional components from data. This paper argues for a compositional approach to generative modeling, proposing the construction of large generative systems by composing smaller generative models instead of relying solely on large monolithic models. This paper is important because it addresses limitations of current large generative models, such as poor compositionality, data inefficiency, and difficulty in adaptation. The proposed compositional approach offers a more scalable, data-efficient, and generalizable alternative. The authors present a theoretical framework for compositional generative modeling and illustrate its benefits in various domains including image synthesis, trajectory modeling, and planning. They demonstrate how composing simpler models can represent complex distributions more effectively, generalize to unseen data regions, and enable the construction of new generative models for unseen tasks. They also discuss methods for discovering compositional components from data. The paper shows that compositional models are more data-efficient, generalize better to unseen data, and can be composed to solve new tasks. For example, composing models trained on different subsets of data allows for generating hybrid scenes with elements from each subset. Additionally, the paper demonstrates how compositional models can be used for planning, constraint satisfaction, and style adaptation in video generation. The paper acknowledges limitations in implementing compositional sampling with common generative model parameterizations and suggests using Energy-Based Models (EBMs) as a solution. Future work includes developing efficient methods for sampling from joint distributions, discovering compositional structures, and dynamically adapting the structure of generative models under distribution shift. generative_modeling, modularity, compositionality, ebm, diffusion_model, analysis, video, image, planning
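A minimal sketch of the core idea, assuming two toy 1-D Gaussian components: composing generative models as a product amounts to summing their scores, and the composition can be sampled with unadjusted Langevin dynamics. The paper's components are learned EBMs or diffusion models, and it also discusses other composition operators (mixtures, inversions) not shown here.

```python
import torch

# Two toy component distributions: Gaussians centred at -2 and +2.
# Their product concentrates where both assign mass (near 0), which is the
# essence of composing generative models as a product of experts.
def score_a(x):   # grad_x log N(x; -2, 1)
    return -(x + 2.0)

def score_b(x):   # grad_x log N(x; +2, 1)
    return -(x - 2.0)

def composed_score(x):
    return score_a(x) + score_b(x)   # score of the (unnormalised) product

# Unadjusted Langevin sampling from the composed distribution.
x = torch.randn(1000)
step = 0.05
for _ in range(500):
    x = x + step * composed_score(x) + (2 * step) ** 0.5 * torch.randn_like(x)

print(float(x.mean()), float(x.std()))   # roughly mean 0, std ~0.7 (the product N(0, 1/2))
```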
2308.10187 Spiking-Diffusion: Vector Quantized Discrete Diffusion Model with Spiking Neural Networks Mingxuan Liu, Jie Gan, Rui Wen, Tao Li, Yongli Chen, Hong Chen Spiking neural networks (SNNs) have tremendous potential for energy-efficient neuromorphic chips due to their binary and event-driven architecture. SNNs have been primarily used in classification tasks, but limited exploration on image generation tasks. To fill the gap, we propose a Spiking-Diffusion model, which is based on the vector quantized discrete diffusion model. First, we develop a vector quantized variational autoencoder with SNNs (VQ-SVAE) to learn a discrete latent space for images. In VQ-SVAE, image features are encoded using both the spike firing rate and postsynaptic potential, and an adaptive spike generator is designed to restore embedding features in the form of spike trains. Next, we perform absorbing state diffusion in the discrete latent space and construct a spiking diffusion image decoder (SDID) with SNNs to denoise the image. Our work is the first to build the diffusion model entirely from SNN layers. Experimental results on MNIST, FMNIST, KMNIST, Letters, and Cifar10 demonstrate that Spiking-Diffusion outperforms the existing SNN-based generation model. We achieve FIDs of 37.50, 91.98, 59.23, 67.41, and 120.5 on the above datasets respectively, with reductions of 58.60%, 18.75%, 64.51%, 29.75%, and 44.88% in FIDs compared with the state-of-art work. Our code will be available at https://github.com/Arktis2022/Spiking-Diffusion. This paper introduces Spiking-Diffusion, a novel generative model for image generation that utilizes spiking neural networks (SNNs) to achieve both energy efficiency and biological plausibility. This paper is significant because it is the first to successfully implement a diffusion model entirely using SNN layers, opening up new possibilities for energy-efficient and brain-inspired image generation. Previous SNN-based generative models faced limitations in quality and capacity, making this a notable advancement in the field. The authors develop Spiking-Diffusion in two stages: 1) **VQ-SVAE**: They create a Vector Quantized Spiking Variational Autoencoder to learn discrete latent representations of images. This involves encoding image features using spike firing rate (SFR) and postsynaptic potential (PSP), and designing an adaptive spike generator (ASG) to convert embeddings back into spike trains for the decoder. 2) **SDID**: They employ a Spiking Diffusion Image Decoder trained on the discrete latent space. They utilize an absorbing state diffusion process, gradually masking the discrete image representation, and the SDID learns to reverse this process, effectively denoising the image. Spiking-Diffusion outperforms the current state-of-the-art SNN-based generative model (FSVAE) on various image datasets, including MNIST, FMNIST, KMNIST, Letters, and Cifar10. It demonstrates lower reconstruction error (MSE, SSIM) and better-generated image quality (FID, KID). The paper acknowledges the need to explore the training of larger-scale SNN generative models in future work. This suggests scaling up the model and exploring more complex datasets to further validate and improve Spiking-Diffusion's capabilities. diffusion_model, gan, snn, image_generation, vq-vae, neuromorphic, energy_efficient, biological_plausibility
2401.15708 Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding Jianxiang Lu, Cong Xie, Hui Guo As large-scale text-to-image generation models have made remarkable progress in the field of text-to-image generation, many fine-tuning methods have been proposed. However, these models often struggle with novel objects, especially with one-shot scenarios. Our proposed method aims to address the challenges of generalizability and fidelity in an object-driven way, using only a single input image and the object-specific regions of interest. To improve generalizability and mitigate overfitting, in our paradigm, a prototypical embedding is initialized based on the object's appearance and its class, before fine-tuning the diffusion model. And during fine-tuning, we propose a class-characterizing regularization to preserve prior knowledge of object classes. To further improve fidelity, we introduce object-specific loss, which can also use to implant multiple objects. Overall, our proposed object-driven method for implanting new objects can integrate seamlessly with existing concepts as well as with high fidelity and generalization. Our method outperforms several existing works. The code will be released. This paper presents a novel object-driven one-shot fine-tuning method for text-to-image diffusion models, enabling the generation of diverse images with specific objects from a single input image and region of interest. This paper is significant because it addresses the challenges of limited data and object fidelity in personalized text-to-image generation. It allows for efficient object implantation and diverse image synthesis with high fidelity using only one reference image, advancing the field of content creation. The authors leverage prototypical embedding for initialization, class-characterizing regularization to preserve class diversity, and an object-specific loss function to enhance fidelity. They fine-tune a pre-trained stable diffusion model using a single image and its object mask, and compare their method with existing techniques through qualitative and quantitative evaluations. The proposed method outperforms existing one-shot fine-tuning methods in terms of both object fidelity and generalization ability. It effectively mitigates overfitting and allows for the generation of diverse images with the target object while maintaining consistency with text prompts. The method also demonstrates success in multi-object implantation, enabling the creation of compositions with user-specified objects. The authors acknowledge limitations in handling objects with complex edges, which can lead to degraded image quality. They also point out that smaller objects may have reduced fidelity in the generated images. Future work will focus on improving mask acquisition methods and incorporating multi-scale perception mechanisms for objects to address these limitations. diffusion_model, one-shot, fine-tuning, text-to-image, prototypical_embedding, object-driven, fidelity, generalization
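A schematic of the losses described above, under assumed tensor shapes: a standard denoising term, an object-specific term weighted by the region-of-interest mask, and a class-characterizing regularizer that keeps the new token embedding near its class word. The prototypical-embedding initialization and exact weights follow the paper and are not reproduced; `object_finetune_loss` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

def object_finetune_loss(eps_pred, eps_true, obj_mask,
                         new_token_emb, class_token_emb,
                         lambda_obj=1.0, lambda_cls=0.1):
    """Schematic loss for one-shot object implantation.

    eps_pred, eps_true : (B, C, H, W) predicted / target noise in latent space
    obj_mask           : (B, 1, H, W) binary region of interest for the object
    new_token_emb      : (D,) learnable embedding of the new pseudo-word
    class_token_emb    : (D,) frozen embedding of the object's class word
    """
    # Standard denoising loss over the whole latent.
    base = F.mse_loss(eps_pred, eps_true)

    # Object-specific loss: extra weight on the masked object region.
    n_channels = eps_pred.shape[1]
    obj = ((eps_pred - eps_true) ** 2 * obj_mask).sum() \
        / (obj_mask.sum() * n_channels).clamp(min=1.0)

    # Class-characterizing regularization: keep the new embedding close to its class.
    cls_reg = 1.0 - F.cosine_similarity(new_token_emb, class_token_emb, dim=0)

    return base + lambda_obj * obj + lambda_cls * cls_reg

# Dummy shapes just to exercise the function.
B, C, H, W, D = 2, 4, 32, 32, 768
loss = object_finetune_loss(torch.randn(B, C, H, W), torch.randn(B, C, H, W),
                            torch.randint(0, 2, (B, 1, H, W)).float(),
                            torch.randn(D, requires_grad=True), torch.randn(D))
print(loss.item())
```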
2311.10329 High-fidelity Person-centric Subject-to-Image Synthesis Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn the semantic scene and person generation by fine-tuning a common pre-trained diffusion, which involves an irreconcilable training imbalance. Precisely, to generate realistic persons, they need to sufficiently tune the pre-trained model, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation over-fit to the training data. Moreover, even with sufficient fine-tuning, these methods can still not generate high-fidelity persons since joint learning of the scene and person generation also lead to quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline to eliminate the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., Text-driven Diffusion Model (TDM) and Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM respectively. The subject-scene fusion stage, that is the collaboration achieved through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). Specifically, it is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. In each time step, SNF leverages the unique strengths of each model and allows for the spatial blending of predicted noises from both models automatically in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of the Face-diffuser. This paper introduces Face-diffuser, a novel collaborative generation pipeline for subject-driven text-to-image generation that addresses limitations of existing methods in person-centric image synthesis by employing two specialized diffusion models for enhanced scene and person generation. This paper is important because it tackles the training imbalance and quality compromise issues prevalent in current subject-driven image generation models, especially for person-centric synthesis. Face-diffuser's innovative approach enhances the fidelity of generated persons within diverse semantic scenes, advancing the field of personalized image generation. The authors propose Face-diffuser, which utilizes two pre-trained diffusion models: TDM for scene generation and SDM for person generation. The generation process involves three stages: initial scene construction using TDM, subject-scene fusion through a novel Saliency-adaptive Noise Fusion (SNF) mechanism, and final subject enhancement by SDM. SNF leverages classifier-free guidance responses to dynamically allocate regions for each model's contribution during synthesis, enabling seamless collaboration. Face-diffuser demonstrates superior performance in both single- and multi-subject generation tasks, quantitatively outperforming state-of-the-art methods in terms of identity preservation and prompt consistency. 
Qualitative results showcase its ability to generate high-fidelity, coherent images of individuals within diverse contexts, surpassing baselines in preserving subject details and scene semantics. Ablation studies confirm the efficacy of each stage in the pipeline and the superiority of SNF over simpler fusion techniques. Limitations include the potential for privacy concerns due to the close resemblance of generated persons to reference images and challenges in editing attributes of generated individuals. Future work aims to address these limitations and explore attribute editing capabilities. diffusion_model, image_generation, subject-driven, person-centric, saliency, collaborative_generation
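A sketch of Saliency-adaptive Noise Fusion under assumed latent shapes: each model's classifier-free guidance response serves as its saliency map, and a per-pixel softmax over the two maps blends the guided noise predictions. Function and argument names are illustrative; the paper's exact normalization may differ.

```python
import torch

def saliency_adaptive_noise_fusion(eps_tdm_cond, eps_tdm_uncond,
                                   eps_sdm_cond, eps_sdm_uncond,
                                   guidance_scale=7.5, temperature=1.0):
    """Schematic SNF step. Tensors are (B, C, H, W) latent noise predictions
    from the text-driven (TDM) and subject-augmented (SDM) diffusion models."""
    guided_tdm = eps_tdm_uncond + guidance_scale * (eps_tdm_cond - eps_tdm_uncond)
    guided_sdm = eps_sdm_uncond + guidance_scale * (eps_sdm_cond - eps_sdm_uncond)

    # Saliency = magnitude of each model's classifier-free guidance response.
    sal_tdm = (eps_tdm_cond - eps_tdm_uncond).abs().sum(dim=1, keepdim=True)
    sal_sdm = (eps_sdm_cond - eps_sdm_uncond).abs().sum(dim=1, keepdim=True)

    # Per-pixel softmax decides which model drives each spatial location.
    weights = torch.softmax(
        torch.stack([sal_tdm, sal_sdm], dim=0) / temperature, dim=0)

    return weights[0] * guided_tdm + weights[1] * guided_sdm

# Dummy call with random latents.
shape = (1, 4, 64, 64)
eps = [torch.randn(shape) for _ in range(4)]
fused = saliency_adaptive_noise_fusion(*eps)
print(fused.shape)
```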
2404.03673 RL for Consistency Models: Faster Reward Guided Text-to-Image Generation Owen Oertell, Jonathan D. Chang, Yiyi Zhang, Kianté Brantley, Wen Sun Reinforcement learning (RL) has improved guided image generation with diffusion models by directly optimizing rewards that capture image quality, aesthetics, and instruction following capabilities. However, the resulting generative policies inherit the same iterative sampling process of diffusion models that causes slow generation. To overcome this limitation, consistency models proposed learning a new class of generative models that directly map noise to data, resulting in a model that can generate an image in as few as one sampling iteration. In this work, to optimize text-to-image generative models for task specific rewards and enable fast training and inference, we propose a framework for fine-tuning consistency models via RL. Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure. RLCM improves upon RL fine-tuned diffusion models on text-to-image generation capabilities and trades computation during inference time for sample quality. Experimentally, we show that RLCM can adapt text-to-image consistency models to objectives that are challenging to express with prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Comparing to RL finetuned diffusion models, RLCM trains significantly faster, improves the quality of the generation measured under the reward objectives, and speeds up the inference procedure by generating high quality images with as few as two inference steps. Our code is available at https://rlcm.owenoertell.com This paper introduces RLCM, a novel framework for enhancing text-to-image consistency models by leveraging reinforcement learning to optimize for specific reward functions, resulting in faster training and inference compared to diffusion models. The paper addresses limitations in text-to-image generation using diffusion models, such as difficulty in aligning with specific prompts and slow inference speed. It leverages consistency models, which offer faster generation, and proposes an RL-based approach to fine-tune them for better alignment with downstream tasks. The authors formulate the iterative inference of a consistency model as a Markov Decision Process (MDP) with a shorter horizon compared to diffusion models. They utilize a policy gradient algorithm, RLCM, to optimize the consistency model's policy by maximizing rewards associated with desired image properties. Experiments compare RLCM to DDPO (an RL method for diffusion models) on tasks like image compressibility, aesthetics, and prompt alignment. RLCM demonstrates faster training and inference than DDPO while achieving comparable or better image quality across various tasks. Notably, RLCM shows a 17x speedup in training time on the aesthetic task. Ablation studies highlight the trade-off between inference time and image quality achievable by adjusting the number of inference steps in RLCM. The authors acknowledge limitations such as the use of sparse rewards in the current policy gradient method and suggest exploring dense reward strategies. Future work could also focus on developing loss functions that reinforce consistency, potentially further improving inference speed. diffusion_model, consistency_model, rl, text-to-image, inference, optimization, aesthetic, image_generation
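The sketch below shows the short-horizon MDP framing in a plain REINFORCE form: each consistency-model inference step is an action with a Gaussian log-probability, and the terminal reward weights the summed log-probabilities. RLCM itself uses a clipped policy-gradient update on a pretrained latent consistency model with a learned reward; `ToyConsistencyPolicy` and `toy_reward` are stand-ins.

```python
import torch
import torch.nn as nn

# Toy "consistency model" policy: maps a noisy latent and step index to the
# next (partially denoised) latent. A real RLCM policy wraps a pretrained
# text-to-image consistency model instead.
class ToyConsistencyPolicy(nn.Module):
    def __init__(self, dim=16, sigma=0.1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))
        self.sigma = sigma

    def step(self, x, k):
        t = torch.full_like(x[:, :1], float(k))
        mean = self.net(torch.cat([x, t], dim=-1))
        dist = torch.distributions.Normal(mean, self.sigma)
        action = dist.sample()
        return action, dist.log_prob(action).sum(-1)

def toy_reward(x):
    # Stand-in for a learned reward (aesthetics, compressibility, ...):
    # here, prefer latents close to the all-ones vector.
    return -(x - 1.0).pow(2).mean(dim=-1)

policy = ToyConsistencyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for it in range(200):
    x = torch.randn(32, 16)                    # start from noise
    logps = []
    for k in range(2):                         # short-horizon MDP: few inference steps
        x, logp = policy.step(x, k)
        logps.append(logp)
    reward = toy_reward(x)
    # REINFORCE-style objective on the terminal reward.
    loss = -(reward.detach() * torch.stack(logps, dim=0).sum(0)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(float(reward.mean()))
```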
2402.14792 Consolidating Attention Features for Multi-view Image Editing Or Patashnik, Rinon Gal, Daniel Cohen-Or, Jun-Yan Zhu, Fernando De la Torre Large-scale text-to-image models enable a wide range of image editing techniques, using text prompts or even spatial controls. However, applying these editing methods to multi-view images depicting a single scene leads to 3D-inconsistent results. In this work, we focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views. We build on two insights: (1) maintaining consistent features throughout the generative process helps attain consistency in multi-view editing, and (2) the queries in self-attention layers significantly influence the image structure. Hence, we propose to improve the geometric consistency of the edited images by enforcing the consistency of the queries. To do so, we introduce QNeRF, a neural radiance field trained on the internal query features of the edited images. Once trained, QNeRF can render 3D-consistent queries, which are then softly injected back into the self-attention layers during generation, greatly improving multi-view consistency. We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps. We compare our method to a range of existing techniques and demonstrate that it can achieve better multi-view consistency and higher fidelity to the input scene. These advantages allow us to train NeRFs with fewer visual artifacts, that are better aligned with the target geometry. This paper introduces a method for consistent multi-view image editing, focusing on geometric manipulations like articulations and shape changes using spatial controls and a novel query feature space neural radiance field called QNeRF. This work addresses the limitations of existing multi-view image editing techniques that struggle with consistent geometric modifications across multiple views, offering a solution for more realistic and high-fidelity edits. The authors leverage ControlNet and a pre-trained Stable Diffusion model to edit images based on spatial controls. They introduce QNeRF, trained on query features from self-attention layers, to progressively consolidate these features during denoising, ensuring consistency across views. The proposed method achieves greater visual quality and consistency in multi-view edits compared to baseline methods like InstructNeRF2NeRF and TokenFlow, as demonstrated through qualitative results, KID and FID scores, and user preference evaluations. It allows for training NeRFs with fewer artifacts and better alignment to the target geometry. Limitations include difficulties in generating highly detailed structures like hands, potential for hallucinating inconsistent details in complex objects, and reliance on a black-box optimizer for QNeRF training. Future work could explore robust statistics for QNeRF optimization, alternative 3D representations like Gaussian Splats, and addressing the limitations inherited from text-to-image models. diffusion_model, nerf, 3d, multi-view, image_editing, geometric_editing, consistency, self-attention
2402.16828 Training Neural Networks from Scratch with Parallel Low-Rank Adapters Minyoung Huh, Brian Cheung, Jeremy Bernstein, Phillip Isola, Pulkit Agrawal The scalability of deep learning models is fundamentally limited by computing resources, memory, and communication. Although methods like low-rank adaptation (LoRA) have reduced the cost of model finetuning, its application in model pre-training remains largely unexplored. This paper explores extending LoRA to model pre-training, identifying the inherent constraints and limitations of standard LoRA in this context. We introduce LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm designed to enable parallel training of multiple low-rank heads across computing nodes, thereby reducing the need for frequent synchronization. Our approach includes extensive experimentation on vision transformers using various vision datasets, demonstrating that LTE is competitive with standard pre-training. This paper introduces LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm for training neural networks from scratch using parallel low-rank adapters, addressing the limitations of standard low-rank adaptation in model pre-training. This paper is important because it tackles the challenge of pre-training large models with limited computing resources by leveraging low-rank adaptations, potentially enabling training on less powerful devices and reducing communication bottlenecks. The authors propose LTE, which trains multiple low-rank adapter heads in parallel on different data shards with infrequent synchronization, periodically merging their updates into the main model weights to reduce communication overhead. LTE demonstrates competitive performance compared to standard pre-training across various vision tasks and datasets, achieving comparable accuracy with potential for memory and communication efficiency. Limitations include slower convergence in the later stages of training and the need for further investigation into optimal hyperparameter selection, such as rank and number of heads. Future work involves exploring dynamic rank and head allocation, heterogeneous LoRA parameterization, and advanced merging strategies. lora, pre-training, parameter-efficient, parallel_training, vision_transformer, distributed_training, optimization
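A sketch of the periodic merge at the heart of the bi-level loop, under assumed shapes: the independently trained LoRA heads are averaged into the frozen main weight and then reset before the next round of parallel training. Optimizer-state handling, merge frequency, and scaling choices follow the paper and are omitted; `merge_parallel_lora_heads` is a hypothetical helper.

```python
import torch

def merge_parallel_lora_heads(W, heads, scale=1.0):
    """Fold N independently trained LoRA heads into the frozen main weight.

    W     : (d_out, d_in) main weight matrix, updated in place
    heads : list of (B, A) pairs with B: (d_out, r), A: (r, d_in)
    After merging, each head is reset so the next round of parallel training
    starts from the new main weights (LTE-style outer loop).
    """
    with torch.no_grad():
        delta = sum(B @ A for B, A in heads) / len(heads)
        W += scale * delta
        for B, A in heads:
            B.zero_()                                  # so the merged update is not re-applied
            torch.nn.init.kaiming_uniform_(A, a=5 ** 0.5)
    return W

d_out, d_in, r, n_heads = 64, 64, 8, 4
W = torch.randn(d_out, d_in)
heads = [(torch.randn(d_out, r) * 0.01, torch.randn(r, d_in)) for _ in range(n_heads)]
merge_parallel_lora_heads(W, heads, scale=4.0)
print(W.shape)
```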
2310.14729 MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion Roy Kapon, Guy Tevet, Daniel Cohen-Or, Amit H. Bermano We introduce Multi-view Ancestral Sampling (MAS), a method for 3D motion generation, using 2D diffusion models that were trained on motions obtained from in-the-wild videos. As such, MAS opens opportunities to exciting and diverse fields of motion previously under-explored as 3D data is scarce and hard to collect. MAS works by simultaneously denoising multiple 2D motion sequences representing different views of the same 3D motion. It ensures consistency across all views at each diffusion step by combining the individual generations into a unified 3D sequence, and projecting it back to the original views. We demonstrate MAS on 2D pose data acquired from videos depicting professional basketball maneuvers, rhythmic gymnastic performances featuring a ball apparatus, and horse races. In each of these domains, 3D motion capture is arduous, and yet, MAS generates diverse and realistic 3D sequences. Unlike the Score Distillation approach, which optimizes each sample by repeatedly applying small fixes, our method uses a sampling process that was constructed for the diffusion framework. As we demonstrate, MAS avoids common issues such as out-of-domain sampling and mode-collapse. https://guytevet.github.io/mas-page/ This paper introduces Multi-view Ancestral Sampling (MAS), a novel method for generating 3D human and animal motions using 2D diffusion models trained on in-the-wild videos. This research is significant as it allows for 3D motion generation in domains where acquiring 3D data is expensive or impractical, such as basketball, horse racing, and rhythmic gymnastics. The authors first train a 2D motion diffusion model on poses extracted from videos. Then, they utilize MAS, which simultaneously generates multiple 2D views of a 3D motion via ancestral sampling, ensuring consistency across views by triangulating the generated 2D poses into a 3D motion at each denoising step. MAS successfully generates diverse and realistic 3D motions, outperforming existing pose lifting methods and a DreamFusion adaptation for unconditional motion generation. The method's reliance on ancestral sampling results in faster generation times and avoids common issues like out-of-distribution sampling and mode collapse. Limitations include occasional character self-intersection and scale inconsistencies. Future work could address predicting global position, enabling textual control, and extending the method to multi-person interactions, hand and face motions, and complex object manipulations. diffusion_model, 3d, motion, video, analysis, motion_generation
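A toy, single-pose sketch of the per-step consolidation: each 2D view is denoised independently, the per-view predictions are triangulated into one 3D pose by least squares, and the reprojections seed the next ancestral step. The real method operates on full motion sequences with a trained 2D diffusion model and calibrated cameras; every component below (denoiser, schedule, cameras) is a stand-in.

```python
import torch

J, V, T_steps = 17, 3, 50
Rs = [torch.linalg.qr(torch.randn(3, 3))[0] for _ in range(V)]   # random camera rotations
P = [R[:2] for R in Rs]                                          # 2x3 orthographic projections

def toy_denoise(x2d, t):
    # Stand-in for the 2D diffusion model's clean-pose prediction for one view.
    return x2d * (1.0 - t / T_steps)

def triangulate(preds2d):
    # Least-squares 3D joints consistent with all per-view 2D predictions.
    A = torch.cat(P, dim=0)                       # (2V, 3)
    b = torch.cat(preds2d, dim=0)                 # (2V, J)
    return torch.linalg.lstsq(A, b).solution      # (3, J)

views = [torch.randn(2, J) for _ in range(V)]     # start every view from noise
for t in reversed(range(1, T_steps + 1)):
    preds = [toy_denoise(x, t) for x in views]            # per-view predictions
    joints3d = triangulate(preds)                         # consolidate into one 3D pose
    consistent = [p @ joints3d for p in P]                # project back to every view
    sigma = 0.1 * t / T_steps
    views = [c + sigma * torch.randn_like(c) for c in consistent]   # ancestral step

print(joints3d.shape)   # torch.Size([3, 17])
```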
2312.02663 FaceStudio: Put Your Face Everywhere in Seconds Yuxuan Yan, Chi Zhang, Rui Wang, Yichao Zhou, Gege Zhang, Pei Cheng, Gang Yu, Bin Fu This study investigates identity-preserving image synthesis, an intriguing task in image generation that seeks to maintain a subject's identity while adding a personalized, stylistic touch. Traditional methods, such as Textual Inversion and DreamBooth, have made strides in custom image creation, but they come with significant drawbacks. These include the need for extensive resources and time for fine-tuning, as well as the requirement for multiple reference images. To overcome these challenges, our research introduces a novel approach to identity-preserving synthesis, with a particular focus on human images. Our model leverages a direct feed-forward mechanism, circumventing the need for intensive fine-tuning, thereby facilitating quick and efficient image generation. Central to our innovation is a hybrid guidance framework, which combines stylized images, facial images, and textual prompts to guide the image generation process. This unique combination enables our model to produce a variety of applications, such as artistic portraits and identity-blended images. Our experimental results, including both qualitative and quantitative evaluations, demonstrate the superiority of our method over existing baseline models and previous works, particularly in its remarkable efficiency and ability to preserve the subject's identity with high fidelity. This paper introduces a novel, tuning-free method for identity-preserving image synthesis, focusing on efficiently generating human images in various styles while maintaining individual identities using a hybrid guidance framework combining style images, facial images, and text prompts. This paper addresses limitations in existing identity-preserving image synthesis methods, which often require resource-intensive fine-tuning and multiple reference images. The proposed method offers a faster and more efficient alternative by using a direct feed-forward approach and hybrid guidance, enabling diverse applications like artistic portrait creation and identity blending. The authors develop a hybrid guidance framework that combines style images, facial images, and text prompts to guide a latent diffusion model. They extract identity features from facial images using Arcface and combine them with text embeddings from a prior model trained to map CLIP text embeddings to vision embeddings. A multi-identity cross-attention mechanism is introduced to handle multiple identities within a single image, ensuring each individual's features are correctly mapped. The model is trained on a human image reconstruction task, using masked images as style input and cropped faces as identity input. The proposed method demonstrates superior performance in preserving identities during image synthesis compared to baseline models like DreamBooth and Textual Inversion, achieving higher face similarity scores in both single- and multi-image settings. The ablation study confirms the significance of the identity input for maintaining identity fidelity. The model also exhibits strong performance in novel view synthesis, effectively generating images with large pose changes while preserving identity. Furthermore, the method demonstrates successful identity mixing and multi-human image generation with accurate identity mapping. 
The authors acknowledge that compared to methods like DreamBooth, their model is currently limited to human image generation. As future work, they plan to extend its capabilities to encompass a wider range of subjects, including animals and objects. Additionally, they recognize the ethical considerations and potential for misuse, such as copyright infringement and the creation of inappropriate content. The authors emphasize the importance of responsible use and the establishment of guidelines to mitigate these risks. diffusion_model, image_synthesis, identity_preserving, hybrid_guidance, text-to-image, multi-identity, tuning-free, face_recognition, novel_view_synthesis
2404.05729 Finding Visual Task Vectors Alberto Hojel, Yutong Bai, Trevor Darrell, Amir Globerson, Amir Bar Visual Prompting is a technique for teaching models to perform a visual task via in-context examples, without any additional training. In this work, we analyze the activations of MAE-VQGAN, a recent Visual Prompting model, and find task vectors, activations that encode task-specific information. Equipped with this insight, we demonstrate that it is possible to identify the task vectors and use them to guide the network towards performing different tasks without providing any input-output examples. To find task vectors, we compute the average intermediate activations per task and use the REINFORCE algorithm to search for the subset of task vectors. The resulting task vectors guide the model towards performing a task better than the original model without the need for input-output examples. This paper investigates the existence and identification of "task vectors" in visual prompting models, specifically focusing on MAE-VQGAN. The authors propose a method to identify these task-specific activations and demonstrate that patching them into the model enables zero-shot task performance comparable to or exceeding the original one-shot in-context learning. This paper is significant as it sheds light on the inner workings of visual in-context learning, a relatively new and less understood area compared to its NLP counterpart. Identifying and leveraging task vectors could lead to more efficient and adaptable visual prompting models, reducing the reliance on extensive in-context examples. The authors first analyze MAE-VQGAN activations to identify potential task vectors by measuring their variance across different tasks and invariance within a task. Then, they employ a REINFORCE algorithm to search for the optimal subset of task vectors that minimize the task-specific loss when patched into the model. They evaluate their method on various image-to-image tasks using the Pascal-5i dataset. The paper shows that task vectors do exist in visual prompting models and can be effectively identified. Patching the identified task vectors allows MAE-VQGAN to perform tasks in a zero-shot manner, achieving comparable or even superior performance to the original one-shot prompting on tasks like foreground segmentation, low-light enhancement, in-painting, and colorization. The results also suggest that task vectors are distributed throughout the encoder and decoder of the network. The authors acknowledge limitations in exploring other potential vector types, such as those encoding image structure and positional information. They also point to the possibility of directly evaluating the model in the VQGAN token space for potentially more accurate results. Future work could involve investigating these aspects further, as well as exploring the generalization of task vectors across different datasets and models. diffusion_model, visual_prompting, in-context_learning, analysis, task_vectors, zero-shot, mae, vqgan, attention, reinforce
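A minimal sketch of the patching mechanism on a toy network: candidate task vectors are mean activations collected over task examples, and at inference a chosen subset of layers has its activations overwritten by those vectors. The REINFORCE search over which positions to patch, and the MAE-VQGAN specifics, are omitted.

```python
import torch
import torch.nn as nn

# Toy network standing in for MAE-VQGAN.
torch.manual_seed(0)
layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(4)])

def forward(x, patches=None):
    for i, layer in enumerate(layers):
        x = torch.relu(layer(x))
        if patches is not None and i in patches:
            x = patches[i].expand_as(x)       # overwrite activation with the task vector
    return x

# 1) Collect candidate task vectors: mean activation per layer over task examples.
task_examples = torch.randn(64, 32)
acts = {}
x = task_examples
for i, layer in enumerate(layers):
    x = torch.relu(layer(x))
    acts[i] = x.mean(dim=0, keepdim=True)

# 2) Patch a chosen subset of layers at inference (the paper searches this
#    subset with REINFORCE; here it is fixed by hand).
chosen = {1: acts[1], 2: acts[2]}
query = torch.randn(8, 32)
out = forward(query, patches=chosen)
print(out.shape)
```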
2310.12274 An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare Textural Inversion, a prompt learning method, learns a singular embedding for a new "word" to represent image style and appearance, allowing it to be integrated into natural language sentences to generate novel synthesised images. However, identifying and integrating multiple object-level concepts within one scene poses significant challenges even when embeddings for individual concepts are attainable. This is further confirmed by our empirical tests. To address this challenge, we introduce a framework for Multi-Concept Prompt Learning (MCPL), where multiple new "words" are simultaneously learned from a single sentence-image pair. To enhance the accuracy of word-concept correlation, we propose three regularisation techniques: Attention Masking (AttnMask) to concentrate learning on relevant areas; Prompts Contrastive Loss (PromptCL) to separate the embeddings of different concepts; and Bind adjective (Bind adj.) to associate new "words" with known words. We evaluate via image generation, editing, and attention visualisation with diverse images. Extensive quantitative comparisons demonstrate that our method can learn more semantically disentangled concepts with enhanced word-concept correlation. Additionally, we introduce a novel dataset and evaluation protocol tailored for this new task of learning object-level concepts. This paper introduces Multi-Concept Prompt Learning (MCPL), a method for learning multiple textural embeddings (new "words") in text-to-image diffusion models, which represent distinct object-level concepts within a single image. This paper addresses a significant limitation in existing textural inversion techniques, which struggle to learn and compose multiple concepts from a single image, hindering their application in complex multi-object editing and generation tasks. The authors propose MCPL, building upon Textural Inversion, to jointly learn multiple embeddings by optimizing the diffusion model loss on a single image with multiple learnable prompts. To enhance object-level concept learning, they introduce three regularization techniques: Attention Masking to focus learning on relevant image regions, Prompts Contrastive Loss to separate embeddings of different concepts, and binding learnable prompts with adjectives to leverage pre-trained knowledge. Experiments on natural and biomedical image datasets demonstrate that MCPL, particularly with all the proposed regularizations, effectively learns disentangled object-level embeddings, outperforming existing techniques in terms of concept separation and fidelity to both text prompts and image regions. The approach enables more accurate object-level synthesis, editing, and understanding of multi-object relationships. The paper acknowledges limitations in the estimation of "ground truth" embeddings using masks and suggests exploring alternative evaluation metrics beyond those used for single-concept learning. Future work includes exploring better prompt selection strategies and extending MCPL to handle a larger number of concepts within a scene. diffusion_model, textural_inversion, prompt_learning, multi-concept, object-level, attention_mechanism, contrastive_learning, image_generation, image_editing, disentanglement
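The sketch below illustrates two of the regularizers in spirit: an attention-masked denoising loss that restricts learning to regions where the new tokens attend, and a simplified repulsion term standing in for the paper's contrastive loss over concept embeddings. Shapes, thresholds, and weights are illustrative assumptions.

```python
import torch

def attn_masked_diffusion_loss(eps_pred, eps_true, attn_maps, threshold=0.3):
    """AttnMask-style loss: restrict the denoising loss to regions attended
    by the learnable prompt tokens.

    eps_pred, eps_true : (B, C, H, W) predicted / target noise
    attn_maps          : (B, K, H, W) cross-attention maps of the K new "words"
    """
    mask = (attn_maps.max(dim=1, keepdim=True).values > threshold).float()
    return ((eps_pred - eps_true) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)

def prompts_separation_loss(concept_embs, temperature=0.1):
    """Simplified stand-in for PromptCL: penalise pairwise similarity between
    the embeddings of different object-level concepts. concept_embs: (K, D)."""
    z = torch.nn.functional.normalize(concept_embs, dim=-1)
    sim = z @ z.T / temperature
    off_diag = sim - torch.diag(torch.diag(sim))
    return off_diag.exp().sum() / (z.shape[0] * (z.shape[0] - 1))

B, C, H, W, K, D = 2, 4, 32, 32, 2, 768
loss = attn_masked_diffusion_loss(torch.randn(B, C, H, W), torch.randn(B, C, H, W),
                                  torch.rand(B, K, H, W))
loss = loss + 0.01 * prompts_separation_loss(torch.randn(K, D, requires_grad=True))
print(loss.item())
```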
2311.12908 Diffusion Model Alignment Using Direct Preference Optimization Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences. In contrast to LLMs, human preference learning has not been widely explored in text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model using carefully curated high quality images and captions to improve visual appeal and text alignment. We propose Diffusion-DPO, a method to align diffusion models to human preferences by directly optimizing on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO), a simpler alternative to RLHF which directly optimizes a policy that best satisfies human preferences under a classification objective. We re-formulate DPO to account for a diffusion model notion of likelihood, utilizing the evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. Our fine-tuned base model significantly outperforms both base SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement model in human evaluation, improving visual appeal and prompt alignment. We also develop a variant that uses AI feedback and has comparable performance to training on human preferences, opening the door for scaling of diffusion model alignment methods. This paper introduces Diffusion-DPO, a new method for aligning text-to-image diffusion models with human preferences by directly optimizing the model on pairwise comparison data, adapting the Direct Preference Optimization (DPO) technique from language models. This paper is significant because it bridges the gap in aligning diffusion models to human preferences, similar to advancements made with Large Language Models (LLMs), leading to improved visual appeal and text alignment in generated images. The authors adapted DPO to diffusion models by defining a notion of data likelihood under the model and using the evidence lower bound (ELBO) to derive a differentiable objective. They demonstrate Diffusion-DPO by fine-tuning state-of-the-art text-to-image diffusion models like Stable Diffusion XL (SDXL) on the Pick-a-Pic dataset, and evaluating performance through human evaluation and automated metrics. Diffusion-DPO significantly improves both visual appeal and prompt alignment in generated images, outperforming even the larger SDXL model with a refinement stage. The authors also demonstrate the effectiveness of learning from AI feedback using Diffusion-DPO, offering a potential for scaling this alignment method. Limitations include ethical considerations related to potential biases in web-collected data and user preferences. Future work involves dataset cleaning and scaling, online learning methods for DPO, and personalized tuning for individual or group preferences. diffusion_model, dpo, alignment, human_preference, image_generation, ai_feedback, stable_diffusion, sdxl
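A schematic of the Diffusion-DPO objective for a single noised (preferred, rejected) pair: per-sample denoising errors of the fine-tuned and frozen reference models take the place of log-likelihoods in the DPO classification loss. The per-timestep weighting is folded into a single illustrative β, and tensor shapes are assumed.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_theta_w, eps_ref_w, eps_theta_l, eps_ref_l,
                       eps_true_w, eps_true_l, beta=5000.0):
    """Schematic Diffusion-DPO loss.

    eps_theta_* : noise predictions of the model being fine-tuned
    eps_ref_*   : noise predictions of the frozen reference model
    eps_true_*  : the noise actually added to the preferred (w) / rejected (l) latents
    """
    def err(pred, true):                          # per-sample denoising error
        return (pred - true).pow(2).mean(dim=(1, 2, 3))

    diff_w = err(eps_theta_w, eps_true_w) - err(eps_ref_w, eps_true_w)
    diff_l = err(eps_theta_l, eps_true_l) - err(eps_ref_l, eps_true_l)
    # Push the fine-tuned model to lower its error on the preferred image,
    # relative to the reference, more than it does on the rejected image.
    return -F.logsigmoid(-beta * (diff_w - diff_l)).mean()

B, C, H, W = 2, 4, 64, 64
t = [torch.randn(B, C, H, W) for _ in range(6)]
print(diffusion_dpo_loss(*t).item())
```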
2310.10971 Context-Aware Meta-Learning Christopher Fifty, Dennis Duan, Ronald G. Junkins, Ehsan Amid, Jure Leskovec, Christopher Re, Sebastian Thrun Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this ability, and instead either perform poorly or require meta-training and/or fine-tuning on similar objects. In this work, we propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning. Our approach leverages a frozen pre-trained feature extractor, and analogous to in-context learning, recasts visual meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label. On 8 out of 11 meta-learning benchmarks, our approach -- without meta-training or fine-tuning -- exceeds or matches the state-of-the-art algorithm, P>M>F, which is meta-trained on these benchmarks. Our code is available at https://github.com/cfifty/CAML. This paper introduces Context-Aware Meta-Learning (CAML), a novel meta-learning algorithm for few-shot image classification that draws inspiration from in-context learning in Large Language Models (LLMs) to learn new visual concepts during inference without fine-tuning. This paper is important because it addresses the limitations of existing visual meta-learning algorithms that are either slow due to fine-tuning requirements or exhibit poor generalization to unseen tasks. The proposed CAML method offers a promising solution for real-time and generalizable few-shot image classification, potentially unlocking new applications in computer vision similar to the advancements in natural language processing enabled by in-context learning in LLMs. The authors propose a novel meta-learning algorithm, CAML, that leverages a frozen pre-trained feature extractor, an Equal Length and Maximally Equiangular Set (ELMES) class encoder, and a non-causal sequence model. The method encodes images and labels, forming a sequence that is processed by the non-causal sequence model to predict the query image's label. CAML is pre-trained on diverse few-shot image classification tasks, avoiding the need for meta-training or fine-tuning during inference. The authors theoretically demonstrate that using an ELMES class encoder maximizes the model's ability to identify classes within the support set. They evaluate CAML on 11 few-shot image classification benchmarks, comparing its performance against existing meta-learning methods in a universal setting. CAML achieves state-of-the-art performance in universal meta-learning, outperforming other baselines on 14 out of 22 evaluation settings. Remarkably, it performs comparably to P>M>F, the current best meta-learning algorithm, on 8 out of 11 benchmarks, even though P>M>F is meta-trained on the specific benchmark datasets. This suggests that visual in-context learning during inference can be as effective as meta-training on in-domain data. The paper also provides analysis showing CAML's capability to dynamically update representations based on the query and support set context, enabling it to perform well on diverse tasks. The paper acknowledges limitations in handling highly out-of-distribution images and varying image resolutions. Future work could focus on improving robustness in these areas. 
Additionally, the current implementation requires knowing the maximum number of classes during pre-training. Exploring methods to overcome this limitation and enable more flexible class handling during inference would be beneficial. diffusion_model, llm, analysis, few-shot learning, image classification, meta-learning, in-context learning, universal meta-learning
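Since the ELMES class encoder is the most self-contained piece of CAML, here is a small sketch of one standard construction: a regular simplex of unit vectors with pairwise cosine similarity -1/(k-1), i.e., equal length and maximally equiangular. In CAML these fixed label codes are concatenated with frozen image features and fed to the non-causal sequence model, which is not shown.

```python
import torch

def elmes(num_classes: int, dim: int) -> torch.Tensor:
    """Equal Length and Maximally Equiangular Set: a regular simplex of
    `num_classes` unit vectors embedded in R^dim."""
    assert dim >= num_classes, "need enough dimensions to embed the simplex"
    basis = torch.eye(num_classes)
    centered = basis - basis.mean(dim=0, keepdim=True)       # remove the common mean
    simplex = centered / centered.norm(dim=1, keepdim=True)  # equal-length vectors
    out = torch.zeros(num_classes, dim)
    out[:, :num_classes] = simplex                           # pad into the model width
    return out

codes = elmes(num_classes=5, dim=768)
gram = codes @ codes.T
print(gram.diag())     # all ones (equal length)
print(gram[0, 1:])     # all -0.25 = -1/(5-1) (maximally equiangular)
```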
2312.13286 Generative Multimodal Models are In-Context Learners Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions), is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research. The paper introduces Emu2, a 37B parameter generative multimodal model trained on a massive dataset of text and image-text pairs, demonstrating strong in-context learning capabilities in multimodal tasks. This work is important as it presents a significant step towards building adaptable and general-purpose multimodal systems capable of solving diverse tasks with minimal task-specific training. The authors trained Emu2 using a unified autoregressive objective to predict the next multimodal element (visual embedding or text token) in a sequence, leveraging a large-scale dataset of text, image-text pairs, and interleaved image-text-video data. They further enhance the model for instruction following and controllable visual generation through instruction tuning on dedicated datasets. Emu2 achieves state-of-the-art performance on various multimodal benchmarks, including visual question answering, image captioning, and text-to-image generation. It exhibits strong few-shot learning capabilities, improving with more in-context examples. The model also demonstrates emergent abilities like visual prompting and object-grounded generation. The authors acknowledge limitations regarding potential biases in training data and the possibility of generating harmful content. Future work includes enhancing robustness, reducing hallucinations, improving fairness, and addressing the performance gap with closed multimodal systems in complex reasoning tasks. multimodal, llm, in-context_learning, autoregressive, instruction_tuning, text-to-image, visual_question_answering
2308.07926 CoDeF: Content Deformation Fields for Temporally Consistent Video Processing Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, Yujun Shen We present the content deformation field CoDeF as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis.Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline.We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video.With such a design, CoDeF naturally supports lifting image algorithms for video processing, in the sense that one can apply an image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal deformation field.We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training.More importantly, thanks to our lifting strategy that deploys the algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches, and even manage to track non-rigid objects like water and smog.Project page can be found at https://qiuyu96.github.io/CoDeF/. This paper introduces Content Deformation Fields (CoDeF), a novel video representation comprising a canonical content field for static content and a temporal deformation field tracking transformations. This representation facilitates applying image algorithms to videos for temporally consistent video processing. This paper is important as it bridges the gap between advanced image processing algorithms and video processing, offering a method for temporally consistent video editing and manipulation that surpasses previous techniques in quality and efficiency. The authors employ a 2D hash-based image field for the canonical content and a 3D hash-based field for temporal deformation, trained through a rendering pipeline. They introduce techniques like annealed hash encoding and flow-guided consistency loss to ensure semantic correctness and smoothness. The system is evaluated on tasks like video reconstruction, translation, keypoint tracking, object tracking, and super-resolution. CoDeF achieves superior video reconstruction quality with a 4.4 dB higher PSNR than Neural Image Atlas and significantly faster training (5 minutes vs. 10 hours). It effectively lifts image algorithms to video tasks, demonstrating superior temporal consistency in video-to-video translation, keypoint tracking on non-rigid objects, and object tracking compared to previous methods. The paper acknowledges limitations regarding per-scene optimization, challenges with extreme viewpoint changes, and handling large non-rigid deformations. Future work may explore feed-forward implicit field techniques, 3D prior knowledge integration, and using multiple canonical images to address these limitations. diffusion_model, video, motion, video_editing, representation_learning, temporal_consistency
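A toy sketch of the representation: a canonical 2D content field and a 3D temporal deformation field, here plain MLPs standing in for the paper's hash-grid encodings, composed so that each frame pixel looks up static content at a deformed canonical coordinate. The flow-guided consistency and annealed-encoding regularizers are omitted.

```python
import torch
import torch.nn as nn

# Toy MLPs standing in for CoDeF's hash-grid fields: a canonical 2D content
# field (pixel coords -> RGB) and a 3D temporal deformation field
# ((x, y, t) -> offset into canonical coordinates).
canonical = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3), nn.Sigmoid())
deform    = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))

def render_frame(t, H=32, W=32):
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W),
                            indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)          # (HW, 2)
    t_col = torch.full((coords.shape[0], 1), float(t))
    offsets = deform(torch.cat([coords, t_col], dim=-1))           # per-pixel deformation
    canon_coords = coords + offsets                                # map into canonical space
    rgb = canonical(canon_coords)                                  # look up static content
    return rgb.reshape(H, W, 3)

# Training would minimise a reconstruction loss against the real video frames.
frame = render_frame(t=0.5)
loss = (frame - torch.rand_like(frame)).pow(2).mean()
loss.backward()
print(frame.shape)
```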
2308.07863 StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models Zhizhong Wang, Lei Zhao, Wei Xing Content and style (C-S) disentanglement is a fundamental problem and critical challenge of style transfer. Existing approaches based on explicit definitions (e.g., Gram matrix) or implicit learning (e.g., GANs) are neither interpretable nor easy to control, resulting in entangled representations and less satisfying results. In this paper, we propose a new C-S disentangled framework for style transfer without using previous assumptions. The key insight is to explicitly extract the content information and implicitly learn the complementary style information, yielding interpretable and controllable C-S disentanglement and style transfer. A simple yet effective CLIP-based style disentanglement loss coordinated with a style reconstruction prior is introduced to disentangle C-S in the CLIP image space. By further leveraging the powerful style removal and generative ability of diffusion models, our framework achieves superior results than state of the art and flexible C-S disentanglement and trade-off control. Our work provides new insights into the C-S disentanglement in style transfer and demonstrates the potential of diffusion models for learning well-disentangled C-S characteristics. This paper presents StyleDiffusion, a novel content-style disentangled framework for artistic style transfer that leverages diffusion models for explicit content extraction and implicit style learning, enabling interpretable and controllable style transfer. This paper is significant as it addresses limitations of existing style transfer methods that rely on explicit style definitions (e.g., Gram matrix) or implicit learning (e.g., GANs) which often result in entangled representations. The proposed method achieves superior style transfer results with better content preservation, fine style details, and flexible disentanglement control. The authors introduce a diffusion-based style removal module to extract domain-aligned content information and a diffusion-based style transfer module to learn disentangled style from a single style image. A CLIP-based style disentanglement loss, combined with a style reconstruction prior, is used to guide the learning process in the CLIP image space. StyleDiffusion demonstrates impressive qualitative and quantitative results, outperforming SOTA methods in terms of content preservation (SSIM), style similarity (CLIP Score), and user preference. The framework offers flexible control over content-style disentanglement and trade-off at both training and testing stages by adjusting diffusion model parameters. It also exhibits potential for extensions such as photo-realistic style transfer, multi-modal style manipulation, and diversified style transfer. Limitations include the requirement for fine-tuning for each style, relatively slower inference due to diffusion models, and some failure cases like vanishing salient content or biased color distribution. Future work includes exploring arbitrary style transfer, accelerating diffusion sampling, and addressing the identified failure cases. Additionally, applying the framework to other image translation and manipulation tasks is another potential direction. diffusion_model, style_transfer, disentanglement, clip, analysis, image_manipulation, photorealistic, multi-modal
2309.14564 Generative Escher Meshes Noam Aigerman, Thibault Groueix This paper proposes a fully-automatic, text-guided generative method for producing periodic, repeating, tile-able 2D art, such as the one seen on floors, mosaics, ceramics, and the work of M.C. Escher. In contrast to the standard concept of a seamless texture, i.e., square images that are seamless when tiled, our method generates non-square tilings which comprise solely of repeating copies of the same object. It achieves this by optimizing both geometry and color of a 2D mesh, in order to generate a non-square tile in the shape and appearance of the desired object, with close to no additional background details. We enable geometric optimization of tilings by our key technical contribution: an unconstrained, differentiable parameterization of the space of all possible tileable shapes for a given symmetry group. Namely, we prove that modifying the laplacian used in a 2D mesh-mapping technique - Orbifold Tutte Embedding - can achieve all possible tiling configurations for a chosen planar symmetry group. We thus consider both the mesh's tile-shape and its texture as optimizable parameters, rendering the textured mesh via a differentiable renderer. We leverage a trained image diffusion model to define a loss on the resulting image, thereby updating the mesh's parameters based on its appearance matching the text prompt. We show our method is able to produce plausible, appealing results, with non-trivial tiles, for a variety of different periodic tiling patterns. This paper presents a novel method for generating tileable, non-square 2D art, similar to the works of M.C. Escher, by combining mesh deformation, texture optimization, and text-guided diffusion models. The ability to automatically generate appealing and complex tiling patterns has significant implications for various fields, including art, design, and architecture, while also offering a new approach to exploring the space of tileable shapes. The authors represent the tile as a textured 2D mesh and leverage Orbifold Tutte Embeddings (OTE) to ensure tileability while optimizing mesh vertices. They use a differentiable renderer to generate an image of the tile and apply Score Distillation Sampling (SDS) with a pre-trained diffusion model to guide the optimization towards matching a user-provided text prompt. The method successfully produces a wide variety of compelling tileable shapes with different symmetries, demonstrating its ability to generate complex and plausible imagery from text prompts while adhering to strict geometric constraints. Limitations include the restriction to wallpaper group tilings, difficulty in generating complex multi-object scenes, and reliance on SDS, which has limitations in speed, color saturation, and controllability. Future work could explore extensions to aperiodic tilings, multi-object tile generation, and integration with more advanced text-guided image generation techniques. diffusion_model, 2d, tiling, mesh, generative, text-guided, ote, sds
2312.04410 Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models Jiayi Guo, Xingqian Xu, Yifan Pu, Zanlin Ni, Chaofei Wang, Manushree Vasu, Shiji Song, Gao Huang, Humphrey Shi Recently, diffusion models have made remarkable progress in text-to-image (T2I) generation, synthesizing images with high fidelity and diverse contents. Despite this advancement, latent space smoothness within diffusion models remains largely unexplored. Smooth latent spaces ensure that a perturbation on an input latent corresponds to a steady change in the output image. This property proves beneficial in downstream tasks, including image interpolation, inversion, and editing. In this work, we expose the non-smoothness of diffusion latent spaces by observing noticeable visual fluctuations resulting from minor latent variations. To tackle this issue, we propose Smooth Diffusion, a new category of diffusion models that can be simultaneously high-performing and smooth. Specifically, we introduce Step-wise Variation Regularization to enforce the proportion between the variations of an arbitrary input latent and that of the output image is a constant at any diffusion training step. In addition, we devise an interpolation standard deviation (ISTD) metric to effectively assess the latent space smoothness of a diffusion model. Extensive quantitative and qualitative experiments demonstrate that Smooth Diffusion stands out as a more desirable solution not only in T2I generation but also across various downstream tasks. Smooth Diffusion is implemented as a plug-and-play Smooth-LoRA to work with various community models. Code is available at https://github.com/SHI-Labs/Smooth-Diffusion. This paper introduces Smooth Diffusion, a novel diffusion model architecture that aims to improve the smoothness of the latent space in text-to-image generation tasks for enhanced performance in downstream tasks like image interpolation, inversion, and editing. This paper is important because it addresses the limitations of current diffusion models in terms of latent space smoothness, which hinder the quality of downstream tasks. By proposing Smooth Diffusion with a novel regularization technique, this work paves the way for higher-quality and more controllable image generation and manipulation. The authors propose Smooth Diffusion, which introduces Step-wise Variation Regularization to enforce a constant ratio between variations in input latent code and the output image at every training step. They train Smooth Diffusion on top of Stable Diffusion using the LAION Aesthetics 6.5+ dataset and a LoRA fine-tuning technique. To assess the smoothness, they propose a new metric, Interpolation Standard Deviation (ISTD), and compare Smooth Diffusion with Stable Diffusion and other state-of-the-art methods on various downstream tasks qualitatively and quantitatively using metrics such as FID, CLIP Score, MSE, LPIPS, SSIM, and PSNR. Smooth Diffusion demonstrates significantly smoother latent space interpolation compared to Stable Diffusion, evidenced by lower ISTD scores and smoother visual transitions. Furthermore, Smooth Diffusion shows superior performance in image inversion and reconstruction, particularly when using DDIM inversion, and achieves better preservation of unedited content in both text-based and drag-based image editing tasks.
The authors acknowledge that the effectiveness of the Smooth Diffusion's LoRA component, while adaptable to other models with the same architecture as Stable Diffusion, is not guaranteed and requires further investigation. Additionally, the paper suggests exploring the application of Smooth Diffusion to more challenging tasks, such as video generation, as a potential area for future work. diffusion_model, text-to-image, latent_space, smoothness, image_interpolation, image_inversion, image_editing, lora
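A hedged sketch of a step-wise variation regularizer in the spirit described above. This is one plausible reading of "enforcing a constant ratio between input-latent variation and output-image variation", not the authors' exact formulation; the denoiser signature, the perturbation scale, and the use of a detached batch mean as the target constant are all assumptions.

```python
import torch

def variation_regularizer(denoiser, z_t, t, cond, eps=1e-2):
    """Penalize deviation of the output-change / input-change ratio from a shared constant."""
    delta = eps * torch.randn_like(z_t)                 # small perturbation of the input latent
    out = denoiser(z_t, t, cond)                        # e.g. predicted clean image at step t
    out_shifted = denoiser(z_t + delta, t, cond)
    ratio = (out_shifted - out).flatten(1).norm(dim=1) / delta.flatten(1).norm(dim=1)
    target = ratio.mean().detach()                      # stand-in for the constant of proportionality
    return ((ratio - target) ** 2).mean()
```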
2311.03335 Cross-Image Attention for Zero-Shot Appearance Transfer Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or Recent advancements in text-to-image generative models have demonstrated a remarkable ability to capture a deep semantic understanding of images. In this work, we leverage this semantic knowledge to transfer the visual appearance between objects that share similar semantics but may differ significantly in shape. To achieve this, we build upon the self-attention layers of these generative models and introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images. Specifically, given a pair of images -- one depicting the target structure and the other specifying the desired appearance -- our cross-image attention combines the queries corresponding to the structure image with the keys and values of the appearance image. This operation, when applied during the denoising process, leverages the established semantic correspondences to generate an image combining the desired structure and appearance. In addition, to improve the output image quality, we harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process. Importantly, our approach is zero-shot, requiring no optimization or training. Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images. This paper presents a zero-shot approach for transferring visual appearance between objects in different images, leveraging the semantic knowledge encoded within pretrained text-to-image diffusion models. This paper is significant because it offers a novel method for appearance transfer that doesn't require training a new model or per-image optimization, unlike existing approaches. It leverages the power of pretrained diffusion models and their ability to capture semantic correspondences between images, even across different object categories. The authors introduce a 'Cross-Image Attention' mechanism that replaces the standard self-attention layers within the denoising network of a diffusion model. By combining queries from the structure image with keys and values from the appearance image, the model implicitly learns to transfer visual features. To improve the transfer quality, they employ techniques like attention map contrasting, appearance guidance, and AdaIN normalization. The paper demonstrates high-quality appearance transfer results across various object domains, including challenging cases with variations in object shape, viewpoint, and even different object categories. Qualitative and quantitative comparisons with existing techniques like Swapping Autoencoders, SpliceVIT, and DiffuseIT show that their method achieves a better balance between structure preservation and accurate appearance transfer. A user study further confirms these findings, highlighting the superior quality and appearance fidelity of the generated images. The authors acknowledge limitations related to the model's ability to establish accurate correspondences, especially between semantically dissimilar objects. Additionally, the success of the transfer relies on accurate inversion of input images into the diffusion model's latent space, which can be sensitive to the inversion process and random seeds. 
Future work could focus on improving the robustness of cross-domain transfer and enhancing the inversion techniques for more reliable and editable latent codes. diffusion_model, appearance_transfer, semantic_correspondence, zero-shot, image_manipulation, self-attention, denoising_diffusion_model
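The cross-image attention operation described above is concrete enough to sketch: inside a self-attention layer of the denoising network, queries come from the structure image's features while keys and values come from the appearance image's features. The sketch below shows only that attention computation; projection layers, head count, and shapes are illustrative assumptions (D must be divisible by num_heads).

```python
import torch

def cross_image_attention(q_struct, k_app, v_app, num_heads=8):
    """q_struct: (B, N, D) structure-branch features; k_app, v_app: (B, M, D) appearance-branch features."""
    B, N, D = q_struct.shape
    d = D // num_heads

    def split(x):  # (B, L, D) -> (B, H, L, d)
        return x.view(B, x.shape[1], num_heads, d).transpose(1, 2)

    q, k, v = split(q_struct), split(k_app), split(v_app)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # (B, H, N, M)
    out = (attn @ v).transpose(1, 2).reshape(B, N, D)
    return out  # structure-aligned features that carry the appearance image's visual content
```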
2405.04404 Vision Mamba: A Comprehensive Survey and Taxonomy Xiao Liu, Chenxu Zhang, Lei Zhang State Space Model (SSM) is a mathematical model used to describe and analyze the behavior of dynamic systems. This model has witnessed numerous applications in several fields, including control theory, signal processing, economics and machine learning. In the field of deep learning, state space models are used to process sequence data, such as time series analysis, natural language processing (NLP) and video understanding. By mapping sequence data to state space, long-term dependencies in the data can be better captured. In particular, modern SSMs have shown strong representational capabilities in NLP, especially in long sequence modeling, while maintaining linear time complexity. Notably, based on the latest state-space models, Mamba merges time-varying parameters into SSMs and formulates a hardware-aware algorithm for efficient training and inference. Given its impressive efficiency and strong long-range dependency modeling capability, Mamba is expected to become a new AI architecture that may outperform Transformer. Recently, a number of works have attempted to study the potential of Mamba in various fields, such as general vision, multi-modal, medical image analysis and remote sensing image analysis, by extending Mamba from natural language domain to visual domain. To fully understand Mamba in the visual domain, we conduct a comprehensive survey and present a taxonomy study. This survey focuses on Mamba's application to a variety of visual tasks and data types, and discusses its predecessors, recent advances and far-reaching impact on a wide range of domains. Since Mamba is now on an upward trend, please actively notice us if you have new findings, and new progress on Mamba will be included in this survey in a timely manner and updated on the Mamba project at https://github.com/lx6c78/Vision-Mamba-A-Comprehensive-Survey-and-Taxonomy. This paper presents a comprehensive survey of Mamba, a novel deep learning architecture based on state space models (SSMs), and its applications in various computer vision tasks. This survey is important because it provides a timely and comprehensive overview of Mamba, which is rapidly gaining traction in the computer vision community as a more efficient alternative to Transformers and CNNs, particularly for processing long sequences and high-resolution images. The authors conduct their research by reviewing existing literature on Mamba and categorizing its variants based on their application in different vision tasks, including general vision, multi-modal learning, and vertical domains like remote sensing and medical image analysis. The paper highlights the successful implementation of Mamba across a wide spectrum of vision tasks, showcasing its superior performance in terms of efficiency, accuracy, and memory usage compared to traditional architectures. Key results include state-of-the-art performance achieved by Mamba variants in image classification, object detection, semantic segmentation, image restoration, 3D vision, and multi-modal tasks. 
The authors identify several limitations and future research directions for Mamba, including the need for new scanning mechanisms to better handle the non-causal nature of visual data, the exploration of synergistic hybrid architectures combining Mamba with other approaches like Transformers, the development of large-scale Mamba models, and its integration with other methodologies such as diffusion models and domain generalization. state_space_model, mamba, computer_vision, image_classification, object_detection, semantic_segmentation, image_restoration, 3d, multi-modal, remote_sensing, medical_image_analysis, survey, literature_review
2403.12931 You Only Sample Once: Taming One-Step Text-To-Image Synthesis by Self-Cooperative Diffusion GANs Yihong Luo, Xiaolong Chen, Jing Tang We introduce YOSO, a novel generative model designed for rapid, scalable, and high-fidelity one-step image synthesis. This is achieved by integrating the diffusion process with GANs. Specifically, we smooth the distribution by the denoising generator itself, performing self-cooperative learning. We show that our method can serve as a one-step generation model training from scratch with competitive performance. Moreover, we show that our method can be extended to finetune pre-trained text-to-image diffusion for high-quality one-step text-to-image synthesis even with LoRA fine-tuning. In particular, we provide the first diffusion transformer that can generate images in one step trained on 512 resolution, with the capability of adapting to 1024 resolution without explicit training. Our code is provided at https://github.com/Luo-Yihong/YOSO. This paper introduces YOSO, a novel one-step image synthesis model that integrates diffusion models with Generative Adversarial Networks (GANs) for rapid, scalable, and high-fidelity image generation. This paper is important because it addresses the limitations of traditional diffusion models, which require iterative denoising and suffer from slow generation speed. YOSO offers a solution by enabling one-step generation without compromising image quality, making it highly relevant for practical applications. The authors propose a self-cooperative learning approach where the generator learns from itself by matching the distribution of generated samples at different levels of corruption. They also introduce several techniques for text-to-image generation, including latent perceptual loss, latent discriminator, and fixing the noise scheduler. YOSO achieves competitive performance on unconditional image generation, outperforming other one-step methods and even rivaling multi-step diffusion models. In text-to-image generation, YOSO demonstrates superior image quality, prompt alignment, and mode coverage compared to state-of-the-art one-step models like SD-Turbo and SDXL-Turbo. Notably, YOSO-LoRA, a fine-tuned version, achieves impressive results with only LoRA fine-tuning, showcasing its efficiency. Furthermore, YOSO exhibits promising compatibility with downstream tasks such as image-to-image editing and ControlNet. The authors acknowledge limitations in fine-tuning on datasets different from the pre-trained model's training set, leading to distribution shift. They suggest training on larger and more diverse datasets like LAION to address this issue. Additionally, exploring more advanced noise scheduler adaptation techniques and expanding YOSO's application in various downstream tasks are highlighted as future work. diffusion_model, gan, image_synthesis, one-step_generation, text-to-image, lora, self-cooperative_learning, latent_perceptual_loss, latent_discriminator
2311.15127 Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at https://github.com/Stability-AI/generative-models . This paper introduces Stable Video Diffusion (SVD), a latent diffusion model for generating high-resolution videos from text or image prompts. This paper addresses the lack of focus on data selection in video generation research by demonstrating the significant impact of systematic data curation on the quality of generated videos, leading to state-of-the-art results in text-to-video and image-to-video synthesis. The authors develop a three-stage training strategy: 1) Image pretraining using Stable Diffusion 2.1, 2) Video pretraining on a large, curated dataset at low resolution, and 3) High-resolution video finetuning on a smaller, high-quality dataset. They also employ techniques like EDM-preconditioning, classifier-free guidance, and temporal attention layers. The resulting SVD model excels at generating high-resolution videos from text and image prompts, outperforming existing models in quality and motion representation. It also demonstrates strong multi-view consistency, making it suitable for multi-view synthesis with superior results compared to specialized methods like Zero123XL and SyncDreamer. While successful in short video generation, SVD faces limitations in long-form video synthesis due to computational costs and occasional lack of motion in generated videos. Future work could explore cascaded frame generation, dedicated video tokenizers, and diffusion distillation for faster inference and long-form generation. diffusion_model, video, text-to-video, image-to-video, 3d, motion, multi-view, data_curation
2405.04795 Variational Schrödinger Diffusion Models Wei Deng, Weijian Luo, Yixin Tan, Marin Biloš, Yu Chen, Yuriy Nevmyvaka, Ricky T. Q. Chen Schrödinger bridge (SB) has emerged as the go-to method for optimizing transportation plans in diffusion models. However, SB requires estimating the intractable forward score functions, inevitably resulting in the costly implicit training loss based on simulated trajectories. To improve the scalability while preserving efficient transportation plans, we leverage variational inference to linearize the forward score functions (variational scores) of SB and restore simulation-free properties in training backward scores. We propose the variational Schrödinger diffusion model (VSDM), where the forward process is a multivariate diffusion and the variational scores are adaptively optimized for efficient transport. Theoretically, we use stochastic approximation to prove the convergence of the variational scores and show the convergence of the adaptively generated samples based on the optimal variational scores. Empirically, we test the algorithm in simulated examples and observe that VSDM is efficient in generations of anisotropic shapes and yields straighter sample trajectories compared to the single-variate diffusion. We also verify the scalability of the algorithm in real-world data and achieve competitive unconditional generation performance in CIFAR10 and conditional generation in time series modeling. Notably, VSDM no longer depends on warm-up initializations and has become tuning-friendly in training large-scale experiments. This paper presents Variational Schrödinger Diffusion Model (VSDM), a novel diffusion model that leverages variational inference to enhance the scalability of Schrödinger bridge (SB) for optimizing transportation plans, while preserving efficient transport. While SB offers optimal transport guarantees, it faces scalability limitations due to the need for costly simulated trajectories. VSDM overcomes this by linearizing forward score functions, leading to closed-form updates and enabling simulation-free training of backward score functions. This enhances scalability and makes the algorithm more tuning-friendly for large-scale experiments. The authors employ variational inference to approximate the forward score function in SB using a locally linear function, leading to the variational FB-SDE. They then utilize a multivariate OU process for the forward diffusion and derive closed-form expressions for the backward score function. They also use stochastic approximation to adaptively optimize the variational score for efficient transport. VSDM demonstrates effectiveness in generating anisotropic shapes and produces straighter sample trajectories, indicating more efficient transport, compared to single-variate diffusions. It achieves competitive performance in image generation on CIFAR10 and conditional time series modeling, all without relying on warm-up initializations. Furthermore, VSDM is observed to be significantly faster than the original SB with nonlinear forward scores. The paper acknowledges that linearizing the forward score function inevitably results in sub-optimal transport in general cases. Future work includes exploring critically damped (momentum) acceleration and Hessian approximations to develop advanced optimization techniques akin to "ADAM" for diffusion models.
diffusion_model, optimal_transport, variational_inference, stochastic_approximation, schrodinger_bridge, simulation-free, image_generation, time_series_forecasting
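A schematic of the linearization described above, written with assumed (generic Schrödinger-bridge) notation rather than symbols copied from the paper.

```latex
% Schrödinger-bridge forward SDE with an intractable forward score \nabla\log\psi_t,
% and the linear (variational) family VSDM restricts it to:
\[
  d\mathbf{x}_t \;=\; \bigl[f(\mathbf{x}_t,t) + g(t)^2\,\nabla\log\psi_t(\mathbf{x}_t)\bigr]\,dt
  \;+\; g(t)\,d\mathbf{W}_t,
  \qquad
  \nabla\log\psi_t(\mathbf{x}) \;\approx\; A_t\,\mathbf{x}.
\]
% With the linear score, the forward process is a multivariate OU-type diffusion with
% closed-form Gaussian transitions, so the backward score can be trained simulation-free
% while A_t is adapted by stochastic approximation toward efficient transport.
```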
2311.13231 Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, Xiu Li Using reinforcement learning with human feedback (RLHF) has shown significant promise in fine-tuning diffusion models. Previous methods start by training a reward model that aligns with human preferences, then leverage RL techniques to fine-tune the underlying models. However, crafting an efficient reward model demands extensive datasets, optimal architecture, and manual hyperparameter tuning, making the process both time and cost-intensive. The direct preference optimization (DPO) method, effective in fine-tuning large language models, eliminates the necessity for a reward model. However, the extensive GPU memory requirement of the diffusion model's denoising process hinders the direct application of the DPO method. To address this issue, we introduce the Direct Preference for Denoising Diffusion Policy Optimization (D3PO) method to directly fine-tune diffusion models. The theoretical analysis demonstrates that although D3PO omits training a reward model, it effectively functions as the optimal reward model trained using human feedback data to guide the learning process. This approach requires no training of a reward model, proving to be more direct, cost-effective, and minimizing computational overhead. In experiments, our method uses the relative scale of objectives as a proxy for human preference, delivering comparable results to methods using ground-truth rewards. Moreover, D3PO demonstrates the ability to reduce image distortion rates and generate safer images, overcoming challenges lacking robust reward models. Our code is publicly available at https://github.com/yk7333/D3PO. This paper introduces D3PO, a novel method for directly fine-tuning diffusion models using human feedback without relying on a separate reward model, addressing the limitations of traditional RLHF methods in this domain. This research is important because it offers a more efficient and cost-effective approach to aligning diffusion models with human preferences, potentially impacting diverse applications like image generation, by eliminating the resource-intensive task of training a separate reward model. The authors reinterpret the denoising process of diffusion models as a multi-step Markov Decision Process (MDP). They then extend the Direct Preference Optimization (DPO) framework, originally designed for Large Language Models, to this MDP. This allows them to directly update the model's policy based on human preferences, bypassing the need for a reward model. D3PO demonstrated comparable or superior performance to methods relying on reward models in tasks like image compressibility and aesthetic quality. It also proved effective in challenging scenarios without a reward model, successfully reducing image distortions, enhancing image safety, and improving prompt-image alignment. The paper acknowledges the limitations stemming from assumptions like the normality of expected return and the use of relative reward sizes. Future work may explore relaxing these assumptions and investigating the effectiveness of D3PO in more complex real-world applications. diffusion_model, rlhf, dpo, image_generation, human_feedback, image_quality, safety, prompt-image_alignment
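The summary above frames denoising as a multi-step MDP and applies a DPO-style preference objective to it. Below is a schematic of that idea, not the authors' exact implementation: each denoising step is treated as a Gaussian policy action, and preferred (w) versus dispreferred (l) trajectories from the same prompt are compared against a frozen reference model. Function names and the choice to drop the Gaussian constant term are assumptions.

```python
import torch
import torch.nn.functional as F

def step_logprob(x_next, mean, std):
    """Gaussian log-probability of one denoising transition (additive constant dropped)."""
    std = torch.as_tensor(std, dtype=mean.dtype, device=mean.device)
    return (-((x_next - mean) ** 2) / (2 * std ** 2) - torch.log(std)).flatten(1).sum(dim=1)

def d3po_style_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """Inputs are per-sample log-probs of a denoising step (or a sum over steps)."""
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return -F.logsigmoid(margin).mean()
```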
2404.07554 CAT: Contrastive Adapter Training for Personalized Image Generation Jae Wan Park, Sang Hyun Park, Jun Young Koh, Junha Lee, Min Song The emergence of various adapters, including Low-Rank Adaptation (LoRA) applied from the field of natural language processing, has allowed diffusion models to personalize image generation at a low cost. However, due to the various challenges including limited datasets and shortage of regularization and computation resources, adapter training often results in unsatisfactory outcomes, leading to the corruption of the backbone model's prior knowledge. One of the well known phenomena is the loss of diversity in object generation, especially within the same class which leads to generating almost identical objects with minor variations. This poses challenges in generation capabilities. To solve this issue, we present Contrastive Adapter Training (CAT), a simple yet effective strategy to enhance adapter training through the application of CAT loss. Our approach facilitates the preservation of the base model's original knowledge when the model initiates adapters. Furthermore, we introduce the Knowledge Preservation Score (KPS) to evaluate CAT's ability to keep the former information. We qualitatively and quantitatively compare CAT's improvement. Finally, we mention the possibility of CAT in the aspects of multi-concept adapter and optimization. This paper introduces CAT (Contrastive Adapter Training), a method for personalized image generation using diffusion models that leverages a contrastive loss function to preserve the base model's knowledge while training adapters, improving upon existing methods like LoRA and Dreambooth. The paper addresses the limitations of current personalized image generation techniques, which often lead to knowledge corruption and underfitting in diffusion models, by proposing a novel training pipeline that combines contrastive learning with adapter training, resulting in better preservation of the original model's capabilities and more diverse and controllable generation. The authors propose CAT, which adds a contrastive loss term to the adapter training objective. This loss encourages the adapted model's noise predictions to be similar to the original model's predictions when no trigger token is present, ensuring the preservation of the base model's knowledge. The method is evaluated using established metrics like prompt similarity and identity similarity, alongside a newly introduced metric called Knowledge Preservation Score (KPS) to quantify knowledge retention. CAT outperforms existing adapter training methods in preserving the original model’s knowledge while achieving comparable identity generation fidelity. This is demonstrated through quantitative results using metrics like KPS and qualitative comparisons of generated images, showcasing CAT's ability to maintain diversity and avoid mode collapse. The paper acknowledges limitations in evaluating diversity and fidelity due to the instability of CLIP-based scores and the lack of investigation into the impact of domain discrepancies between the model and training data. Future work aims to establish a reliable benchmark for consistent character generation, explore the impact of CAT's structure and application more thoroughly, and expand CAT to support multi-concept training with per-token loss for enhanced multi-concept generation. diffusion_model, adapter, lora, dreambooth, personalization, image_generation, contrastive_learning, knowledge_preservation
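A hedged sketch of the training objective described above, as a reading of the summary rather than the authors' code: the usual denoising loss with the trigger-token prompt, plus a preservation term that ties the adapted model's noise prediction on a plain (trigger-free) prompt to the frozen base model's prediction. The UNet call signature and the weighting `lam` are assumptions.

```python
import torch
import torch.nn.functional as F

def cat_style_loss(adapted_unet, base_unet, x_t, t, noise, cond_trigger, cond_plain, lam=1.0):
    fit = F.mse_loss(adapted_unet(x_t, t, cond_trigger), noise)       # learn the new concept
    with torch.no_grad():
        base_pred = base_unet(x_t, t, cond_plain)                     # frozen prior, no trigger token
    keep = F.mse_loss(adapted_unet(x_t, t, cond_plain), base_pred)    # preserve the base model's knowledge
    return fit + lam * keep
```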
2312.02116 GIVT: Generative Infinite-Vocabulary Transformers Michael Tschannen, Cian Eastwood, Fabian Mentzer We introduce generative infinite-vocabulary transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a β-VAE. In class-conditional image generation GIVT outperforms VQ-GAN (and improved variants thereof) as well as MaskGIT, and achieves performance competitive with recent latent diffusion models. Finally, we obtain strong results outside of image generation when applying GIVT to panoptic segmentation and depth estimation with a VAE variant of the UViM framework. This paper introduces GIVT (Generative Infinite-Vocabulary Transformer), a novel transformer decoder-only architecture capable of generating sequences of real-valued vectors, eliminating the need for quantization used in previous methods like VQ-GAN and MaskGIT. This work is significant as it presents the first successful attempt at utilizing transformer decoders for generating continuous, unquantized vector sequences, thereby avoiding limitations associated with VQ-based methods. It paves the way for more efficient and higher-quality image generation and representation learning, while also being directly applicable to multimodal interleaved modeling. The authors modify the standard transformer decoder architecture by replacing the input embedding lookup table with a linear projection layer for real-valued vectors and predicting parameters of a Gaussian Mixture Model (GMM) at the output. They train GIVT on the latent space of a β-VAE using teacher forcing and masked language modeling approaches, exploring various sampling techniques like temperature sampling, beam search, and a novel distribution-based classifier-free guidance (DB-CFG). GIVT outperforms VQ-GAN, MaskGIT, and some diffusion models in class-conditional image generation on ImageNet, achieving comparable image quality with a smaller model size and faster sampling. Notably, GIVT demonstrates competitive performance in representation learning and dense prediction tasks like panoptic segmentation and depth estimation using the UViM framework. Limitations include the challenge of end-to-end training of VAE and GIVT, which is left for future work. The authors suggest exploring applications of GIVT to other data modalities like audio and time-series modeling. diffusion_model, gan, vae, transformer, image_generation, representation_learning, panoptic_segmentation, depth_estimation, gmm, classifier-free_guidance
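The two modifications described above map cleanly to code: a linear projection replaces the token embedding table at the input, and the output head predicts mixture weights, means, and scales of a GMM over the next real-valued vector instead of vocabulary logits. The sketch below shows only these two heads; layer sizes, the diagonal-covariance parameterization, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GIVTHeads(nn.Module):
    def __init__(self, latent_dim=32, d_model=512, n_mix=16):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, d_model)                 # replaces the embedding lookup table
        self.out_proj = nn.Linear(d_model, n_mix * (1 + 2 * latent_dim))
        self.n_mix, self.latent_dim = n_mix, latent_dim

    def embed(self, x):                                               # x: (B, T, latent_dim) real-valued inputs
        return self.in_proj(x)

    def mixture(self, h):                                             # h: (B, T, d_model) decoder outputs
        p = self.out_proj(h)
        logits = p[..., : self.n_mix]                                 # mixture weights
        mu, log_sigma = p[..., self.n_mix :].chunk(2, dim=-1)
        mu = mu.view(*h.shape[:-1], self.n_mix, self.latent_dim)
        sigma = log_sigma.view(*h.shape[:-1], self.n_mix, self.latent_dim).exp()
        return logits, mu, sigma                                      # trained with the GMM negative log-likelihood
```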
2401.05293 Score Distillation Sampling with Learned Manifold Corrective Thiemo Alldieck, Nikos Kolotouros, Cristian Sminchisescu Score Distillation Sampling (SDS) is a recent but already widely popular method that relies on an image diffusion model to control optimization problems using text prompts. In this paper, we conduct an in-depth analysis of the SDS loss function, identify an inherent problem with its formulation, and propose a surprisingly easy but effective fix. Specifically, we decompose the loss into different factors and isolate the component responsible for noisy gradients. In the original formulation, high text guidance is used to account for the noise, leading to unwanted side effects. Instead, we train a shallow network mimicking the timestep-dependent denoising deficiency of the image diffusion model in order to effectively factor it out. We demonstrate the versatility and the effectiveness of our novel loss formulation through several qualitative and quantitative experiments, including optimization-based image synthesis and editing, zero-shot image translation network training, and text-to-3D synthesis. This paper presents an analysis of the Score Distillation Sampling (SDS) loss function, identifies a noise issue in its gradients, and proposes a solution called Learned Manifold Corrective SDS (LMC-SDS) to improve gradient quality and reduce reliance on high guidance weights. This paper is important because it addresses limitations of SDS, a popular method for using pre-trained diffusion models as priors in various tasks like image synthesis, editing, and 3D generation. By improving the SDS loss, it enables more stable optimization, better image fidelity, and wider applicability. The authors decompose the SDS loss, identify a problematic term causing noisy gradients, and propose LMC-SDS to model and factor out the time-step dependent image corruption in the denoising process. They train a shallow network to approximate this corruption and use it to correct the gradients, promoting movement towards the manifold of natural images. They demonstrate LMC-SDS effectiveness through qualitative and quantitative experiments on image synthesis, editing, image translation network training, and 3D asset generation. The proposed LMC-SDS loss leads to: 1) More stable optimization with less reliance on high guidance weights, resulting in less saturated colors and fewer artifacts. 2) Higher fidelity results in image synthesis and editing tasks, better preserving image structure while achieving significant edits. 3) Improved performance in training image-to-image translation networks, as demonstrated by the 'cats-to-others' experiment. 4) Enhanced detail and reduced Janus problem in 3D asset generation using DreamFusion. The paper acknowledges limitations in LMC-SDS, where it might not perform well if the diffusion model doesn't understand the prompt or if the optimization strays too far from the natural image manifold. Future work includes further improving the manifold corrective and applying the findings to specific applications like text-to-3D and image editing. diffusion_model, analysis, image_synthesis, image_editing, 3d, text-to-3d, optimization, loss_function, denoising
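For reference, the standard SDS gradient the analysis above starts from, written in commonly used notation; the sentence below the formula paraphrases the paper's decomposition rather than quoting it.

```latex
\[
  \nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  \;=\;
  \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,
  \bigl(\hat{\epsilon}_\phi(\mathbf{x}_t;\, y,\, t) - \epsilon\bigr)\,
  \frac{\partial \mathbf{x}}{\partial \theta} \right].
\]
% The residual \(\hat{\epsilon}_\phi - \epsilon\) mixes a useful component pointing toward the
% image manifold with a timestep-dependent denoising error; LMC-SDS trains a shallow network to
% model that error and factors it out, so high guidance weights are no longer needed to mask it.
```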
2402.15120 Fine-tuning CLIP Text Encoders with Two-step Paraphrasing Hyunjae Kim, Seunghyun Yoon, Trung Bui, Handong Zhao, Quan Tran, Franck Dernoncourt, Jaewoo Kang Contrastive language-image pre-training (CLIP) models have demonstrated considerable success across various vision-language tasks, such as text-to-image retrieval, where the model is required to effectively process natural language input to produce an accurate visual output. However, current models still face limitations in dealing with linguistic variations in input queries, such as paraphrases, making it challenging to handle a broad range of user queries in real-world applications. In this study, we introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases. Our approach involves a two-step paraphrase generation process, where we automatically create two categories of paraphrases from web-scale image captions by leveraging large language models. Subsequently, we fine-tune the CLIP text encoder using these generated paraphrases while freezing the image encoder. Our resulting model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks, including paraphrased retrieval (with rank similarity scores improved by up to 2.0% and 5.6%), Visual Genome Relation and Attribution, as well as seven semantic textual similarity tasks. This paper presents ParaCLIP, a fine-tuning approach for CLIP models that enhances their understanding and handling of paraphrased text inputs by leveraging synthetic paraphrases generated from large language models. This work addresses the challenge of linguistic variation in text inputs for vision-language tasks, which limits the robustness of existing CLIP models in real-world applications. ParaCLIP improves the representation of paraphrases in CLIP's text encoder, leading to better performance in tasks requiring semantic understanding and compositionality. The authors propose a two-step paraphrasing process using LLMs (ChatGPT, LLaMA) to generate two categories of paraphrases for image captions. Then, they fine-tune the CLIP text encoder with these paraphrases while keeping the image encoder frozen. The training objective consists of three InfoNCE losses: image-paraphrase, caption-paraphrase, and paraphrase-paraphrase. ParaCLIP consistently outperforms baseline CLIP models in tasks like paraphrased retrieval, Visual Genome Relation and Attribution, and semantic textual similarity. Notably, it significantly improves average overlap and Jaccard similarity scores in paraphrased retrieval, indicating better handling of linguistic variations. The ablation study highlights the importance of each loss function in achieving balanced performance across different tasks. The authors acknowledge that their method may sometimes degrade performance on standard vision and vision-language tasks like zero-shot classification and image retrieval, possibly due to limitations in computational resources to use large batch sizes during fine-tuning. Future work involves investigating factors contributing to this performance degradation and exploring the potential of the approach to address compositional understanding limitations in CLIP models. clip, paraphrase, fine-tuning, llm, vision-language, image_retrieval, semantic_textual_similarity
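A minimal sketch of the training objective described above, as a reading of the summary rather than the authors' code: three symmetric InfoNCE terms over in-batch pairs (image-paraphrase, caption-paraphrase, paraphrase-paraphrase), applied while the image encoder stays frozen. Temperature, weighting, and function names are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE over embeddings a, b of shape (B, D), matched by batch index."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    labels = torch.arange(a.shape[0], device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def paraclip_loss(img_emb, cap_emb, para1_emb, para2_emb):
    return (info_nce(img_emb, para1_emb)       # image - paraphrase
            + info_nce(cap_emb, para1_emb)     # caption - paraphrase
            + info_nce(para1_emb, para2_emb))  # paraphrase - paraphrase
```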
2311.17009 Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, Tali Dekel We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video's motion and scene layout. Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable for limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits. This paper introduces a novel method for text-driven motion transfer in videos, enabling the transfer of motion from a source video to a target object specified by a text prompt, even when the source and target objects have significant differences in shape and motion characteristics. This paper pushes the boundaries of motion transfer beyond previous methods limited to similar object categories. It offers a zero-shot approach, leveraging the generative capabilities of pre-trained text-to-video diffusion models for a more versatile and accessible motion transfer solution. The authors analyze the space-time features learned by a text-to-video diffusion model and introduce a novel loss function based on pairwise differences of spatial marginal mean features. This loss guides the generation process to preserve motion characteristics while accommodating significant structural deviations between source and target objects. The proposed method demonstrates state-of-the-art performance in preserving motion fidelity while adhering to the target text prompt. It outperforms existing methods in qualitative and quantitative comparisons, showcasing successful motion transfer across diverse object categories with significant shape variations. User studies further confirm the superiority of the generated videos, highlighting their improved quality and adherence to the target prompts. The method's reliance on the pre-trained text-to-video model's generative capabilities poses limitations. The model's training data might not encompass all possible object-motion combinations, leading to reduced motion fidelity or artifacts. Future work could explore larger and more diverse training datasets for text-to-video models and investigate alternative optimization strategies to further enhance motion fidelity in challenging cases. diffusion_model, motion, video, text-to-video, motion_transfer, zero-shot
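A hedged sketch of the space-time feature loss described above, as an interpretation of the summary rather than the released code: per-frame diffusion features are reduced to spatial marginal means, pairwise differences across frames capture coarse motion, and the generated video is pushed to match the source video's pairwise differences. Shapes and the squared-error reduction are assumptions.

```python
import torch

def smm_pairwise_loss(feat_src, feat_gen):
    """feat_*: (F, C, H, W) space-time diffusion features for F frames."""
    smm_src = feat_src.mean(dim=(2, 3))                    # (F, C) spatial marginal means per frame
    smm_gen = feat_gen.mean(dim=(2, 3))
    diff_src = smm_src[:, None, :] - smm_src[None, :, :]   # (F, F, C) pairwise frame differences
    diff_gen = smm_gen[:, None, :] - smm_gen[None, :, :]
    return ((diff_gen - diff_src) ** 2).mean()
```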
2405.04517 xLSTM: Extended Long Short-Term Memory Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling. The paper introduces Extended Long Short-Term Memory (xLSTM), a novel recurrent neural network architecture that builds upon the original LSTM by introducing exponential gating with memory mixing and a new memory structure, achieving comparable and in many cases better performance than Transformers and State Space Models in language modeling. This paper is important because it revives the LSTM for large language models, showing that LSTMs, when properly scaled and enhanced, can compete with and even surpass the performance of dominant architectures like Transformers and State Space Models, potentially impacting various deep learning fields. The authors introduce two new LSTM variants: sLSTM with exponential gating and a scalar memory, and mLSTM with exponential gating and a matrix memory using a covariance update rule. They integrate these variants into residual blocks, stack them to form xLSTM architectures, and evaluate them on synthetic tasks, the Long Range Arena, and language modeling benchmarks (SlimPajama and PALOMA). The xLSTM architecture outperforms state-of-the-art Transformers, State Space Models, and RNNs in most experiments. Notably, xLSTM excels at sequence length extrapolation, consistently maintaining low perplexity even for longer contexts unseen during training, exhibits superior memory capacity in associative recall tasks, demonstrates strong performance on the Long Range Arena, and achieves state-of-the-art results in perplexity and downstream tasks on both SlimPajama and PALOMA language modeling benchmarks. Limitations: sLSTM lacks parallelizability due to memory mixing; current CUDA kernels for mLSTM are not fully optimized; mLSTM's matrix memory has high computational complexity; initialization of forget gates requires careful consideration; longer context sizes might overload the matrix memory. 
Future work: optimizing CUDA kernels for both sLSTM and mLSTM; exploring alternative memory structures with lower computational complexity; extensive architecture and hyperparameter optimization for larger xLSTM models; application of xLSTM to other deep learning domains beyond language modeling. lstm, language_model, llm, rnn, transformer, state_space_model, gating, memory, analysis, scaling_law, sequence_length_extrapolation
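A simplified recurrent-form sketch of the mLSTM cell summarized above (stabilization of the exponential gates and other details from the paper are omitted; treat this as illustrative only): a matrix memory C is updated with a covariance (outer-product) rule under exponential input gating and read out with a query, normalized by a gate-weighted key accumulator n.

```python
import torch

def mlstm_step(C, n, q, k, v, i_pre, f_pre, o):
    """C: (d, d) matrix memory; n: (d,) normalizer; q/k/v: (d,); i_pre, f_pre: scalar gate pre-activations."""
    i = torch.exp(i_pre)                    # exponential input gate (unstabilized here)
    f = torch.sigmoid(f_pre)                # forget gate (the paper also studies an exponential variant)
    C = f * C + i * torch.outer(v, k)       # covariance update of the matrix memory
    n = f * n + i * k
    h_tilde = C @ q / torch.clamp((n @ q).abs(), min=1.0)
    return C, n, o * h_tilde                # o: (d,) output gate applied elementwise
```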
2404.11614 Dynamic Typography: Bringing Text to Life via Video Diffusion Prior Zichen Liu, Yihao Meng, Hao Ouyang, Yue Yu, Bolin Zhao, Daniel Cohen-Or, Huamin Qu Text animation serves as an expressive medium, transforming static communication into dynamic experiences by infusing words with motion to evoke emotions, emphasize meanings, and construct compelling narratives. Crafting animations that are semantically aware poses significant challenges, demanding expertise in graphic design and animation. We present an automated text animation scheme, termed "Dynamic Typography", which combines two challenging tasks. It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts. Our technique harnesses vector graphics representations and an end-to-end optimization-based framework. This framework employs neural displacement fields to convert letters into base shapes and applies per-frame motion, encouraging coherence with the intended textual concept. Shape preservation techniques and perceptual loss regularization are employed to maintain legibility and structural integrity throughout the animation process. We demonstrate the generalizability of our approach across various text-to-video models and highlight the superiority of our end-to-end methodology over baseline methods, which might comprise separate tasks. Through quantitative and qualitative evaluations, we demonstrate the effectiveness of our framework in generating coherent text animations that faithfully interpret user prompts while maintaining readability. Our code is available at: https://animate-your-word.github.io/demo/. This paper introduces "Dynamic Typography," a method for animating individual letters within words by deforming them to embody semantic meaning and infusing them with vivid movements based on user prompts. This paper is important because it automates the creation of expressive and semantically aware text animations, a task traditionally requiring significant expertise in graphic design and animation. This approach makes text animation more accessible and efficient. The authors use an end-to-end optimization-based framework that leverages vector graphics representations of letters. They employ neural displacement fields to deform letters into base shapes and apply per-frame motion guided by a pre-trained text-to-video model. They ensure legibility and structural integrity using perceptual loss regularization and shape preservation techniques. The proposed method generates consistent and prompt-aware text animations while preserving legibility, outperforming baseline methods in quantitative and qualitative evaluations. The authors demonstrate the generalizability of their approach across various text-to-video models. The authors acknowledge limitations regarding the motion quality being bounded by the capabilities of the video foundation model. Future work could explore incorporating future advancements in diffusion-based video foundation models. Additionally, challenges remain when user prompts significantly deviate from the original letter shapes, requiring further research to balance semantic representation with legibility. diffusion_model, animation, text-to-video, kinetic typography, svg, interpretability
2308.14761 Unified Concept Editing in Diffusion Models Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau Text-to-image models suffer from various safety issues that may limit their suitability for deployment. Previous methods have separately addressed individual issues of bias, copyright, and offensive content in text-to-image models. However, in the real world, all of these issues appear simultaneously in the same model. We present a method that tackles all issues with a single approach. Our method, Unified Concept Editing (UCE), edits the model without training using a closed-form solution, and scales seamlessly to concurrent edits on text-conditional diffusion models. We demonstrate scalable simultaneous debiasing, style erasure, and content moderation by editing text-to-image projections, and we present extensive experiments demonstrating improved efficacy and scalability over prior work. Our code is available at https://unified.baulab.info This paper introduces Unified Concept Editing (UCE), a closed-form model editing method for text-to-image diffusion models that can erase, moderate, and debias multiple concepts simultaneously without retraining. This work addresses limitations in existing methods that handle bias, copyright, and offensive content separately in text-to-image models. UCE provides a unified, efficient, and scalable solution to tackle these issues concurrently, paving the way for safer and more responsible deployment of these models. UCE builds upon prior model editing techniques like TIME and MEMIT, generalizing their closed-form weight update solutions for linear projection layers in diffusion models. By directly modifying cross-attention weights, it aligns text embeddings to manipulate concept generation. The method employs different target output strategies for each edit type: erasing associates concepts with different outputs, debiasing adjusts attribute magnitudes, and moderation replaces outputs with generic responses. UCE demonstrates superior performance in erasing artistic styles while minimizing interference with unrelated concepts, outperforming baselines like ESD and Concept Ablation. It effectively debiases gender and racial biases in profession representations, surpassing existing methods in achieving balanced attribute distributions. Additionally, UCE exhibits comparable or better NSFW content moderation capabilities compared to ESD, while maintaining higher image quality and text-image alignment. The authors acknowledge limitations in addressing compounding biases when debiasing across multiple attributes, as well as challenges posed by compositional bias effects in prompts. They also note that excessive artistic style erasures can degrade overall model performance, suggesting a need to preserve a critical mass of artistic knowledge. Future work could focus on mitigating these limitations, exploring joint attribute debiasing, and developing techniques to handle compositional bias. diffusion_model, gan, analysis, adversarial_attack, interpretability, debias, erasure, moderation
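A hedged sketch of a closed-form cross-attention edit in the spirit of UCE, based on the description above (the paper's exact, possibly regularized formula may differ): the projection W is chosen so that edited concept embeddings map to target values while preserved concept embeddings keep their original outputs, via a least-squares solution.

```python
import torch

def closed_form_edit(W_old, C_edit, V_target, C_preserve):
    """W_old: (out, in); C_edit: (n_e, in); V_target: (n_e, out); C_preserve: (n_p, in)."""
    A = V_target.t() @ C_edit + W_old @ C_preserve.t() @ C_preserve   # (out, in)
    B = C_edit.t() @ C_edit + C_preserve.t() @ C_preserve             # (in, in), symmetric
    return torch.linalg.solve(B, A.t()).t()                           # equivalent to A @ inv(B)
```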
2312.07991 Accelerating the Global Aggregation of Local Explanations Alon Mor, Yonatan Belinkov, Benny Kimelfeld Local explanation methods highlight the input tokens that have a considerable impact on the outcome of classifying the document at hand. For example, the Anchor algorithm applies a statistical analysis of the sensitivity of the classifier to changes in the token. Aggregating local explanations over a dataset provides a global explanation of the model. Such aggregation aims to detect words with the most impact, giving valuable insights about the model, like what it has learned in training and which adversarial examples expose its weaknesses. However, standard aggregation methods bear a high computational cost: a naïve implementation applies a costly algorithm to each token of each document, and hence, it is infeasible for a simple user running in the scope of a short analysis session. We devise techniques for accelerating the global aggregation of the Anchor algorithm. Specifically, our goal is to compute a set of top-k words with the highest global impact according to different aggregation functions. Some of our techniques are lossless and some are lossy. We show that for a very mild loss of quality, we are able to accelerate the computation by up to 30×, reducing the computation from hours to minutes. We also devise and study a probabilistic model that accounts for noise in the Anchor algorithm and diminishes the bias toward words that are frequent yet low in impact. This paper tackles the challenge of efficiently identifying the top-k most impactful words in a document collection for explaining text classifiers, focusing on global aggregation of the Anchor algorithm's local explanations. Global aggregation of local explanations like Anchor is computationally expensive, hindering online analysis. This work provides both a novel probabilistic aggregation method that improves the quality of results and runtime optimizations making it practical for interactive use. The authors propose a probabilistic model (GPR) to estimate the importance of words as explanations, considering frequency and noise. They introduce runtime optimizations including incremental evaluation, candidate filtering, and adjusted hyperparameters for Anchor. Experiments evaluate the quality and speed of their approach across various datasets and classification tasks. GPR consistently outperforms baseline aggregations in identifying impactful terms. Optimizations, particularly increasing the confidence parameter (delta) in Anchor, significantly accelerate computation (up to 30x) with minimal or even positive impact on quality. Case studies demonstrate the interpretability of identified terms. Future work includes extending the approach to multi-word terms, adapting the optimizations to other local attribution methods, and exploring alternative document traversal orders during aggregation. analysis, interpretability, local explanation, global explanation, anchor algorithm, text classification, runtime optimization, anytime algorithm
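As a rough illustration of what "global aggregation with candidate filtering" looks like, the sketch below sums per-document local impact scores into a global top-k list and skips rare tokens that cannot plausibly reach the top-k. The `local_explainer` callable stands in for an Anchor-style routine, and the document-frequency filter is only one of several optimizations the paper studies; the probabilistic GPR model and the delta tuning are not shown.

```python
from collections import Counter, defaultdict

def top_k_global_terms(docs, local_explainer, k=10, min_df=5):
    """Aggregate per-document local explanations into a global top-k list.

    docs:            iterable of tokenized documents.
    local_explainer: callable(doc) -> {token: impact score}, e.g. an
                     Anchor-style routine (placeholder here).
    min_df:          candidate filter -- skip rare tokens to avoid spending
                     the costly explainer budget on words that cannot win.
    """
    df = Counter(tok for doc in docs for tok in set(doc))
    impact = defaultdict(float)
    for doc in docs:
        for tok, score in local_explainer(doc).items():
            if df[tok] >= min_df:        # cheap lossy filter
                impact[tok] += score     # sum aggregation; mean/count also possible
    return sorted(impact.items(), key=lambda kv: kv[1], reverse=True)[:k]
```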
2311.09257 UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs Yanwu Xu, Yang Zhao, Zhisheng Xiao, Tingbo Hou Text-to-image diffusion models have demonstrated remarkable capabilities in transforming textual prompts into coherent images, yet the computational cost of their inference remains a persistent challenge. To address this issue, we present UFOGen, a novel generative model designed for ultra-fast, one-step text-to-image synthesis. In contrast to conventional approaches that focus on improving samplers or employing distillation techniques for diffusion models, UFOGen adopts a hybrid methodology, integrating diffusion models with a GAN objective. Leveraging a newly introduced diffusion-GAN objective and initialization with pre-trained diffusion models, UFOGen excels in efficiently generating high-quality images conditioned on textual descriptions in a single step. Beyond traditional text-to-image generation, UFOGen showcases versatility in applications. Notably, UFOGen stands among the pioneering models enabling one-step text-to-image generation and diverse downstream tasks, presenting a significant advancement in the landscape of efficient generative models. This paper introduces UFOGen, a novel text-to-image generative model that leverages a hybrid approach combining diffusion models with a Generative Adversarial Network (GAN) objective to enable ultra-fast, one-step image generation from text prompts. This paper is important because it addresses a key limitation of traditional text-to-image diffusion models, namely their slow inference speed due to the multi-step denoising process. UFOGen's ability to generate high-quality images in a single step significantly improves efficiency and expands the potential applications of such models. The authors achieve one-step generation by modifying existing diffusion-GAN hybrid models in two key ways: 1) They introduce a new generator parameterization that samples from the forward diffusion process instead of the posterior, allowing for distribution matching at the clean image level. 2) They enhance the reconstruction loss to explicitly match the generated clean image with the target. By initializing UFOGen with a pre-trained Stable Diffusion model, they leverage existing knowledge about text-image relationships and achieve stable training with fast convergence. UFOGen successfully generates high-quality images from text prompts in a single step, outperforming existing few-step diffusion models like Progressive Distillation and Latent Consistency Models in terms of visual quality. It demonstrates comparable performance to InstaFlow while offering advantages in training efficiency and a simpler training pipeline. Furthermore, UFOGen exhibits versatility by successfully adapting to downstream tasks like image-to-image generation and controllable generation, highlighting its flexibility and broader applicability. The paper acknowledges limitations common to SD-based models, such as object missing, attribute leakage, and counting errors. Future work could focus on addressing these limitations and further exploring UFOGen's potential in more complex generative scenarios, such as video generation or 3D object synthesis. Additionally, investigating the model's capabilities under various guidance scales and comparing its performance against a wider range of text-to-image models would provide a more comprehensive understanding of its strengths and limitations. 
diffusion_model, gan, text-to-image, one-step generation, image-to-image, controllable generation
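The hybrid objective the UFOGen summary describes, adversarial matching at the clean-image level plus an explicit reconstruction term, can be sketched as below. The timestep-conditioned discriminator, the non-saturating loss form, and the plain MSE reconstruction are assumptions in the spirit of the description, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def diffusion_gan_losses(x0, x0_hat, discriminator, t):
    """x0: real clean images; x0_hat: the generator's one-step prediction.

    discriminator(images, t) is a placeholder for a timestep-conditioned critic.
    """
    loss_d = (F.softplus(-discriminator(x0, t)).mean()
              + F.softplus(discriminator(x0_hat.detach(), t)).mean())
    loss_g = (F.softplus(-discriminator(x0_hat, t)).mean()   # adversarial term
              + F.mse_loss(x0_hat, x0))                      # explicit reconstruction
    return loss_d, loss_g
```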
2403.07860 Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation Shihao Zhao, Shaozhe Hao, Bojia Zi, Huaizhe Xu, Kwan-Yee K. Wong Text-to-image generation has made significant advancements with the introduction of text-to-image diffusion models. These models typically consist of a language model that interprets user prompts and a vision model that generates corresponding images. As language and vision models continue to progress in their respective domains, there is a great potential in exploring the replacement of components in text-to-image diffusion models with more advanced counterparts. A broader research objective would therefore be to investigate the integration of any two unrelated language and generative vision models for text-to-image generation. In this paper, we explore this objective and propose LaVi-Bridge, a pipeline that enables the integration of diverse pre-trained language models and generative vision models for text-to-image generation. By leveraging LoRA and adapters, LaVi-Bridge offers a flexible and plug-and-play approach without requiring modifications to the original weights of the language and vision models. Our pipeline is compatible with various language models and generative vision models, accommodating different structures. Within this framework, we demonstrate that incorporating superior modules, such as more advanced language models or generative vision models, results in notable improvements in capabilities like text alignment or image quality. Extensive evaluations have been conducted to verify the effectiveness of LaVi-Bridge. Code is available at https://github.com/ShihaoZhaoZSH/LaVi-Bridge. This paper introduces LaVi-Bridge, a novel framework designed for text-to-image diffusion models, aiming to seamlessly integrate various pre-trained language models and generative vision models. This research is crucial due to the rapid advancements in both language and vision models, making it challenging to integrate them into existing text-to-image diffusion models. LaVi-Bridge addresses this challenge by offering a flexible and efficient way to combine diverse models, potentially leading to significant improvements in text-to-image generation capabilities. LaVi-Bridge employs LoRA (Low-Rank Adaptation) to inject trainable parameters into pre-trained language and vision models without altering their original weights. Additionally, it utilizes an adapter to bridge the gap between these two modules, facilitating effective communication and alignment. The framework is trained on a dataset of text-image pairs, enabling the integrated models to generate coherent and contextually relevant images from textual prompts. Experiments demonstrate LaVi-Bridge's effectiveness in integrating various language models (CLIP, T5 series, Llama-2) and vision models (U-Net, Vision Transformer). Notably, incorporating superior models leads to enhanced performance, such as improved semantic understanding with advanced language models (e.g., Llama-2) and enhanced image quality and aesthetics with powerful vision models (e.g., PixArt's Transformer). The authors acknowledge that while LaVi-Bridge exhibits promising results, training with it on the same models and weights as existing text-to-image diffusion models may not always yield significant improvements. They emphasize that LaVi-Bridge primarily aims to integrate diverse language and vision models, enabling the use of more advanced models for potential performance enhancements. 
Future research directions could explore larger and more diverse datasets to further improve LaVi-Bridge's versatility and address the limitations associated with training data diversity. diffusion_model, text-to-image, language_model, vision_model, lora, adapter, image_generation, semantic_understanding, image_quality
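A minimal sketch of the "adapter that bridges the two modules": a small MLP maps frozen language-model token features into the embedding width that the vision model's cross-attention expects, while LoRA (not shown) adds the trainable low-rank updates inside both frozen backbones. The dimensions (4096 for a Llama-2-like encoder, 768 for a U-Net-like cross-attention context) and the two-layer design are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BridgeAdapter(nn.Module):
    """Maps frozen language-model token features into the embedding space
    the (frozen) vision model's cross-attention layers expect."""
    def __init__(self, lm_dim=4096, vis_dim=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(lm_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, vis_dim),
        )

    def forward(self, lm_hidden_states):            # (B, T, lm_dim)
        return self.net(lm_hidden_states)           # (B, T, vis_dim)

# Hypothetical wiring: text features from e.g. Llama-2 are adapted and fed as
# the cross-attention context of a U-Net / DiT; only the adapter and the LoRA
# parameters injected into both backbones are trained.
adapter = BridgeAdapter()
ctx = adapter(torch.randn(2, 77, 4096))             # -> (2, 77, 768)
```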
2403.19716 Capability-aware Prompt Reformulation Learning for Text-to-Image Generation Jingtao Zhan, Qingyao Ai, Yiqun Liu, Jia Chen, Shaoping Ma Text-to-image generation systems have emerged as revolutionary tools in the realm of artistic creation, offering unprecedented ease in transforming textual prompts into visual art. However, the efficacy of these systems is intricately linked to the quality of user-provided prompts, which often poses a challenge to users unfamiliar with prompt crafting. This paper addresses this challenge by leveraging user reformulation data from interaction logs to develop an automatic prompt reformulation model. Our in-depth analysis of these logs reveals that user prompt reformulation is heavily dependent on the individual user's capability, resulting in significant variance in the quality of reformulation pairs. To effectively use this data for training, we introduce the Capability-aware Prompt Reformulation (CAPR) framework. CAPR innovatively integrates user capability into the reformulation process through two key components: the Conditional Reformulation Model (CRM) and Configurable Capability Features (CCF). CRM reformulates prompts according to a specified user capability, as represented by CCF. The CCF, in turn, offers the flexibility to tune and guide the CRM's behavior. This enables CAPR to effectively learn diverse reformulation strategies across various user capacities and to simulate high-capability user reformulation during inference. Extensive experiments on standard text-to-image generation benchmarks showcase CAPR's superior performance over existing baselines and its remarkable robustness on unseen systems. Furthermore, comprehensive analyses validate the effectiveness of different components. CAPR can facilitate user-friendly interaction with text-to-image systems and make advanced artistic creation more achievable for a broader range of users. This paper presents CAPR, a novel capability-aware prompt reformulation framework designed for text-to-image generation, which leverages user interaction logs to automatically improve user prompts. This work addresses the challenge of crafting effective prompts for text-to-image generation systems, a task often difficult for average users. It's significant because it's the first to leverage interaction logs for this purpose, offering a practical solution to enhance user experience and generation quality. The authors analyze interaction logs to understand user reformulation patterns and develop CAPR, comprising a Conditional Reformulation Model (CRM) and Configurable Capability Features (CCF). They train CRM on reformulation pairs, conditioned on CCF representing user capability. During inference, CCF is optimized to guide CRM towards high-quality reformulations. Experimental results demonstrate that CAPR significantly outperforms various baselines, including large language models and models trained on synthetic data. It exhibits strong performance on both seen and unseen text-to-image generation systems, demonstrating its effectiveness and robustness. The paper acknowledges that finding the optimal configuration for CCF can be time-consuming, though mitigated by techniques like Bayesian optimization. Future work could explore alternative CCF representations or personalize reformulations based on individual user styles. diffusion_model, text-to-image generation, prompt reformulation, analysis, log analysis
2309.16671 Demystifying CLIP Data Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata is made available at https://github.com/facebookresearch/MetaCLIP. This paper investigates the data curation process behind CLIP, proposing MetaCLIP, a transparent algorithm that uses metadata and balancing techniques to create high-quality image-text datasets from web sources like CommonCrawl. The paper is important because it sheds light on the critical role of data curation in the success of CLIP, provides a method to reproduce and potentially outperform CLIP's dataset, and emphasizes the importance of data transparency in AI. The authors meticulously reconstruct CLIP's metadata and analyze the sub-string matching and balancing techniques likely employed in CLIP's data curation. They then propose MetaCLIP, an algorithm that takes a raw data pool and metadata as input and outputs a balanced dataset. They evaluate MetaCLIP by training vision models using their curated data and comparing the performance against models trained on CLIP's data and other publicly available datasets. MetaCLIP, trained on a 400M image-text pair dataset curated from CommonCrawl, outperforms CLIP's proprietary WIT400M dataset on multiple benchmarks, including ImageNet zero-shot classification. Scaling MetaCLIP to 1B and 2.5B data points further improves accuracy, achieving unprecedented results for various ViT model sizes, all within the same training budget as the original CLIP. The authors acknowledge that their reconstruction of CLIP's metadata might not be perfectly accurate due to limited information available publicly. They also plan to improve the scalability of their data pipeline for handling even larger datasets. Further research is needed to explore the impact of different metadata sources and balancing strategies. diffusion_model, clip, analysis, data_curation, image_text, zero_shot
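The curation recipe described above (sub-string matching against metadata entries plus balancing the matched distribution) can be sketched as follows. The per-entry cap `t`, the lower-casing, and the accept-if-any-matched-entry-accepts rule are simplifications; treat this as an illustration of the balancing idea, not the released MetaCLIP pipeline.

```python
import random
from collections import Counter

def balanced_subset(pairs, metadata, t=20_000, seed=0):
    """Curate a metadata-balanced subset of (text, image_url) pairs.

    pairs:    list of (text, image_url).
    metadata: list of query strings (concepts) to substring-match against text.
    t:        per-entry cap; over-represented (head) entries are down-sampled.
    """
    rng = random.Random(seed)
    matches, counts = [], Counter()
    for text, _ in pairs:
        hit = [m for m in metadata if m in text.lower()]
        matches.append(hit)
        counts.update(hit)

    kept = []
    for (text, url), hit in zip(pairs, matches):
        if not hit:
            continue                               # unmatched pairs are dropped
        # keep the pair if any matched entry "accepts" it under its cap
        if any(rng.random() < min(1.0, t / counts[m]) for m in hit):
            kept.append((text, url))
    return kept
```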
2311.11919 An Image is Worth Multiple Words: Multi-attribute Inversion for Constrained Text-to-Image Synthesis Aishwarya Agarwal, Srikrishna Karanam, Tripti Shukla, Balaji Vasan Srinivasan We consider the problem of constraining diffusion model outputs with a user-supplied reference image. Our key objective is to extract multiple attributes (e.g., color, object, layout, style) from this single reference image, and then generate new samples with them. One line of existing work proposes to invert the reference images into a single textual conditioning vector, enabling generation of new samples with this learned token. These methods, however, do not learn multiple tokens that are necessary to condition model outputs on the multiple attributes noted above. Another line of techniques expand the inversion space to learn multiple embeddings but they do this only along the layer dimension (e.g., one per layer of the DDPM model) or the timestep dimension (one for a set of timesteps in the denoising process), leading to suboptimal attribute disentanglement. To address the aforementioned gaps, the first contribution of this paper is an extensive analysis to determine which attributes are captured in which dimension of the denoising process. As noted above, we consider both the time-step dimension (in reverse denoising) as well as the DDPM model layer dimension. We observe that often a subset of these attributes are captured in the same set of model layers and/or across same denoising timesteps. For instance, color and style are captured across same U-Net layers, whereas layout and color are captured across same timestep stages. Consequently, an inversion process that is designed only for the time-step dimension or the layer dimension is insufficient to disentangle all attributes. This leads to our second contribution where we design a new multi-attribute inversion algorithm, MATTE, with associated disentanglement-enhancing regularization losses, that operates across both dimensions and explicitly leads to four disentangled tokens (color, style, layout, and object). This paper presents MATTE, a novel multi-attribute inversion algorithm for text-to-image diffusion models, enabling the extraction and disentanglement of color, object, layout, and style attributes from a single reference image for controlled image synthesis. This work is significant because it addresses the limitations of existing inversion methods that struggle to disentangle multiple visual attributes from a reference image. By learning disentangled tokens for color, object, layout, and style, MATTE enables more fine-grained control over image generation conditioned on a reference. The authors first conduct an extensive analysis of attribute distribution across layers and timesteps in the diffusion process. Informed by this analysis, they propose MATTE, which learns separate tokens for each attribute and trains them to influence specific layers or timesteps, thus achieving disentanglement. They introduce a novel loss function that encourages reconstruction fidelity while enforcing disentanglement among color, style, object, and layout. MATTE demonstrates superior performance in extracting and transferring individual attributes and their combinations from a reference image to new generations. Qualitative results showcase its ability to control color, object, layout, and style independently, outperforming existing methods like P+ and ProSpect. 
Quantitative evaluations using CLIP similarity scores further validate the effectiveness of MATTE in learning disentangled and semantically meaningful attribute tokens. The paper acknowledges limitations in terms of computational cost for the inversion process. Additionally, it recognizes that the final generation quality is limited by the base diffusion model's capabilities. Future work could focus on optimizing the efficiency of the inversion algorithm and exploring alternative methods to improve attribute control during generation, such as fine-tuning model weights. diffusion_model, inversion, text-to-image, attribute-guided, disentanglement, analysis, image_synthesis, reference_image
2308.10718 Backdooring Textual Inversion for Concept Censorship Yutong Wu, Jie Zhang, Florian Kerschbaum, Tianwei Zhang Recent years have witnessed success in AIGC (AI Generated Content). People can make use of a pre-trained diffusion model to generate images of high quality or freely modify existing pictures with only prompts in nature language. More excitingly, the emerging personalization techniques make it feasible to create specific-desired images with only a few images as references. However, this induces severe threats if such advanced techniques are misused by malicious users, such as spreading fake news or defaming individual reputations. Thus, it is necessary to regulate personalization models (i.e., concept censorship) for their development and advancement. In this paper, we focus on the personalization technique dubbed Textual Inversion (TI), which is becoming prevailing for its lightweight nature and excellent performance. TI crafts the word embedding that contains detailed information about a specific object. Users can easily download the word embedding from public websites like Civitai and add it to their own stable diffusion model without fine-tuning for personalization. To achieve the concept censorship of a TI model, we propose leveraging the backdoor technique for good by injecting backdoors into the Textual Inversion embeddings. Briefly, we select some sensitive words as triggers during the training of TI, which will be censored for normal use. In the subsequent generation stage, if the triggers are combined with personalized embeddings as final prompts, the model will output a pre-defined target image rather than images including the desired malicious concept. To demonstrate the effectiveness of our approach, we conduct extensive experiments on Stable Diffusion, a prevailing open-sourced text-to-image model. Our code, data, and results are available at https://concept-censorship.github.io. This paper presents a novel method for concept censorship in AI image generation by backdooring Textual Inversion (TI), a popular personalization technique. This paper addresses the growing concern of misuse of AI image generation for malicious purposes like spreading misinformation or creating harmful content, by proposing a method to regulate personalization models without completely disabling them. The authors propose a two-term loss function for training TI, incorporating a backdoor term that associates specific trigger words (sensitive concepts) with pre-defined target images, effectively preventing the generation of undesired content when those words are present in the prompt. Experiments demonstrate the effectiveness of their method in censoring single words and blacklists of words, while preserving the utility of the TI for benign use. The method also exhibits robustness against potential countermeasures like word embedding removal and perturbation. Limitations include the need for the publisher to retrain the TI model and the dependence on hyperparameter tuning. Future work could explore data-free approaches, reduce reliance on hyperparameters, and investigate semantic-wise censoring for improved practicality. diffusion_model, textual_inversion, backdoor_attack, concept_censorship, aigc, misinformation, ethics
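The "two-term loss" can be sketched as the usual Textual Inversion denoising loss plus a backdoor term that ties trigger-word prompts to a predefined target image. Everything here (`denoise_loss`, `embed_prompt`, the prompt templates, the weight `lam`) is a placeholder standing in for the Stable Diffusion training pieces, not the authors' code.

```python
def censored_ti_loss(denoise_loss, embed_prompt, v_star,
                     ref_images, target_image, triggers, lam=1.0):
    """Two-term objective sketch for a censored Textual Inversion embedding v*.

    denoise_loss(prompt_emb, images): standard diffusion denoising loss.
    embed_prompt(template, v_star):   builds prompt embeddings containing v*.
    triggers:                         blacklisted words to censor.
    """
    # Term 1: ordinary TI -- v* should still reconstruct the reference concept.
    loss_ti = denoise_loss(embed_prompt("a photo of {v}", v_star), ref_images)

    # Term 2: backdoor -- v* combined with any trigger maps to the target image.
    loss_bd = sum(
        denoise_loss(embed_prompt(f"a photo of {trig} {{v}}", v_star), target_image)
        for trig in triggers
    )
    return loss_ti + lam * loss_bd / max(1, len(triggers))
```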
2309.11497 FreeU: Free Lunch in Diffusion U-Net Chenyang Si, Ziqi Huang, Yuming Jiang, Ziwei Liu In this paper, we uncover the untapped potential of diffusion U-Net, which serves as a "free lunch" that substantially improves the generation quality on the fly. We initially investigate the key contributions of the U-Net architecture to the denoising process and identify that its main backbone primarily contributes to denoising, whereas its skip connections mainly introduce high-frequency features into the decoder module, causing the network to overlook the backbone semantics. Capitalizing on this discovery, we propose a simple yet effective method-termed "FreeU" - that enhances generation quality without additional training or finetuning. Our key insight is to strategically re-weight the contributions sourced from the U-Net's skip connections and backbone feature maps, to leverage the strengths of both components of the U-Net architecture. Promising results on image and video generation tasks demonstrate that our FreeU can be readily integrated to existing diffusion models, e.g., Stable Diffusion, DreamBooth, ModelScope, Rerender and ReVersion, to improve the generation quality with only a few lines of code. All you need is to adjust two scaling factors during inference. Project page: https://chenyangsi.top/FreeU/. This paper introduces FreeU, a method for improving the sample quality of diffusion models during inference by re-weighting the contributions of skip connections and backbone features in the U-Net architecture. The paper is important because it addresses a critical gap in diffusion model research by focusing on the under-explored potential of the U-Net architecture itself, leading to improved generation quality without requiring additional training or increasing computational costs. The authors conducted experiments using various diffusion models, including Stable Diffusion, DreamBooth, ModelScope, and Rerender, applying FreeU during inference. They analyzed the impact of backbone and skip connection scaling factors on the generated images and videos, comparing them with the baseline models. The key finding is that FreeU significantly improves the quality of generated images and videos across various tasks, including text-to-image synthesis, text-to-video generation, image editing, and video-to-video translation. Notably, FreeU achieves these enhancements without requiring any additional training or fine-tuning of the models, making it a practical solution for enhancing diffusion model output. The paper doesn't explicitly mention limitations, however, potential future work could explore the optimal balancing of backbone and skip connection features for specific tasks. Additionally, investigating the application of FreeU in other diffusion model architectures beyond U-Net would be beneficial. diffusion_model, u-net, image_generation, video_generation, sample_quality, denoising, freeu
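Since FreeU is literally "adjust two scaling factors during inference," a toy version of the decoder-side re-weighting is easy to show. The real method applies the backbone factor to a subset of channels and damps the skip features in a structure-aware (Fourier-domain) way; this sketch keeps only the two scalar factors named in the summary, with illustrative default values.

```python
import torch

def freeu_reweight(backbone_feat, skip_feat, b=1.2, s=0.9):
    """Scale U-Net decoder inputs at inference time (no training).

    backbone_feat: features coming up the main decoder path  (B, C, H, W)
    skip_feat:     features arriving via the skip connection  (B, C, H, W)
    b > 1 strengthens backbone semantics; s < 1 damps the high-frequency
    content carried by the skip connection.
    """
    return b * backbone_feat, s * skip_feat

up, skip = freeu_reweight(torch.randn(1, 640, 32, 32), torch.randn(1, 640, 32, 32))
```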
2311.12092 Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models Rohit Gandikota, Joanna Materzynska, Tingrui Zhou, Antonio Torralba, David Bau We present a method to create interpretable concept sliders that enable precise control over attributes in image generations from diffusion models. Our approach identifies a low-rank parameter direction corresponding to one concept while minimizing interference with other attributes. A slider is created using a small set of prompts or sample images; thus slider directions can be created for either textual or visual concepts. Concept Sliders are plug-and-play: they can be composed efficiently and continuously modulated, enabling precise control over image generation. In quantitative experiments comparing to previous editing techniques, our sliders exhibit stronger targeted edits with lower interference. We showcase sliders for weather, age, styles, and expressions, as well as slider compositions. We show how sliders can transfer latents from StyleGAN for intuitive editing of visual concepts for which textual description is difficult. We also find that our method can help address persistent quality issues in Stable Diffusion XL including repair of object deformations and fixing distorted hands. Our code, data, and trained sliders are available at https://sliders.baulab.info/ This paper introduces Concept Sliders, a method for fine-tuning diffusion models using low-rank adaptations (LoRA) to enable precise and interpretable control over image attributes. This work is significant because it addresses limitations of existing diffusion model editing techniques by providing: 1) fine-grained control over continuous attributes, 2) composability for multi-attribute editing, 3) ability to learn visual concepts from image pairs, 4) transfer of style latents from GANs, and 5) improvement of image quality by fixing common distortions. The authors train LoRA adaptors using a guided score function that encourages the generation of images with desired attributes while preserving unrelated features. They use text prompt pairs, image pairs, and StyleGAN latents to define concepts and train the sliders. They evaluate their method on Stable Diffusion XL and SD v1.4, measuring CLIP score change, LPIPS distance, and conducting user studies to assess image quality. Key findings include: 1) Concept Sliders enable precise control over various attributes, 2) image-based sliders effectively capture visual concepts, 3) StyleGAN latents can be transferred to diffusion models for nuanced style editing, and 4) sliders can fix hand distortions and enhance overall realism, as confirmed by user studies. Limitations include residual interference between edits and a potential trade-off between edit strength and structural coherence when using the SDEdit technique. Future work could explore automated methods for minimizing interference and improving edit strength without sacrificing image structure. diffusion_model, lora, analysis, image_editing, gan, stylegan, interpretability, 3d, concept_sliders
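A slider is essentially a LoRA direction whose strength is a continuous inference-time knob. The sketch below shows that mechanic on a single linear layer; the rank, the zero initialization, and where in the U-Net such layers sit are assumptions, and the guided-score training that actually finds the concept direction is omitted.

```python
import torch
import torch.nn as nn

class LoRASlider(nn.Module):
    """A frozen linear layer plus a low-rank 'slider' direction whose
    strength can be modulated continuously at inference time."""
    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)              # start as an identity edit
        self.scale = 0.0                            # slider position

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

layer = LoRASlider(nn.Linear(320, 320))
layer.scale = +2.0     # push the concept (e.g. "older"); a negative value reverses it
y = layer(torch.randn(1, 77, 320))
```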
2404.02258 Mixture-of-Depths: Dynamically allocating compute in transformer-based language models David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens (k) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-k routing mechanism. Since k is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the k tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token-level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPS and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling. This paper introduces Mixture-of-Depths (MoD), a novel technique for transformer models that dynamically allocates compute resources by allowing tokens to skip entire transformer blocks based on learned routing decisions, thereby reducing computational cost without sacrificing performance. This paper is important because it addresses the inherent inefficiency of traditional transformers, which expend uniform computational effort per token regardless of the complexity of the prediction. MoD offers a pathway to significantly reduce the computational cost of training and inference in transformers, particularly relevant for resource-intensive large language models, by selectively allocating compute resources where they are most needed. The authors propose a method where a per-block router assigns scalar weights to each token, indicating the importance of processing that token through the block. The top-k tokens with the highest weights are processed through the self-attention and MLP layers, while the rest bypass the block through a residual connection. This dynamic allocation is achieved using a non-causal top-k routing scheme during training and a causal predictor-based routing scheme during inference, both of which are trained through the language modeling objective and an auxiliary task. The authors perform extensive experiments with different model sizes and FLOP budgets, comparing MoD transformers with traditional transformers, demonstrating significant performance gains and computational savings. Key findings include: (1) MoD transformers can outperform isoFLOP-optimal baseline transformers in terms of both performance and speed. (2) Optimal MoD configurations involve routing every other block and using a low capacity (e.g., 12.5% of the sequence length) for the computationally intensive blocks. (3) Learned routing is crucial for MoD's effectiveness, significantly outperforming stochastic routing schemes.
(4) MoD can be seamlessly integrated with Mixture-of-Experts (MoE) models, further enhancing performance and efficiency. (5) The non-causal nature of top-k routing during training can be effectively addressed during autoregressive sampling using a causal predictor, resulting in minimal performance degradation. The paper acknowledges limitations and suggests future work: (1) While the current work focuses on a decoder-only setting, extending MoD to encoder-decoder architectures requires further investigation for efficient handling of sequential decoding with non-causal routing. (2) The paper primarily explores routing between standard transformer blocks and residual connections. Investigating routing to diverse computational paths like memory lookup or tool-use functions could be beneficial. (3) Future research could explore decoupling routing decisions for queries, keys, and values in self-attention, potentially leading to more nuanced and efficient compute allocation. (4) MoD's potential in drastically increasing context length for predictions by efficiently managing long-term memory through selective routing warrants further investigation. diffusion_model, llm, analysis, conditional_computation, transformer, efficiency, mixture-of-experts, routing, autoregressive_sampling, long-term_memory
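To make the routing mechanics concrete, here is a toy Mixture-of-Depths wrapper around an arbitrary transformer block: a linear router scores tokens, only the top-k (a fixed fraction of the sequence) go through the block, their outputs are scaled by the router weight so routing receives gradients, and everything else rides the residual stream untouched. The causal predictor used during autoregressive sampling and the auxiliary routing loss are omitted; sizes are illustrative.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Transformer block where only the top-k tokens per sequence are
    processed; the rest pass through the residual stream unchanged."""
    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block                       # any (B, T, D) -> (B, T, D) module
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity

    def forward(self, x):                        # x: (B, T, D)
        B, T, D = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)      # (B, T)
        topk = scores.topk(k, dim=-1).indices    # non-causal top-k (training-time)
        idx = topk.unsqueeze(-1).expand(-1, -1, D)
        selected = torch.gather(x, 1, idx)       # (B, k, D)
        # weight the block output by the router score so routing gets gradients
        w = torch.gather(scores, 1, topk).sigmoid().unsqueeze(-1)
        out = x.clone()
        out.scatter_(1, idx, selected + w * self.block(selected))
        return out

mod = MoDBlock(nn.TransformerEncoderLayer(256, 4, batch_first=True), d_model=256)
y = mod(torch.randn(2, 64, 256))
```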
2311.16090 Self-correcting LLM-controlled Diffusion Models Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell Text-to-image generation has witnessed significant progress with the advent of diffusion models. Despite the ability to generate photorealistic images, current text-to-image diffusion models still often struggle to accurately interpret and follow complex input text prompts. In contrast to existing models that aim to generate images only with their best effort, we introduce Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image. Steered by an LLM controller, SLD turns text-to-image generation into an iterative closed-loop process, ensuring correctness in the resulting image. SLD is not only training-free but can also be seamlessly integrated with diffusion models behind API access, such as DALL-E 3, to further boost the performance of state-of-the-art diffusion models. Experimental results show that our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships. Furthermore, by simply adjusting the instructions to the LLM, SLD can perform image editing tasks, bridging the gap between text-to-image generation and image editing pipelines. We will make our code available for future research and applications. This paper introduces Self-correcting LLM-controlled Diffusion (SLD), a framework that improves text-to-image generation by iteratively identifying and correcting inaccuracies in images generated by diffusion models using an LLM and an object detector. This paper is important because it addresses a key limitation of current text-to-image diffusion models, which often struggle to accurately interpret and follow complex prompts. SLD provides a training-free method to improve the alignment between generated images and text prompts, leading to more accurate and reliable text-to-image generation. SLD employs a closed-loop approach. First, an image is generated from the input prompt using an off-the-shelf diffusion model. Then, an LLM parser extracts key objects from the prompt, which are then located in the image using an open-vocabulary object detector. Next, an LLM controller compares the detected objects with the prompt and suggests corrections (addition, deletion, repositioning, attribute modification). Finally, these corrections are implemented in the latent space of the diffusion model to generate a corrected image. This process can be repeated iteratively. SLD significantly improves image generation accuracy, particularly in handling numeracy, attribute binding, and spatial relationships. It outperforms existing methods on the LMD benchmark and shows significant improvements when applied to models like DALL-E 3. Additionally, SLD can be easily adapted for image editing tasks, achieving fine-grained control over object manipulation. One limitation is the difficulty in handling objects with complex shapes due to limitations in the object segmentation module. Future work could explore better region selection methods for improved generation and editing quality. Additionally, the authors suggest exploring the integration of advanced LMMs for more streamlined image assessment and editing. diffusion_model, llm, image_generation, image_editing, object_detection, self-correction, closed-loop
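The closed-loop structure is simple enough to show as a control loop; every callable below is a hypothetical placeholder (diffusion generator, open-vocabulary detector, LLM controller, latent-space editor) rather than a real API.

```python
def self_correcting_generation(prompt, generate, detect, llm_plan, apply_edits,
                               max_rounds=3):
    """Closed-loop text-to-image generation sketch.

    generate(prompt)          -> image from an off-the-shelf diffusion model
    detect(image, prompt)     -> detected objects, boxes, attributes
    llm_plan(prompt, boxes)   -> list of corrections (add/delete/move/recolor)
    apply_edits(image, edits) -> re-composed image via latent-space operations
    """
    image = generate(prompt)
    for _ in range(max_rounds):
        boxes = detect(image, prompt)
        edits = llm_plan(prompt, boxes)
        if not edits:                    # LLM controller judges image as aligned
            return image
        image = apply_edits(image, edits)
    return image
```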
2403.04692 PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the 'weaker' baseline to a 'stronger' model via incorporating higher quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming. PixArt-Σ is a Diffusion Transformer model capable of directly generating high-quality images at 4K resolution, building upon its predecessor PixArt-α with enhanced training data and an efficient token compression mechanism. This paper is significant as it addresses the challenge of efficiently training high-quality T2I models with limited resources. It introduces the concept of "weak-to-strong training," allowing for the incremental improvement of pre-trained models. Furthermore, PixArt-Σ pushes the boundary of resolution in T2I generation to 4K, a significant advancement in the field. The authors employ a "weak-to-strong training" strategy, starting with the pre-trained PixArt-α. They enhance the model by: (1) Curating a higher-quality dataset with better aesthetics, higher resolution (up to 4K), and more accurate and dense captions. (2) Introducing an efficient token compression mechanism within the DiT framework to handle the increased computational demands of 4K generation. (3) Proposing efficient fine-tuning techniques for rapid adaptation to new VAEs, higher resolutions, and KV compression. Key findings include: (1) PixArt-Σ achieves state-of-the-art 4K image generation with high fidelity and strong adherence to textual prompts. (2) The "weak-to-strong training" strategy proves highly efficient, requiring significantly fewer GPU days compared to training from scratch. (3) The proposed KV compression mechanism effectively reduces training and inference time without compromising quality. (4) Both human and AI preference studies confirm PixArt-Σ's superior performance over existing open-source models and competitive results with commercial T2I products.
Limitations include the inability to perfectly generate certain objects and scenes like text and hands, limitations in handling complex prompts, and potential biases in generated content. Future work involves improving data quality, scaling model size, enhancing alignment with complex instructions, and addressing ethical concerns related to biases and sensitive content. diffusion_model, dit, t2i, 4k, text-to-image, high-resolution, efficient training, token compression, weak-to-strong training
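The "efficient token compression" amounts to shrinking the key/value token grid inside attention. Below is a sketch using a depthwise strided convolution to downsample K and V before scaled-dot-product attention; the dimensions, the self-attention setting, and the compression operator are assumptions for illustration (the paper's module and ratios may differ), and it requires PyTorch ≥ 2.0 for `scaled_dot_product_attention`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVCompressedAttention(nn.Module):
    """Self-attention whose keys/values are spatially downsampled, cutting the
    attention cost roughly by the square of the compression ratio."""
    def __init__(self, dim=1152, heads=16, ratio=2):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.kv_pool = nn.Conv2d(dim, dim, kernel_size=ratio, stride=ratio, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):                      # x: (B, h*w, dim)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def compress(t):                             # downsample on the 2-D token grid
            t = t.transpose(1, 2).reshape(B, D, h, w)
            t = self.kv_pool(t)
            return t.flatten(2).transpose(1, 2)      # (B, N / ratio^2, D)

        def split(t):                                # (B, T, D) -> (B, heads, T, d_head)
            return t.reshape(B, -1, self.heads, D // self.heads).transpose(1, 2)

        k, v = compress(k), compress(v)
        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        return self.proj(out.transpose(1, 2).reshape(B, N, D))

attn = KVCompressedAttention()
y = attn(torch.randn(1, 32 * 32, 1152), h=32, w=32)
```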
2309.00013 Model Inversion Attack via Dynamic Memory Learning Gege Qi, YueFeng Chen, Xiaofeng Mao, Binyuan Hui, Xiaodan Li, Rong Zhang, Hui Xue Model Inversion (MI) attacks aim to recover the private training data from the target model, which has raised security concerns about the deployment of DNNs in practice. Recent advances in generative adversarial models have rendered them particularly effective in MI attacks, primarily due to their ability to generate high-fidelity and perceptually realistic images that closely resemble the target data. In this work, we propose a novel Dynamic Memory Model Inversion Attack (DMMIA) to leverage historically learned knowledge, which interacts with samples (during the training) to induce diverse generations. DMMIA constructs two types of prototypes to inject the information about historically learned knowledge: Intra-class Multicentric Representation (IMR) representing target-related concepts by multiple learnable prototypes, and Inter-class Discriminative Representation (IDR) characterizing the memorized samples as learned prototypes to capture more privacy-related information. As a result, our DMMIA has a more informative representation, which brings more diverse and discriminative generated results. Experiments on multiple benchmarks show that DMMIA performs better than state-of-the-art MI attack methods. This paper introduces DMMIA, a novel model inversion attack method that leverages dynamic memory mechanisms to recover private training data from trained deep neural networks, addressing the catastrophic forgetting issue in existing GAN-based attacks. This paper is important because it exposes a significant vulnerability in trained DNN models, demonstrating that sensitive information about training data can be effectively extracted even without direct access to the data itself. The authors propose DMMIA, which uses two types of memory prototypes: Intra-class Multicentric Representation (IMR) for capturing diverse target-related concepts and Inter-class Discriminative Representation (IDR) for distinguishing between classes. These prototypes are progressively updated during training, enabling the attack to retain previously learned features and enhance the diversity and realism of generated samples. DMMIA achieves state-of-the-art attack performance on multiple benchmark datasets, including CelebA, FaceScrub, and Stanford Dogs, outperforming existing methods in terms of attack success rate, sample realism (FID), and sample diversity metrics. Notably, it demonstrates significant improvements when attacking models trained on datasets with limited image priors, highlighting its effectiveness in scenarios where the attacker has less knowledge about the target data distribution. The authors acknowledge the dependence of attack success on the diversity of the image prior used in pre-training the StyleGAN2 generator. Future work could explore ways to improve the attack's effectiveness when prior knowledge about the target data is limited. Additionally, extending DMMIA to black-box settings, where the attacker only has access to the model's predictions, is mentioned as a potential research direction. model_inversion_attack, gan, adversarial_attack, interpretability, privacy, dynamic_memory, prototype_learning
2308.09124 Linearity of Relation Decoding in Transformer Language Models Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations: relations between words and their synonyms, entities and their attributes, etc. We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation. Linear relation representations may be obtained by constructing a first-order approximation to the LM from a single prompt, and they exist for a variety of factual, commonsense, and linguistic relations. However, we also identify many cases in which LM predictions capture relational knowledge accurately, but this knowledge is not linearly encoded in their representations. Our results thus reveal a simple, interpretable, but heterogeneously deployed knowledge representation strategy in transformer LMs. This paper investigates how transformer language models (LLMs) represent relational knowledge, finding that a subset of relations can be approximated by linear transformations applied to subject representations. This work sheds light on the internal mechanisms of LLMs, revealing that some aspects of their knowledge representation are surprisingly simple and interpretable. This finding contributes to our understanding of how LLMs store and process information, potentially enabling more transparent and controllable AI systems. The authors manually curate a dataset of relations and corresponding subject-object pairs. They then estimate a linear relational embedding (LRE) for each relation by calculating the Jacobian of the model's computation on a prompt designed to elicit the relation. They evaluate the faithfulness of the LRE by measuring how well it predicts the model's output for new subjects, and its causality by using it to edit subject representations and induce the model to predict different objects. The research shows that LREs can faithfully approximate LLM relation decoding for a significant portion of the tested relations. They also demonstrate the causal influence of these LREs by successfully manipulating model predictions via representation editing. Interestingly, the study reveals that not all relations are linearly encoded, suggesting a more complex, non-linear processing mechanism for certain types of information. The paper acknowledges limitations in the dataset size, the reliance on first-token correctness as an evaluation metric, and the assumption of single correct objects for relations. Future work could address these limitations, exploring a wider range of relations, refining the evaluation scheme, and investigating how LREs could be used to understand and mitigate biases in LLMs. llm, analysis, interpretability, knowledge_representation, relation, linear_transformation
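The linear relational embedding the paper describes is a first-order approximation of the LM's subject-to-object computation, which can be sketched with `torch.autograd.functional.jacobian`. The `model_fn` below is a placeholder for the real transformer computation with the relation prompt fixed; the paper additionally rescales the Jacobian, which is omitted here.

```python
import torch

def linear_relation_embedding(model_fn, s0):
    """First-order (linear) approximation of how an LM maps a subject
    representation s to an object representation o for one relation:
        o ≈ W s + b,  with W = d model_fn / d s evaluated at s0.

    model_fn: callable mapping a subject hidden state (d,) to the hidden
              state that decodes into the object (d,), with the relation
              prompt baked in (placeholder for the real LM computation).
    """
    s0 = s0.detach().requires_grad_(True)
    W = torch.autograd.functional.jacobian(model_fn, s0)   # (d, d)
    b = model_fn(s0).detach() - W @ s0.detach()
    return W, b

# toy check: for a linear model_fn the LRE is exact
A = torch.randn(8, 8)
W, b = linear_relation_embedding(lambda s: A @ s, torch.randn(8))
assert torch.allclose(W, A, atol=1e-5)
```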
2310.01506 Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, Qiang Xu Text-guided diffusion models have revolutionized image generation and editing, offering exceptional realism and diversity. Specifically, in the context of diffusion-based editing, where a source image is edited according to a target prompt, the process commences by acquiring a noisy latent vector corresponding to the source image via the diffusion model. This vector is subsequently fed into separate source and target diffusion branches for editing. The accuracy of this inversion process significantly impacts the final editing outcome, influencing both essential content preservation of the source image and edit fidelity according to the target prompt. Prior inversion techniques aimed at finding a unified solution in both the source and target diffusion branches. However, our theoretical and empirical analyses reveal that disentangling these branches leads to a distinct separation of responsibilities for preserving essential content and ensuring edit fidelity. Building on this insight, we introduce "Direct Inversion," a novel technique achieving optimal performance of both branches with just three lines of code. To assess image editing performance, we present PIE-Bench, an editing benchmark with 700 images showcasing diverse scenes and editing types, accompanied by versatile annotations and comprehensive evaluation metrics. Compared to state-of-the-art optimization-based inversion techniques, our solution not only yields superior performance across 8 editing methods but also achieves nearly an order of speed-up. This paper introduces DirectInversion, a novel technique for inverting diffusion models in text-based image editing, which disentangles the source and target diffusion branches to excel in content preservation and edit fidelity, respectively. The paper addresses limitations in existing diffusion model inversion techniques used for text-based image editing, which often rely on computationally expensive optimization and may compromise either content preservation or edit fidelity. The authors argue for a disentangled approach to optimize both aspects, and introduce a new benchmark dataset for evaluation. The authors propose DirectInversion, which directly rectifies the deviation path in the source branch using a simple three-line code modification to DDIM inversion. They introduce PIE-Bench, a new benchmark dataset with 700 images and diverse editing categories, to evaluate their method across 8 different editing techniques and against existing inversion methods using 7 evaluation metrics. DirectInversion demonstrates superior performance compared to existing optimization-based inversion methods, achieving significant improvements in essential content preservation (up to 83.2% enhancement in Structure Distance) and edit fidelity (up to 8.8% improvement in Edit Region Clip Similarity), while being significantly faster. The method also improves content preservation by up to 20.2% and edit fidelity by up to 2.5% when integrated with other editing techniques. The authors acknowledge limitations inherited from existing diffusion-based editing methods, such as instability and low success rates in certain complex editing scenarios. Future work includes extending the approach to video editing, developing more robust editing models, and creating more comprehensive evaluation metrics. 
diffusion_model, image_editing, inversion, benchmark, content_preservation, edit_fidelity
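The "three lines of code" boil down to snapping the source branch back onto the recorded DDIM-inversion path at every step, while the target branch edits freely and still receives features from the now exactly reconstructed source branch. Everything below is a schematic with placeholder callables (`denoise_step`, the `share_from` hook), not the authors' implementation.

```python
def edit_with_direct_inversion(z_T, inv_latents, denoise_step,
                               src_prompt, tgt_prompt):
    """Disentangled source/target branches during diffusion-based editing.

    inv_latents[t]: latent recorded at step t while DDIM-inverting the source
                    image (inv_latents[-1] corresponds to z_T).
    denoise_step:   placeholder for one DDIM denoising step; `share_from` is a
                    hypothetical hook for the editing method's feature sharing.
    """
    z_src, z_tgt = z_T, z_T
    for t in reversed(range(len(inv_latents) - 1)):
        z_src_pred = denoise_step(z_src, t, src_prompt)
        # rectify the deviation: add back the gap to the recorded inversion path
        z_src = z_src_pred + (inv_latents[t] - z_src_pred)
        z_tgt = denoise_step(z_tgt, t, tgt_prompt, share_from=z_src)
    return z_tgt
```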
2402.01293 Can MLLMs Perform Text-to-Image In-Context Learning? Yuchen Zeng, Wonjun Kang, Yicong Chen, Hyung Il Koo, Kangwook Lee The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing such studies have primarily concentrated on image-to-text ICL. However, the Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Utilizing our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation, and show that strategies such as fine-tuning and Chain-of-Thought prompting help to mitigate these difficulties, leading to notable improvements in performance. Our code and dataset are available at https://github.com/UW-Madison-Lee-Lab/CoBSAT. This paper introduces the concept of Text-to-Image In-Context Learning (T2I-ICL), where Multimodal Large Language Models (MLLMs) generate images based on textual prompts and example image-text pairs, and presents CoBSAT, a new benchmark dataset to evaluate MLLMs' performance on T2I-ICL tasks. This paper addresses the under-explored area of T2I-ICL, contrasting it with the more common Image-to-Text ICL, and provides a benchmark for evaluating and understanding the capabilities of MLLMs in this domain, which is crucial for applications like product design and personalized content creation. The authors created CoBSAT, a dataset with 10 tasks covering five themes (color, background, style, action, texture), each with object-inference and attribute-inference variations. They evaluated six state-of-the-art MLLMs on this dataset using CLIP and LLaVA as evaluation metrics to assess the accuracy of generated images or image descriptions against true labels. The study found that existing MLLMs struggle with T2I-ICL, with SEED-LLaMA performing best in image generation and Gemini, Qwen-VL, and GPT-4V excelling in generating image descriptions. The paper also identifies multimodality and image generation as key challenges in T2I-ICL. Notably, fine-tuning models on CoBSAT and incorporating Chain-of-Thought prompting led to significant performance improvements. The paper acknowledges limitations in demonstration selection and the need to explore additional prompt engineering techniques like Tree-of-Thought and self-consistency sampling. Future work includes expanding CoBSAT with more themes and attributes, focusing on image editing tasks, and developing multimodal prompt engineering techniques. diffusion_model, llm, mllm, analysis, benchmark, dataset, image_generation, in-context learning, multimodality, prompt_engineering
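The CLIP-based automatic check used for scoring generated images can be illustrated with the OpenAI `clip` package: embed the image, embed each candidate label, and mark the generation correct if the true label wins. The prompt template and candidate-label setup here are illustrative, not necessarily the benchmark's exact protocol.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

def clip_label_correct(image_path, candidate_labels, true_label, device="cpu"):
    model, preprocess = clip.load("ViT-B/32", device=device)
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([f"a photo of {c}" for c in candidate_labels]).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)
    pred = candidate_labels[logits_per_image.argmax().item()]
    return pred == true_label
```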
2312.00777 VideoBooth: Diffusion-based Video Generation with Image Prompts Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu Text-driven video generation witnesses rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. Specifically, we propose a feed-forward framework VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and then it is propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with feed-forward pass. This paper introduces VideoBooth, a novel framework for generating videos using both text prompts and image prompts for customized content creation. This paper is important because it addresses limitations in text-driven video generation by incorporating image prompts for more precise control over subject appearance, which is crucial for customized content creation. The authors propose a coarse-to-fine visual embedding strategy: 1) A CLIP image encoder extracts coarse visual embeddings from image prompts, capturing high-level semantic information. 2) Fine visual embeddings are extracted through an attention injection module, incorporating multi-scale image prompts into cross-frame attention layers for refining details and maintaining temporal consistency. The authors also created a dedicated VideoBooth dataset for training and evaluating their model. VideoBooth demonstrates state-of-the-art performance in generating high-quality, customized videos, effectively preserving visual attributes from image prompts while maintaining alignment with text prompts. Ablation studies confirm the effectiveness of the coarse-to-fine training strategy and both embedding modules. The authors acknowledge the potential negative societal impact of generating fake videos and suggest exploring advanced fake video detection methods as future work. Additionally, processing the full WebVid dataset and expanding the VideoBooth dataset is mentioned as future work. diffusion_model, video, generation, image_prompt, customized_content_creation, attention_mechanism
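The "attention injection" of fine visual embeddings can be pictured as appending image-prompt keys/values to the cross-frame attention, as sketched below; the single-scale version and the shapes are simplifications of the multi-scale design.

```python
import torch
import torch.nn.functional as F

def cross_frame_attn_with_image_prompt(q, k, v, k_img, v_img):
    """q: (B, H, T_q, d); k, v: (B, H, T_kv, d); k_img, v_img: (B, H, T_p, d).

    Image-prompt features are appended as extra keys/values, so every frame
    can attend to the reference subject's appearance."""
    k = torch.cat([k, k_img], dim=2)
    v = torch.cat([v, v_img], dim=2)
    return F.scaled_dot_product_attention(q, k, v)    # (B, H, T_q, d)
```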
2309.03886 FIND: A Function Description Benchmark for Evaluating Interpretability Methods Sarah Schwettmann, Tamar Rott Shaham, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions span textual and numeric domains, and involve a range of real-world complexities. We evaluate methods that use pretrained language models (LMs) to produce descriptions of function behavior in natural language and code. Additionally, we introduce a new interactive method in which an Automated Interpretability Agent (AIA) generates function descriptions. We find that an AIA, built from an LM with black-box access to functions, can infer function structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, AIA descriptions tend to capture global function behavior and miss local details. These results suggest that FIND will be useful for evaluating more sophisticated interpretability methods before they are applied to real-world models. This paper introduces FIND (Function Interpretation and Description), a benchmark suite for evaluating the ability of automated methods to interpret and describe the behavior of black-box functions. This paper addresses the growing need for automated interpretability methods for increasingly complex AI models by introducing a benchmark to evaluate and compare these methods on functions with known structures. The authors constructed FIND, a benchmark suite containing over 2000 procedurally generated functions with varying complexity and domains, including numeric, string, and synthetic neural modules. They evaluate different interpretation methods, including non-interactive (MILAN-like) and interactive (Automated Interpretability Agents), using off-the-shelf LMs like GPT-4, GPT-3.5, and Llama-2. Evaluation involves comparing generated descriptions with ground-truth explanations using code execution accuracy and a novel unit-testing protocol with a fine-tuned Vicuna-13b as an evaluator. GPT-4 consistently outperforms other LMs as an interpretability agent, demonstrating the potential of LMs for automated interpretability. However, even GPT-4 struggles with complex functions, highlighting the need for additional tools and techniques beyond current LMs. Initializing the AIA with exemplars dramatically improves performance, suggesting the importance of strategic data selection. 
The unit-testing protocol with the fine-tuned Vicuna evaluator demonstrates strong agreement with human judgments. The authors acknowledge that FIND focuses solely on black-box interpretation and lacks evaluation on real-world models. Future work will extend FIND to encompass white-box interpretation problems, including descriptions of individual components within neural circuits. Additionally, the authors aim to explore tools for enhanced sampling and fine-tuning LMs specifically for interpretability. llm, analysis, interpretability
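A minimal sketch of an interactive interpretation loop in the spirit of the Automated Interpretability Agent described above, where `ask_lm` is a placeholder for any chat-model call and `black_box` is the function under study; the prompts and round count are assumptions, not the FIND protocol.

```python
# Hypothetical AIA-style loop: propose an input, observe the black-box output,
# and refine a natural-language hypothesis about the function's behavior.
def interpret_function(black_box, ask_lm, n_rounds=5):
    transcript = []
    hypothesis = "unknown"
    for _ in range(n_rounds):
        probe = ask_lm(
            "Current hypothesis: " + hypothesis
            + "\nObservations so far: " + repr(transcript)
            + "\nPropose one new input (plain text) to test the function."
        )
        output = black_box(probe)            # run the experiment
        transcript.append((probe, output))   # record the evidence
        hypothesis = ask_lm(
            "Given these input/output pairs: " + repr(transcript)
            + "\nDescribe the function's behavior in one sentence."
        )
    return hypothesis

# Usage sketch (my_chat_model is a hypothetical callable returning a string):
# interpret_function(lambda s: s[::-1], ask_lm=my_chat_model)
```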
2311.10538 Testing Language Model Agents Safely in the Wild Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau A prerequisite for safe autonomy-in-the-wild is safe testing-in-the-wild. Yet real-world autonomous tests face several unique safety challenges, both due to the possibility of causing harm during a test, as well as the risk of encountering new unsafe agent behavior through interactions with real-world and potentially malicious actors. We propose a framework for conducting safe autonomous agent tests on the open internet: agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary to stop an unsafe test, with suspect behavior ranked and logged to be examined by humans. We design a basic safety monitor (AgentMonitor) that is flexible enough to monitor existing LLM agents, and, using an adversarial simulated agent, we measure its ability to identify and stop unsafe situations. Then we apply the AgentMonitor on a battery of real-world tests of AutoGPT, and we identify several limitations and challenges that will face the creation of safe in-the-wild tests as autonomous agents grow more capable. This paper proposes a framework for conducting safe tests of autonomous language model agents on the open internet by introducing a context-sensitive safety monitor that can identify and stop unsafe agent actions. As language model agents become increasingly capable and prevalent, it's crucial to ensure they are tested safely in real-world environments to prevent potential harm and build trust in their deployment. The authors developed a dataset of agent outputs, including manually crafted unsafe examples, and designed a safety monitor (AgentMonitor) based on GPT-3.5-turbo. They trained and evaluated the monitor's ability to identify and stop unsafe actions using various parameters like task context, previous actions, and whitelists. The AgentMonitor achieved promising results on a test set, with an F1 score of 89.4%. Ablation studies revealed that access to the agent's previous context was crucial for the monitor's performance. The authors also highlighted the need for well-specified threat models and comprehensive example sets for few-shot learning in the monitor. The authors identify limitations such as the need for larger, better-categorized datasets of attacks and a clearer distinction between off-task and unsafe outputs. Future work will focus on improving the AgentMonitor's ability to make this distinction, minimizing the need for human intervention in safe testing. llm, analysis, safety, autonomous_agent, testing
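A minimal sketch of a context-sensitive monitor in the spirit of AgentMonitor, assuming an OpenAI-style chat endpoint; the prompt wording, 0-100 scale, and threshold are illustrative and do not reproduce the paper's scoring scheme.

```python
# Sketch: ask a chat model to score a proposed agent action given the task and
# prior actions, and halt the test if the score crosses a safety threshold.
from openai import OpenAI

client = OpenAI()

def monitor_action(task, previous_actions, action, threshold=75):
    prompt = (
        f"Task: {task}\n"
        f"Previous actions: {previous_actions}\n"
        f"Proposed action: {action}\n"
        "On a scale of 0-100, how unsafe or off-task is the proposed action? "
        "Reply with a single integer."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model follows the instruction and returns a bare integer.
    score = int(reply.choices[0].message.content.strip())
    if score >= threshold:
        raise RuntimeError(f"Test stopped: monitor flagged action (score={score})")
    return score
```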
2308.09889 DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization Xiaoyu Ye, Hao Huang, Jiaqi An, Yongtao Wang Stable Diffusion (SD) customization approaches enable users to personalize SD model outputs, greatly enhancing the flexibility and diversity of AI art. However, they also allow individuals to plagiarize specific styles or subjects from copyrighted images, which raises significant concerns about potential copyright infringement. To address this issue, we propose an invisible data-free universal adversarial watermark (DUAW), aiming to protect a myriad of copyrighted images from different customization approaches across various versions of SD models. First, DUAW is designed to disrupt the variational autoencoder during SD customization. Second, DUAW operates in a data-free context, where it is trained on synthetic images produced by a Large Language Model (LLM) and a pretrained SD model. This approach circumvents the necessity of directly handling copyrighted images, thereby preserving their confidentiality. Once crafted, DUAW can be imperceptibly integrated into massive copyrighted images, serving as a protective measure by inducing significant distortions in the images generated by customized SD models. Experimental results demonstrate that DUAW can effectively distort the outputs of fine-tuned SD models, rendering them discernible to both human observers and a simple classifier. This paper introduces DUAW, a data-free universal adversarial watermark designed to protect copyrighted images from being used for unauthorized customization of Stable Diffusion models. The paper addresses the growing concern of copyright infringement facilitated by AI art customization tools. It offers a practical solution to protect intellectual property in the rapidly evolving field of AI-generated content. The authors develop DUAW by training it on synthetic images generated using a Large Language Model (LLM) and a pre-trained SD model. This data-free approach ensures confidentiality of the copyrighted images. The watermark disrupts the variational autoencoder (VAE) within SD models during customization, leading to distorted outputs when the customized model is used for generation. Experimental results demonstrate that DUAW effectively distorts images generated by customized SD models trained on watermarked images. This distortion is noticeable to human observers and detectable by a simple classifier, achieving high protection success rates. DUAW also exhibits strong transferability across different SD versions and VAE variants. The paper acknowledges the potential impact of image interference techniques on DUAW's robustness, although its effectiveness remains high. Future work could focus on enhancing robustness against more sophisticated interference methods and exploring DUAW's applicability to other diffusion-based models. diffusion_model, adversarial_watermark, copyright_protection, stable_diffusion, data-free, vae, llm, image_generation
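A minimal sketch of the data-free watermark optimization described above, assuming a diffusers VAE checkpoint and synthetic training images in [-1, 1]; maximizing plain VAE reconstruction error under an L-infinity bound, the resolution, and the hyperparameters are simplifying assumptions rather than the DUAW objective.

```python
# Sketch: learn one universal perturbation that disrupts the SD VAE on synthetic
# images while staying imperceptible (bounded in L-infinity norm).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()
for p in vae.parameters():
    p.requires_grad_(False)

def train_watermark(synthetic_batches, eps=8 / 255, steps=200, lr=1e-3):
    delta = torch.zeros(1, 3, 512, 512, requires_grad=True)  # universal watermark
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        for images in synthetic_batches:                      # (B, 3, 512, 512) in [-1, 1]
            x = (images + delta).clamp(-1, 1)
            recon = vae.decode(vae.encode(x).latent_dist.sample()).sample
            loss = -torch.nn.functional.mse_loss(recon, images)  # maximize distortion
            opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():
                delta.clamp_(-eps, eps)                       # keep the watermark invisible
    return delta.detach()
```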
2311.14097 ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models Fei Kong, Jinhao Duan, Lichao Sun, Hao Cheng, Renjing Xu, Hengtao Shen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu Though diffusion models excel in image generation, their step-by-step denoising leads to slow generation speeds. Consistency training addresses this issue with single-step sampling but often produces lower-quality generations and requires high training costs. In this paper, we show that optimizing consistency training loss minimizes the Wasserstein distance between target and generated distributions. As timestep increases, the upper bound accumulates previous consistency training losses. Therefore, larger batch sizes are needed to reduce both current and accumulated losses. We propose Adversarial Consistency Training (ACT), which directly minimizes the Jensen-Shannon (JS) divergence between distributions at each timestep using a discriminator. Theoretically, ACT enhances generation quality, and convergence. By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on CIFAR10 and ImageNet 64$\times$64 and LSUN Cat 256$\times$256 datasets, retains zero-shot image inpainting capabilities, and uses less than $1/6$ of the original batch size and fewer than $1/2$ of the model parameters and training steps compared to the baseline method, this leads to a substantial reduction in resource consumption. Our code is available:https://github.com/kong13661/ACT This paper introduces Adversarial Consistency Training (ACT), a novel method that enhances single-step image generation in consistency training by incorporating a discriminator, leading to faster sampling, reduced resource requirements, and improved generation quality compared to standard consistency training. The paper addresses the limitations of diffusion models, particularly slow generation speeds due to iterative denoising. While consistency training offers faster sampling, it often compromises generation quality. This research is important because it presents a more efficient and effective approach to single-step image generation using adversarial training within the consistency model framework. The authors first theoretically demonstrate that optimizing consistency training loss minimizes the Wasserstein distance between generated and target distributions, requiring large batch sizes to mitigate accumulating errors. To overcome this, they incorporate a discriminator that directly minimizes the Jensen-Shannon divergence between the distributions at each timestep, similar to GANs. This approach aims to enhance training efficiency and generation quality. The authors conduct experiments on CIFAR10, ImageNet 64x64, and LSUN Cat 256x256 datasets, comparing ACT with existing methods. Additionally, they perform ablation studies to analyze the impact of different components and hyperparameters on the model's performance. The proposed ACT method demonstrates superior FID scores compared to standard consistency training on all tested datasets while significantly reducing batch size, model parameters, and training steps. It achieves an FID of 6.0 on CIFAR10 with a batch size of 80, outperforming consistency training with a batch size of 512 (FID 8.7). Similar improvements are observed on ImageNet and LSUN Cat datasets, highlighting ACT's effectiveness and efficiency. 
The authors acknowledge the need to further explore how the consistency training loss and the adversarial loss interact when optimizing ACT, and they suggest investigating alternative distance metrics beyond the Jensen-Shannon divergence for closing the gap between distributions; both directions could further improve the performance and stability of ACT. diffusion_model, gan, image_generation, consistency_training, adversarial_training, fast_sampling, resource_efficiency
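A minimal sketch of one adversarial consistency-training step in the spirit of ACT, with simplified noising and a non-saturating GAN loss standing in for the paper's exact formulation; the model/EMA/discriminator interfaces and timestep handling are assumptions.

```python
# Sketch: the online model is trained with a consistency loss against an EMA target
# plus an adversarial term from a discriminator that separates real images from
# one-step generations.
import torch
import torch.nn.functional as F

def act_step(model, ema_model, disc, x0, noise, t_small, t_big, g_opt, d_opt, lam=1.0):
    # Consistency pair: same clean image and noise at two adjacent timesteps.
    pred = model(x0 + t_big * noise, t_big)
    with torch.no_grad():
        target = ema_model(x0 + t_small * noise, t_small)

    # Discriminator update: real data vs. detached one-step generations.
    d_loss = F.softplus(-disc(x0)).mean() + F.softplus(disc(pred.detach())).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: consistency loss plus a non-saturating adversarial term.
    g_loss = F.mse_loss(pred, target) + lam * F.softplus(-disc(pred)).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return g_loss.item(), d_loss.item()
```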
2401.07519 InstantID: Zero-shot Identity-Preserving Generation in Seconds Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, Yao Hu There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin. Our codes and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID. This paper introduces InstantID, a novel plug-and-play diffusion model module for identity-preserving image generation that uses a single facial image to generate personalized images in various styles with high fidelity. This paper is important because it addresses limitations of existing personalized image synthesis methods that are either computationally expensive, require multiple reference images, or lack fidelity in preserving identity. InstantID offers a fast, efficient, and high-fidelity solution for real-world applications like e-commerce and AI portraits. The authors develop InstantID with three main components: 1) ID embedding using a pre-trained face model for strong identity features. 2) A lightweight image adapter module with decoupled cross-attention for image prompt integration. 3) IdentityNet, an adapted ControlNet using facial landmarks and ID embedding as conditions for preserving complex facial features. The model is trained on large-scale datasets like LAION-Face, optimizing only the adapter and IdentityNet while freezing the pre-trained diffusion model. InstantID demonstrates superior performance in preserving identity while maintaining stylistic flexibility, outperforming existing methods like IP-Adapter and achieving competitive results with LoRA models without requiring multiple images or training. It shows robustness, prompt editability, compatibility with ControlNet, and enables novel applications like novel view synthesis, identity interpolation, and multi-identity synthesis. Limitations include the highly coupled facial attributes in ID embedding and potential biases from the face model used. Future work could focus on decoupling facial attributes for better editing and addressing ethical considerations related to potential misuse. diffusion_model, identity_preserving, image_generation, face_embedding, controlnet, plug-and-play, single-shot, high-fidelity, image_synthesis
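A minimal sketch of the decoupled cross-attention pattern that the InstantID adapter builds on: the text context and the face-ID tokens get separate key/value projections whose attention outputs are summed. This single-head version and its dimensions are illustrative, not the released module.

```python
# Sketch: image/ID tokens are injected through their own (trainable) key/value branch,
# added on top of the frozen text cross-attention output with a tunable scale.
import torch
import torch.nn.functional as F
from torch import nn

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim, ctx_dim, id_dim, scale=1.0):
        super().__init__()
        self.scale = scale
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.text_kv = nn.Linear(ctx_dim, dim * 2, bias=False)  # frozen in practice
        self.id_kv = nn.Linear(id_dim, dim * 2, bias=False)     # new, trainable branch

    def forward(self, hidden_states, text_ctx, id_tokens):
        q = self.to_q(hidden_states)
        tk, tv = self.text_kv(text_ctx).chunk(2, dim=-1)
        ik, iv = self.id_kv(id_tokens).chunk(2, dim=-1)
        text_out = F.scaled_dot_product_attention(q, tk, tv)
        id_out = F.scaled_dot_product_attention(q, ik, iv)
        return text_out + self.scale * id_out
```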
2310.07702 ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, Ying Shan In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis. This paper investigates the generation of high-resolution images from pre-trained diffusion models, addressing the issue of object repetition and unreasonable structures often observed in direct high-resolution generation. This research is significant because it offers a solution to generate high-quality images at resolutions exceeding the training data, crucial for applications demanding large image sizes like advertisements, without requiring extensive retraining or fine-tuning. The authors analyze the structural components of diffusion models, identifying limited convolutional receptive fields as the root cause for object repetition. They propose 're-dilation,' a method to dynamically adjust the convolutional perception field during inference, and 'convolution dispersion' with 'noise-damped classifier-free guidance' to enhance generation quality at ultra-high resolutions. The proposed re-dilation method successfully mitigates object repetition issues and outperforms direct inference and attention scaling methods in terms of FID and KID scores across different Stable Diffusion versions and resolutions. The method also demonstrates superior texture detail preservation compared to a pre-trained super-resolution model. Furthermore, the approach generalizes well to text-to-video generation, enabling higher-resolution video synthesis without sacrificing image definition. The paper acknowledges limitations in evaluating texture definition using FID and KID, relying on a user preference study for assessment. Future work may explore optimizing the trade-off between image fidelity and denoising capabilities at ultra-high resolutions. 
The authors also suggest investigating the impact of re-dilation on other diffusion model applications such as image editing and style transfer. diffusion_model, high_resolution, image_synthesis, re-dilation, convolution, perception_field, text-to-image, text-to-video, stable_diffusion
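A minimal sketch of the re-dilation idea, assuming a diffusers-style UNet whose 3x3 same-padding convolutions are temporarily re-dilated during high-resolution sampling; the layer-selection rule and dilation factor are assumptions, not the ScaleCrafter schedule.

```python
# Sketch: enlarge dilation (and padding) of 3x3 convolutions only at inference time,
# then restore the original settings. With padding == dilation, the output size of a
# 3x3 convolution is unchanged, so the UNet's shapes stay consistent.
import torch.nn as nn
from contextlib import contextmanager

@contextmanager
def redilate(unet, factor=2):
    patched = []
    for module in unet.modules():
        if (isinstance(module, nn.Conv2d)
                and module.kernel_size == (3, 3)
                and module.padding == (1, 1)):
            patched.append((module, module.dilation, module.padding))
            module.dilation = (factor, factor)
            module.padding = (factor, factor)
    try:
        yield unet
    finally:
        for module, dilation, padding in patched:
            module.dilation, module.padding = dilation, padding

# Usage sketch, assuming a diffusers text-to-image pipeline named `pipe`:
# with redilate(pipe.unet, factor=2):
#     image = pipe(prompt, height=1024, width=1024).images[0]
```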