Papers (newest first; each entry is a title followed by its arXiv ID; a small ID-lookup sketch follows the list)

  • FIFO-Diffusion: Generating Infinite Videos from Text without Training 2405.11473
  • Compositional Text-to-Image Generation with Dense Blob Representations 2405.08246
  • The Platonic Representation Hypothesis 2405.07987
  • Controllable Image Generation With Composed Parallel Token Prediction 2405.06535
  • Distilling Diffusion Models into Conditional GANs 2405.05967
  • Could It Be Generated? Towards Practical Analysis of Memorization in Text-To-Image Diffusion Models 2405.05846
  • MasterWeaver: Taming Editability and Identity for Personalized Text-to-Image Generation 2405.05806
  • A Survey on Personalized Content Synthesis with Diffusion Models 2405.05538
  • Variational Schrödinger Diffusion Models 2405.04795
  • xLSTM: Extended Long Short-Term Memory 2405.04517
  • Vision Mamba: A Comprehensive Survey and Taxonomy 2405.04404
  • Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer 2405.04312
  • Video Diffusion Models: A Survey 2405.03150
  • U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers 2405.02730
  • Customizing Text-to-Image Models with a Single Image Pair 2405.01536
  • LocInv: Localization-aware Inversion for Text-Guided Image Editing 2405.01496
  • Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance 2405.01356
  • On Mechanistic Knowledge Localization in Text-to-Image Generative Models 2405.01008
  • Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models 2405.00760
  • KAN: Kolmogorov-Arnold Networks 2404.19756
  • Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation 2404.19752
  • Espresso: Robust Concept Filtering in Text-to-Image Models 2404.19227
  • Stylus: Automatic Adapter Selection for Diffusion Models 2404.18928
  • A Survey on Vision Mamba: Models, Applications and Challenges 2404.18861
  • CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data 2404.15653
  • GLoD: Composing Global Contexts and Local Details in Image Generation 2404.15447
  • Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data 2404.14367
  • Analysis of Classifier-Free Guidance Weight Schedulers 2404.13040
  • Lazy Diffusion Transformer for Interactive Image Editing 2404.12382
  • Dynamic Typography: Bringing Text to Life via Video Diffusion Prior 2404.11614
  • Probing the 3D Awareness of Visual Foundation Models 2404.08636
  • Connecting NeRFs, Images, and Text 2404.07993
  • View Selection for 3D Captioning via Diffusion Ranking 2404.07984
  • Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models 2404.07724
  • CAT: Contrastive Adapter Training for Personalized Image Generation 2404.07554
  • Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs 2404.07449
  • DiffHarmony: Latent Diffusion Model Meets Image Harmonization 2404.06139
  • LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders 2404.05961
  • Finding Visual Task Vectors 2404.05729
  • SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing 2404.05717
  • MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation 2404.05674
  • A Training-Free Plug-and-Play Watermark Framework for Stable Diffusion 2404.05607
  • UniFL: Improve Stable Diffusion via Unified Feedback Learning 2404.05595
  • Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance 2404.05384
  • Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt 2404.05331
  • MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators 2404.05014
  • Dynamic Prompt Optimizing for Text-to-Image Generation 2404.04095
  • Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models 2404.03913
  • RL for Consistency Models: Faster Reward Guided Text-to-Image Generation 2404.03673
  • Robust Concept Erasure Using Task Vectors 2404.03631
  • LCM-Lookahead for Encoder-based Text-to-Image Personalization 2404.03620
  • ReFT: Representation Finetuning for Language Models 2404.03592
  • On the Scalability of Diffusion-based Text-to-Image Generation 2404.02883
  • Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models 2404.02747
  • LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP 2404.02285
  • Mixture-of-Depths: Dynamically allocating compute in transformer-based language models 2404.02258
  • Iterated Learning Improves Compositionality in Large Vision-Language Models 2404.02145
  • Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models 2404.01231
  • TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias 2404.00384
  • Capability-aware Prompt Reformulation Learning for Text-to-Image Generation 2403.19716
  • TextCraftor: Your Text Encoder Can be Image Quality Controller 2403.18978
  • Attention Calibration for Disentangled Text-to-Image Personalization 2403.18551
  • Tutorial on Diffusion Models for Imaging and Vision 2403.18103
  • Improving Text-to-Image Consistency via Automatic Prompt Optimization 2403.17804
  • Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance 2403.17377
  • SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions 2403.16627
  • Long-CLIP: Unlocking the Long-Text Capability of CLIP 2403.15378
  • ReNoise: Real Image Inversion Through Iterative Noising 2403.14602
  • MyVLM: Personalizing VLMs for User-Specific Queries 2403.14599
  • Implicit Style-Content Separation using B-LoRA 2403.14572
  • Editing Massive Concepts in Text-to-Image Diffusion Models 2403.13807
  • Evolutionary Optimization of Model Merging Recipes 2403.13187
  • When Do We Not Need Larger Vision Models? 2403.13043
  • FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis 2403.12963
  • You Only Sample Once: Taming One-Step Text-To-Image Synthesis by Self-Cooperative Diffusion GANs 2403.12931
  • Graph Neural Networks for Learning Equivariant Representations of Neural Networks 2403.12143
  • Reward Guided Latent Consistency Distillation 2403.11027
  • Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation 2403.07860
  • ORPO: Monolithic Preference Optimization without Reference Model 2403.07691
  • Stealing Part of a Production Language Model 2403.06634
  • ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment 2403.05135
  • PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation 2403.04692
  • What do we learn from inverting CLIP models? 2403.02580
  • Model Lakes 2403.02327
  • Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models 2402.19427
  • WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts 2402.18956
  • Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models 2402.17177
  • Transparent Image Layer Diffusion using Latent Transparency 2402.17113
  • Asymmetry in Low-Rank Adapters of Foundation Models 2402.16842
  • Training Neural Networks from Scratch with Parallel Low-Rank Adapters 2402.16828
  • Advancing Parameter Efficiency in Fine-tuning via Representation Editing 2402.15179
  • Fine-tuning CLIP Text Encoders with Two-step Paraphrasing 2402.15120
  • Consolidating Attention Features for Multi-view Image Editing 2402.14792
  • SDXL-Lightning: Progressive Adversarial Diffusion Distillation 2402.13929
  • LoRA+: Efficient Low Rank Adaptation of Large Models 2402.12354
  • Direct Consistency Optimization for Compositional Text-to-Image Personalization 2402.12004
  • Aligning Modalities in Vision Large Language Models via Preference Fine-tuning 2402.11411
  • Speculative Streaming: Fast LLM Inference without Auxiliary Models 2402.11131
  • Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation 2402.10491
  • Recovering the Pre-Fine-Tuning Weights of Generative Models 2402.10208
  • Large Language Models: A Survey 2402.06196
  • Can MLLMs Perform Text-to-Image In-Context Learning? 2402.01293
  • Compositional Generative Modeling: A Single Model is Not All You Need 2402.01103
  • AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error 2401.17879
  • Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding 2401.15708
  • Lumiere: A Space-Time Diffusion Model for Video Generation 2401.12945
  • West-of-N: Synthetic Preference Generation for Improved Reward Modeling 2401.12086
  • Edit One for All: Interactive Batch Image Editing 2401.10219
  • Self-Rewarding Language Models 2401.10020
  • SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers 2401.08740
  • Benchmarking the Robustness of Image Watermarks 2401.08573
  • InstantID: Zero-shot Identity-Preserving Generation in Seconds 2401.07519
  • Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning 2401.06805
  • Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs 2401.06209
  • Score Distillation Sampling with Learned Manifold Corrective 2401.05293
  • A Minimaximalist Approach to Reinforcement Learning from Human Feedback 2401.04056
  • Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models 2401.01335
  • Diffusion Model with Perceptual Loss 2401.00110
  • V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs 2312.14135
  • The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction 2312.13558
  • Generative Multimodal Models are In-Context Learners 2312.13286
  • Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment 2312.12148
  • Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models 2312.10835
  • Perspectives on the State and Future of Deep Learning - 2023 2312.09323
  • Vision-Language Models as a Source of Rewards 2312.09187
  • DiffusionLight: Light Probes for Free by Painting a Chrome Ball 2312.09168
  • A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions 2312.08578
  • Accelerating the Global Aggregation of Local Explanations 2312.07991
  • DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing 2312.07409
  • CAD: Photorealistic 3D Generation via Adversarial Distillation 2312.06663
  • Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior 2312.06655
  • Using Captum to Explain Generative Language Models 2312.05491
  • SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation 2312.05239
  • Localized Symbolic Knowledge Distillation for Visual Commonsense Models 2312.04837
  • Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models 2312.04410
  • Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment 2312.03766
  • Return of Unconditional Generation: A Self-supervised Representation Generation Method 2312.03701
  • FaceStudio: Put Your Face Everywhere in Seconds 2312.02663
  • Object Recognition as Next Token Prediction 2312.02142
  • DiffiT: Diffusion Vision Transformers for Image Generation 2312.02139
  • Style Aligned Image Generation via Shared Attention 2312.02133
  • GIVT: Generative Infinite-Vocabulary Transformers 2312.02116
  • Sequential Modeling Enables Scalable Learning for Large Vision Models 2312.00785
  • VideoBooth: Diffusion-based Video Generation with Image Prompts 2312.00777
  • One-step Diffusion with Distribution Matching Distillation 2311.18828
  • Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing 2311.18608
  • HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation 2311.18158
  • PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation 2311.17086
  • DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling 2311.17082
  • Adversarial Diffusion Distillation 2311.17042
  • Scalable Extraction of Training Data from (Production) Language Models 2311.17035
  • Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer 2311.17009
  • Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following 2311.17002
  • DemoFusion: Democratising High-Resolution Image Generation With No $$$ 2311.16973
  • Self-correcting LLM-controlled Diffusion Models 2311.16090
  • Enhancing Diffusion Models with Text-Encoder Reinforcement Learning 2311.15657
  • Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets 2311.15127
  • ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models 2311.14097
  • Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models 2311.13833
  • ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs 2311.13600
  • Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model 2311.13231
  • MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning 2311.13127
  • Diffusion Model Alignment Using Direct Preference Optimization 2311.12908
  • Toward effective protection against diffusion based mimicry through score distillation 2311.12832
  • NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation 2311.12229
  • Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models 2311.12092
  • An Image is Worth Multiple Words: Multi-attribute Inversion for Constrained Text-to-Image Synthesis 2311.11919
  • Exponentially Faster Language Modelling 2311.10770
  • Testing Language Model Agents Safely in the Wild 2311.10538
  • High-fidelity Person-centric Subject-to-Image Synthesis 2311.10329
  • The Chosen One: Consistent Characters in Text-to-Image Diffusion Models 2311.10093
  • UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs 2311.09257
  • First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models 2311.05020
  • Future Lens: Anticipating Subsequent Tokens from a Single Hidden State 2311.04897
  • Instruct Me More! Random Prompting for Visual In-Context Learning 2311.03648
  • Cross-Image Attention for Zero-Shot Appearance Transfer 2311.03335
  • Idempotent Generative Network 2311.01462
  • The Expressive Power of Low-Rank Adaptation 2310.17513
  • Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution 2310.16834
  • MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion 2310.14729
  • Localizing and Editing Knowledge in Text-to-Image Generative Models 2310.13730
  • On the Language Encoder of Contrastive Cross-modal Models 2310.13267
  • An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning 2310.12274
  • Quality Diversity through Human Feedback 2310.12103
  • A General Theoretical Paradigm to Understand Learning from Human Preferences 2310.12036
  • To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images … For Now 2310.11868
  • Context-Aware Meta-Learning 2310.10971
  • ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models 2310.07702
  • State of the Art on Diffusion Models for Visual Computing 2310.07204
  • Interpreting CLIP’s Image Representation via Text-Based Decomposition 2310.05916
  • NEFTune: Noisy Embeddings Improve Instruction Finetuning 2310.05914
  • No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling 2310.05654
  • Improving Adversarial Attacks on Latent Diffusion Model 2310.04687
  • Aligning Text-to-Image Diffusion Models with Reward Backpropagation 2310.03739
  • Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion 2310.03502
  • Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code 2310.01506
  • PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis 2310.00426
  • Directly Fine-Tuning Diffusion Models on Differentiable Rewards 2309.17400
  • Demystifying CLIP Data 2309.16671
  • Generative Escher Meshes 2309.14564
  • TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance 2309.12314
  • FreeU: Free Lunch in Diffusion U-Net 2309.11497
  • On Model Explanations with Transferable Neural Pathways 2309.09887
  • Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models 2309.07986
  • Generative Image Dynamics 2309.07906
  • Mitigate Replication and Copying in Diffusion Models with Generalized Caption and Dual Fusion Enhancement 2309.07254
  • PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models 2309.05793
  • MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers 2309.04372
  • Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis 2309.03904
  • FIND: A Function Description Benchmark for Evaluating Interpretability Methods 2309.03886
  • Model Inversion Attack via Dynamic Memory Learning 2309.00013
  • Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images 2308.16582
  • MVDream: Multi-view Diffusion for 3D Generation 2308.16512
  • Elucidating the Exposure Bias in Diffusion Models 2308.15321
  • Unified Concept Editing in Diffusion Models 2308.14761
  • Reinforcement Learning for Generative AI: A Survey 2308.14328
  • APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency 2308.12605
  • Diffusion Model as Representation Learner 2308.10916
  • Backdooring Textual Inversion for Concept Censorship 2308.10718
  • Spiking-Diffusion: Vector Quantized Discrete Diffusion Model with Spiking Neural Networks 2308.10187
  • AltDiffusion: A Multilingual Text-to-Image Diffusion Model 2308.09991
  • DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization 2308.09889
  • RLIPv2: Fast Scaling of Relational Language-Image Pre-training 2308.09351
  • Linearity of Relation Decoding in Transformer Language Models 2308.09124
  • Watch Your Steps: Local Image and Scene Editing by Text Instructions 2308.08947
  • ALIP: Adaptive Language-Image Pre-training with Synthetic Caption 2308.08428
  • DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory 2308.08089
  • CoDeF: Content Deformation Fields for Temporally Consistent Video Processing 2308.07926
  • StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models 2308.07863
  • Boosting Multi-modal Model Performance with Adaptive Gradient Modulation 2308.07686
  • A Review of Adversarial Attacks in Computer Vision 2308.07673
  • Inversion-by-Inversion: Exemplar-based Sketch-to-Photo Synthesis via Stochastic Differential Equations without Training 2308.07665
  • Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval 2308.07648
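
For keeping a list like this in sync with arXiv, here is a minimal sketch that resolves the IDs above back to their current titles and abstract links. It assumes Python with network access and uses only the public arXiv Atom API at http://export.arxiv.org/api/query; the two IDs in the usage example are taken from the list above.

```python
# Minimal sketch: resolve arXiv IDs from the list above to titles and links.
# Assumes the public arXiv export API (http://export.arxiv.org/api/query).
import re
import urllib.request

ARXIV_API = "http://export.arxiv.org/api/query?id_list={ids}"

def fetch_titles(arxiv_ids):
    """Resolve arXiv IDs (e.g. '2405.11473') to their current titles."""
    url = ARXIV_API.format(ids=",".join(arxiv_ids))
    with urllib.request.urlopen(url) as resp:
        feed = resp.read().decode("utf-8")
    # The feed is Atom XML; each <entry> holds one paper, and the first
    # <title> inside an <entry> is that paper's title.
    titles = re.findall(r"<entry>.*?<title>(.*?)</title>", feed, re.S)
    # arXiv wraps long titles across lines, so collapse the whitespace.
    return [" ".join(t.split()) for t in titles]

if __name__ == "__main__":
    ids = ["2405.11473", "2404.19756"]  # sample IDs from the list above
    for arxiv_id, title in zip(ids, fetch_titles(ids)):
        print(f"https://arxiv.org/abs/{arxiv_id}  {title}")
```

The same loop can be pointed at the full ID column to check the list for renamed papers, since arXiv titles occasionally change between revisions.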