FIFO-Diffusion: Generating Infinite Videos from Text without Training 2405.11473
Compositional Text-to-Image Generation with Dense Blob Representations 2405.08246
The Platonic Representation Hypothesis 2405.07987
Controllable Image Generation With Composed Parallel Token Prediction 2405.06535
Distilling Diffusion Models into Conditional GANs 2405.05967
Could It Be Generated? Towards Practical Analysis of Memorization in Text-To-Image Diffusion Models 2405.05846
MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation 2405.05806
A Survey on Personalized Content Synthesis with Diffusion Models 2405.05538
Variational Schrödinger Diffusion Models 2405.04795
xLSTM: Extended Long Short-Term Memory 2405.04517
Vision Mamba: A Comprehensive Survey and Taxonomy 2405.04404
Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer 2405.04312
Video Diffusion Models: A Survey 2405.03150
U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers 2405.02730
Customizing Text-to-Image Models with a Single Image Pair 2405.01536
LocInv: Localization-aware Inversion for Text-Guided Image Editing 2405.01496
Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance 2405.01356
On Mechanistic Knowledge Localization in Text-to-Image Generative Models 2405.01008
Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models 2405.00760
KAN: Kolmogorov-Arnold Networks 2404.19756
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation 2404.19752
Espresso: Robust Concept Filtering in Text-to-Image Models 2404.19227
Stylus: Automatic Adapter Selection for Diffusion Models 2404.18928
A Survey on Vision Mamba: Models, Applications and Challenges 2404.18861
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data 2404.15653
GLoD: Composing Global Contexts and Local Details in Image Generation 2404.15447
Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data 2404.14367
Analysis of Classifier-Free Guidance Weight Schedulers 2404.13040
Lazy Diffusion Transformer for Interactive Image Editing 2404.12382
Dynamic Typography: Bringing Text to Life via Video Diffusion Prior 2404.11614
Probing the 3D Awareness of Visual Foundation Models 2404.08636
Connecting NeRFs, Images, and Text 2404.07993
View Selection for 3D Captioning via Diffusion Ranking 2404.07984
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models 2404.07724
CAT: Contrastive Adapter Training for Personalized Image Generation 2404.07554
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs 2404.07449
DiffHarmony: Latent Diffusion Model Meets Image Harmonization 2404.06139
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders 2404.05961
Finding Visual Task Vectors 2404.05729
SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing 2404.05717
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation 2404.05674
A Training-Free Plug-and-Play Watermark Framework for Stable Diffusion 2404.05607
UniFL: Improve Stable Diffusion via Unified Feedback Learning 2404.05595
Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance 2404.05384
Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt 2404.05331
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators 2404.05014
Dynamic Prompt Optimizing for Text-to-Image Generation 2404.04095
Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models 2404.03913
RL for Consistency Models: Faster Reward Guided Text-to-Image Generation 2404.03673
Robust Concept Erasure Using Task Vectors 2404.03631
LCM-Lookahead for Encoder-based Text-to-Image Personalization 2404.03620
ReFT: Representation Finetuning for Language Models 2404.03592
On the Scalability of Diffusion-based Text-to-Image Generation 2404.02883
Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models 2404.02747
LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP 2404.02285
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models 2404.02258
Iterated Learning Improves Compositionality in Large Vision-Language Models 2404.02145
Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models 2404.01231
TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias 2404.00384
Capability-aware Prompt Reformulation Learning for Text-to-Image Generation 2403.19716
TextCraftor: Your Text Encoder Can be Image Quality Controller 2403.18978
Attention Calibration for Disentangled Text-to-Image Personalization 2403.18551
Tutorial on Diffusion Models for Imaging and Vision 2403.18103
Improving Text-to-Image Consistency via Automatic Prompt Optimization 2403.17804
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance 2403.17377
SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions 2403.16627
Long-CLIP: Unlocking the Long-Text Capability of CLIP 2403.15378
ReNoise: Real Image Inversion Through Iterative Noising 2403.14602
MyVLM: Personalizing VLMs for User-Specific Queries 2403.14599
Implicit Style-Content Separation using B-LoRA 2403.14572
Editing Massive Concepts in Text-to-Image Diffusion Models 2403.13807
Evolutionary Optimization of Model Merging Recipes 2403.13187
When Do We Not Need Larger Vision Models? 2403.13043
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis 2403.12963
You Only Sample Once: Taming One-Step Text-To-Image Synthesis by Self-Cooperative Diffusion GANs 2403.12931
Graph Neural Networks for Learning Equivariant Representations of Neural Networks 2403.12143
Reward Guided Latent Consistency Distillation 2403.11027
Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation 2403.07860
ORPO: Monolithic Preference Optimization without Reference Model 2403.07691
Stealing Part of a Production Language Model 2403.06634
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment 2403.05135
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation 2403.04692
What do we learn from inverting CLIP models? 2403.02580
Model Lakes 2403.02327
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models 2402.19427
WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts 2402.18956
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models 2402.17177
Transparent Image Layer Diffusion using Latent Transparency 2402.17113
Asymmetry in Low-Rank Adapters of Foundation Models 2402.16842
Training Neural Networks from Scratch with Parallel Low-Rank Adapters 2402.16828
Advancing Parameter Efficiency in Fine-tuning via Representation Editing 2402.15179
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing 2402.15120
Consolidating Attention Features for Multi-view Image Editing 2402.14792
SDXL-Lightning: Progressive Adversarial Diffusion Distillation 2402.13929
LoRA+: Efficient Low Rank Adaptation of Large Models 2402.12354
Direct Consistency Optimization for Compositional Text-to-Image Personalization 2402.12004
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning 2402.11411
Speculative Streaming: Fast LLM Inference without Auxiliary Models 2402.11131
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation 2402.10491
Recovering the Pre-Fine-Tuning Weights of Generative Models 2402.10208
Large Language Models: A Survey 2402.06196
Can MLLMs Perform Text-to-Image In-Context Learning? 2402.01293
Compositional Generative Modeling: A Single Model is Not All You Need 2402.01103
AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error 2401.17879
Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding 2401.15708
Lumiere: A Space-Time Diffusion Model for Video Generation 2401.12945
West-of-N: Synthetic Preference Generation for Improved Reward Modeling 2401.12086
Edit One for All: Interactive Batch Image Editing 2401.10219
Self-Rewarding Language Models 2401.10020
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers 2401.08740
Benchmarking the Robustness of Image Watermarks 2401.08573
InstantID: Zero-shot Identity-Preserving Generation in Seconds 2401.07519
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning 2401.06805
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs 2401.06209
Score Distillation Sampling with Learned Manifold Corrective 2401.05293
A Minimaximalist Approach to Reinforcement Learning from Human Feedback 2401.04056
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models 2401.01335
Diffusion Model with Perceptual Loss 2401.00110
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs 2312.14135
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction 2312.13558
Generative Multimodal Models are In-Context Learners 2312.13286
Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment 2312.12148
Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models 2312.10835
Perspectives on the State and Future of Deep Learning - 2023 2312.09323
Vision-Language Models as a Source of Rewards 2312.09187
DiffusionLight: Light Probes for Free by Painting a Chrome Ball 2312.09168
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions 2312.08578
Accelerating the Global Aggregation of Local Explanations 2312.07991
DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing 2312.07409
CAD: Photorealistic 3D Generation via Adversarial Distillation 2312.06663
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior 2312.06655
Using Captum to Explain Generative Language Models 2312.05491
SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation 2312.05239
Localized Symbolic Knowledge Distillation for Visual Commonsense Models 2312.04837
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models 2312.04410
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment 2312.03766
Return of Unconditional Generation: A Self-supervised Representation Generation Method 2312.03701
FaceStudio: Put Your Face Everywhere in Seconds 2312.02663
Object Recognition as Next Token Prediction 2312.02142
DiffiT: Diffusion Vision Transformers for Image Generation 2312.02139
Style Aligned Image Generation via Shared Attention 2312.02133
GIVT: Generative Infinite-Vocabulary Transformers 2312.02116
Sequential Modeling Enables Scalable Learning for Large Vision Models 2312.00785
VideoBooth: Diffusion-based Video Generation with Image Prompts 2312.00777
One-step Diffusion with Distribution Matching Distillation 2311.18828
Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing 2311.18608
HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation 2311.18158
PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation 2311.17086
DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling 2311.17082
Adversarial Diffusion Distillation 2311.17042
Scalable Extraction of Training Data from (Production) Language Models 2311.17035
Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer 2311.17009
Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following 2311.17002
DemoFusion: Democratising High-Resolution Image Generation With No $$$ 2311.16973
Self-correcting LLM-controlled Diffusion Models 2311.16090
Enhancing Diffusion Models with Text-Encoder Reinforcement Learning 2311.15657
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets 2311.15127
ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models 2311.14097
Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models 2311.13833
ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs 2311.13600
Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model 2311.13231
MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning 2311.13127
Diffusion Model Alignment Using Direct Preference Optimization 2311.12908
Toward effective protection against diffusion based mimicry through score distillation 2311.12832
NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation 2311.12229
Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models 2311.12092
An Image is Worth Multiple Words: Multi-attribute Inversion for Constrained Text-to-Image Synthesis 2311.11919
Exponentially Faster Language Modelling 2311.10770
Testing Language Model Agents Safely in the Wild 2311.10538
High-fidelity Person-centric Subject-to-Image Synthesis 2311.10329
The Chosen One: Consistent Characters in Text-to-Image Diffusion Models 2311.10093
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs 2311.09257
First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models 2311.05020
Future Lens: Anticipating Subsequent Tokens from a Single Hidden State 2311.04897
Instruct Me More! Random Prompting for Visual In-Context Learning 2311.03648
Cross-Image Attention for Zero-Shot Appearance Transfer 2311.03335
Idempotent Generative Network 2311.01462
The Expressive Power of Low-Rank Adaptation 2310.17513
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution 2310.16834
MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion 2310.14729
Localizing and Editing Knowledge in Text-to-Image Generative Models 2310.13730
On the Language Encoder of Contrastive Cross-modal Models 2310.13267
An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning 2310.12274
Quality Diversity through Human Feedback 2310.12103
A General Theoretical Paradigm to Understand Learning from Human Preferences 2310.12036
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images … For Now 2310.11868
Context-Aware Meta-Learning 2310.10971
ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models 2310.07702
State of the Art on Diffusion Models for Visual Computing 2310.07204
Interpreting CLIP’s Image Representation via Text-Based Decomposition 2310.05916
NEFTune: Noisy Embeddings Improve Instruction Finetuning 2310.05914
No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling 2310.05654
Improving Adversarial Attacks on Latent Diffusion Model 2310.04687
Aligning Text-to-Image Diffusion Models with Reward Backpropagation 2310.03739
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion 2310.03502
Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code 2310.01506
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis 2310.00426
Directly Fine-Tuning Diffusion Models on Differentiable Rewards 2309.17400
Demystifying CLIP Data 2309.16671
Generative Escher Meshes 2309.14564
TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance 2309.12314
FreeU: Free Lunch in Diffusion U-Net 2309.11497
On Model Explanations with Transferable Neural Pathways 2309.09887
Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models 2309.07986
Generative Image Dynamics 2309.07906
Mitigate Replication and Copying in Diffusion Models with Generalized Caption and Dual Fusion Enhancement 2309.07254
PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models 2309.05793
MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers 2309.04372
Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis 2309.03904
FIND: A Function Description Benchmark for Evaluating Interpretability Methods 2309.03886
Model Inversion Attack via Dynamic Memory Learning 2309.00013
Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images 2308.16582
MVDream: Multi-view Diffusion for 3D Generation 2308.16512
Elucidating the Exposure Bias in Diffusion Models 2308.15321
Unified Concept Editing in Diffusion Models 2308.14761
Reinforcement Learning for Generative AI: A Survey 2308.14328
APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency 2308.12605
Diffusion Model as Representation Learner 2308.10916
Backdooring Textual Inversion for Concept Censorship 2308.10718
Spiking-Diffusion: Vector Quantized Discrete Diffusion Model with Spiking Neural Networks 2308.10187
AltDiffusion: A Multilingual Text-to-Image Diffusion Model 2308.09991
DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization 2308.09889
RLIPv2: Fast Scaling of Relational Language-Image Pre-training 2308.09351
Linearity of Relation Decoding in Transformer Language Models 2308.09124
Watch Your Steps: Local Image and Scene Editing by Text Instructions 2308.08947
ALIP: Adaptive Language-Image Pre-training with Synthetic Caption 2308.08428
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory 2308.08089
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing 2308.07926
StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models 2308.07863
Boosting Multi-modal Model Performance with Adaptive Gradient Modulation 2308.07686
A Review of Adversarial Attacks in Computer Vision 2308.07673
Inversion-by-Inversion: Exemplar-based Sketch-to-Photo Synthesis via Stochastic Differential Equations without Training 2308.07665
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval 2308.07648