Asymmetry in Low-Rank Adapters of Foundation Models
Authors: Jiacheng Zhu, Kristjan Greenewald, Kimia Nadjahi, Haitz Sáez de Ocáriz Borde, Rickard Brüel Gabrielsson, Leshem Choshen, Marzyeh Ghassemi, Mikhail Yurochkin, Justin Solomon
What
This paper investigates the asymmetry between the two adapter matrices in Low-Rank Adaptation (LoRA) when fine-tuning foundation models, finding that the matrix that projects input features to a lower dimension (A) plays a less crucial role than the matrix that maps these low-dimensional features to the output (B).
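As a concrete reference for the two roles, below is a minimal PyTorch sketch of a LoRA-adapted linear layer. It is not the authors' implementation; the class name, initialization, and scaling are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: frozen W0 plus a low-rank update B @ A."""
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen pre-trained weight W0 (placeholder init; in practice copied from the base model).
        self.weight = nn.Parameter(0.02 * torch.randn(out_features, in_features), requires_grad=False)
        # A: (rank, in_features) -- projects the input down to a rank-dimensional feature space.
        self.A = nn.Parameter(torch.randn(rank, in_features) / rank ** 0.5)
        # B: (out_features, rank) -- maps those features to the output; zero init so the
        # adapted layer starts out identical to the pre-trained one.
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W0^T + scaling * x A^T B^T, i.e. the effective weight is W0 + scaling * (B @ A).
        return x @ self.weight.T + self.scaling * (x @ self.A.T) @ self.B.T
```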
Why
The paper is important because it provides theoretical and empirical evidence for simplifying and improving the efficiency of LoRA fine-tuning, suggesting that using a fixed, randomly initialized A matrix while solely tuning B can lead to comparable or better performance with reduced parameter usage and improved generalization.
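A sketch of that "freeze A, tune B" recipe, assuming the LoRALinear module from the previous sketch (helper name and optimizer settings are hypothetical):

```python
import torch
import torch.nn as nn

def freeze_A_train_B(model: nn.Module) -> list[nn.Parameter]:
    """Fix every adapter's A at a random draw and return the trainable B matrices."""
    trainable = []
    for module in model.modules():
        if isinstance(module, LoRALinear):
            nn.init.orthogonal_(module.A)    # random (semi-)orthogonal A, never updated
            module.A.requires_grad_(False)
            module.B.requires_grad_(True)
            trainable.append(module.B)
    return trainable

# Hypothetical usage: only the B matrices (plus any task head) reach the optimizer.
# optimizer = torch.optim.AdamW(freeze_A_train_B(model), lr=1e-4)
```

At a fixed rank, this halves the number of trainable adapter parameters relative to standard LoRA, since A is never updated.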
How
The authors study the asymmetry theoretically, for linear regression and for general nonlinear losses, and empirically across diverse tasks, including natural language understanding (GLUE, MMLU), generation (XSum, CNN/DailyMail), and image classification (DomainBed), using RoBERTa, BART-Large, LLaMA-2, and Vision Transformer (ViT) models.
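To make the linear-regression side of the analysis concrete, the NumPy sketch below (illustrative dimensions, not the paper's exact setup) shows that once A is frozen at a random draw, fitting B reduces to an ordinary least-squares problem in the projected features X A^T:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, r = 200, 64, 32, 4                   # samples, input dim, output dim, adapter rank

# "Pre-trained" weight W0 plus an unknown low-rank shift Delta that fine-tuning must recover.
W0 = 0.1 * rng.normal(size=(k, d))
Delta = rng.normal(size=(k, r)) @ rng.normal(size=(r, d))
X = rng.normal(size=(n, d))
Y = X @ (W0 + Delta).T + 0.01 * rng.normal(size=(n, k))

# With A frozen, the adapter's contribution is X A^T B^T, so the optimal B is the
# least-squares solution for the residual R against the projected features Z = X A^T.
A = rng.normal(size=(r, d)) / np.sqrt(d)
R = Y - X @ W0.T
Z = X @ A.T                                   # (n, r) projected features
B = np.linalg.lstsq(Z, R, rcond=None)[0].T    # (k, r) closed-form optimal B for this A
fit = X @ (W0 + B @ A).T
print("relative residual after tuning B only:", np.linalg.norm(Y - fit) / np.linalg.norm(Y))
```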
Results
The key results demonstrate that: (1) Tuning only the B matrix in LoRA generally outperforms tuning only A, confirming its greater importance. (2) Using a random orthogonal matrix for A while tuning B can achieve comparable or even superior performance to standard LoRA, especially when the rank of B is increased to match the parameter count, suggesting this approach improves parameter efficiency and generalization. (3) The asymmetry and benefits of tuning only B are observed across different models (RoBERTa, BART-Large, LLaMA-2, ViT) and tasks, including language understanding, generation, and image classification, indicating its broad applicability.
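Back-of-the-envelope accounting behind result (2), as a small sketch (the width below is illustrative): with equal input/output width d, standard LoRA trains A (r x d) and B (d x r), i.e. 2*d*r parameters, so training B alone at rank 2r spends the same budget.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int, train_A: bool, train_B: bool) -> int:
    # Count only the adapter parameters that actually receive gradients.
    return (rank * d_in if train_A else 0) + (d_out * rank if train_B else 0)

d = 4096  # e.g., the hidden width of a LLaMA-2-7B attention projection
assert lora_trainable_params(d, d, rank=8, train_A=True, train_B=True) \
       == lora_trainable_params(d, d, rank=16, train_A=False, train_B=True)
```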
Limitations and Future Work
The paper acknowledges limitations in the theoretical analysis, which primarily covers linear models and single-layer networks, and suggests extending it to more complex and realistic network architectures as future work. Further exploration of the relationship between the random initialization of A and the input data distribution is also proposed.
Abstract
Parameter-efficient fine-tuning optimizes large, pre-trained foundation models by updating a subset of parameters; in this class, Low-Rank Adaptation (LoRA) is particularly effective. Inspired by an effort to investigate the different roles of LoRA matrices during fine-tuning, this paper characterizes and leverages unexpected asymmetry in the importance of low-rank adapter matrices. Specifically, when updating the parameter matrices of a neural network by adding a product BA, we observe that the B and A matrices have distinct functions: A extracts features from the input, while B uses these features to create the desired output. Based on this observation, we demonstrate that fine-tuning B is inherently more effective than fine-tuning A, and that a random untrained A should perform nearly as well as a fine-tuned one. Using an information-theoretic lens, we also bound the generalization of low-rank adapters, showing that the parameter savings of exclusively training B improves the bound. We support our conclusions with experiments on RoBERTa, BART-Large, LLaMA-2, and ViTs.