LoRA+: Efficient Low Rank Adaptation of Large Models
Authors: Soufiane Hayou, Nikhil Ghosh, Bin Yu
What
This paper investigates the efficiency of Low Rank Adaptation (LoRA) for finetuning large language models and identifies suboptimal feature learning when using the same learning rate for adapter matrices A and B, especially in models with large embedding dimensions.
Why
The paper is important because it provides theoretical insights into the optimal setting of learning rates for LoRA, a widely used technique for efficient finetuning of large language models, and proposes a simple yet effective improvement called LoRA+.
How
The authors utilize scaling arguments, analyzing the behavior of LoRA in the infinite-width limit. They study a simplified linear model and then extend their analysis to general neural architectures with LoRA layers, demonstrating the inefficiency of using equal learning rates for A and B and deriving optimal scaling rules for these learning rates.
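For reference, a minimal sketch of the LoRA parametrization the analysis is built on (following Hu et al. (2021); the symbols n for the model width and r for the adapter rank match the paper, while alpha denotes the usual LoRA scaling factor):

    h = (W + (alpha / r) * B A) x,    B \in \mathbb{R}^{n \times r},  A \in \mathbb{R}^{r \times n},  r \ll n

Standard LoRA trains A and B with a single learning rate eta; the paper's infinite-width analysis (n -> infinity) asks how eta_A and eta_B should each scale with n so that the features learned by the adapter remain non-trivial.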
Result
The key finding is that setting different learning rates for the LoRA adapter matrices A and B, specifically η_A = Θ(n^-1) and η_B = Θ(1) where n is the model width (embedding dimension), leads to efficient feature learning in the infinite-width limit. Empirically, they show that LoRA+ with a learning rate ratio of η_B/η_A ≈ 2^4 consistently improves finetuning speed and performance on various tasks and language models, including GPT-2, RoBERTa, and LLaMA-7b.
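A minimal sketch of how this two-learning-rate scheme could be wired up in PyTorch. The parameter-name filters "lora_A"/"lora_B" follow the HuggingFace PEFT naming convention and are an assumption here, as are the illustrative lr_A value and the default ratio of 2^4 = 16 suggested by the paper:

    import torch

    def build_loraplus_optimizer(model, lr_A=2e-5, ratio=16, weight_decay=0.0):
        """AdamW with a larger learning rate for the LoRA B matrices (LoRA+ sketch)."""
        group_A, group_B = [], []
        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue
            # Assumption: adapter parameters are named "lora_A"/"lora_B",
            # as in the HuggingFace PEFT library.
            if "lora_B" in name:
                group_B.append(param)
            else:
                group_A.append(param)  # lora_A (and any other trainable parameters)
        return torch.optim.AdamW(
            [
                {"params": group_A, "lr": lr_A},
                {"params": group_B, "lr": lr_A * ratio},  # eta_B = ratio * eta_A
            ],
            weight_decay=weight_decay,
        )

The optimizer is then used exactly as a standard AdamW instance; only the grouping of parameters by learning rate differs from the usual LoRA setup.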
Limitations and Future Work
The paper acknowledges limitations in precisely determining the optimal learning rate ratio for different tasks and models, suggesting that the ratio is task and model dependent. Future work could involve a more refined analysis to estimate the optimal ratio based on task and model characteristics, potentially leading to further performance improvements.
Abstract
In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for A and B does not allow efficient feature learning. We then show that this suboptimality of LoRA can be corrected simply by setting different learning rates for the LoRA adapter matrices A and B with a well-chosen ratio. We call this proposed algorithm LoRA+. In our extensive experiments, LoRA+ improves performance (1-2% improvements) and finetuning speed (up to ~2X speedup), at the same computational cost as LoRA.