Training Neural Networks from Scratch with Parallel Low-Rank Adapters

Authors: Minyoung Huh, Brian Cheung, Jeremy Bernstein, Phillip Isola, Pulkit Agrawal

What

This paper introduces LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm for training neural networks from scratch using parallel low-rank adapters, addressing the limitations of standard low-rank adaptation in model pre-training.

Why

This paper is important because it tackles the challenge of pre-training large models with limited computing resources by leveraging low-rank adaptations, potentially enabling training on less powerful devices and reducing communication bottlenecks.

How

The authors propose LTE, which trains multiple low-rank adapter heads in parallel on different data shards and synchronizes them only infrequently: the heads' low-rank updates are periodically merged into the main model weights, which reduces communication overhead relative to synchronizing full gradients at every step. A minimal sketch of this loop is given below.
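The following single-process PyTorch sketch is meant only to make the training loop concrete; it is not the authors' implementation. The names (LoRALinear, lte_round), the rank, scale, and merge schedule, and the choice to reset only the B factors after each merge are illustrative assumptions, and the paper's exact parameterization and merge rule may differ.

```python
# Minimal sketch of the LTE idea (assumptions, not the authors' code): N LoRA heads
# are trained on different data shards, and their low-rank updates are periodically
# averaged and merged into the frozen main weights.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update scale * B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # main weights stay frozen between merges
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))  # zero init => no change at start
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def delta(self):
        return self.scale * self.B @ self.A              # this head's proposed weight update


def lte_round(base_layer, heads, shards, opts, local_steps, loss_fn):
    """One outer iteration: train each head on its own shard, then merge the average update.

    Assumes every head wraps the same `base_layer`; in a multi-node setting each worker
    would instead hold its own frozen copy of the main weights.
    """
    for head, shard, opt in zip(heads, shards, opts):
        for _ in range(local_steps):                     # independent local training (parallelizable)
            x, y = next(shard)
            loss = loss_fn(head(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    with torch.no_grad():
        # merge the averaged low-rank updates into the shared main weights
        base_layer.weight += torch.stack([h.delta() for h in heads]).mean(0)
        for h in heads:                                  # reset heads so they track the new base
            h.B.zero_()
```

In a distributed version of this sketch, only the small A and B factors would need to be exchanged at each merge rather than full gradients every step, which is where the communication savings described above would come from.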

Result

LTE demonstrates competitive performance compared to standard pre-training across various vision tasks and datasets, reaching comparable accuracy while offering potential savings in memory and communication.

Limitations and Future Work

Limitations include slower convergence in the later stages of training and the need for further investigation into optimal hyperparameter selection, such as rank and number of heads. Future work involves exploring dynamic rank and head allocation, heterogeneous LoRA parameterization, and advanced merging strategies.

Abstract

The scalability of deep learning models is fundamentally limited by computing resources, memory, and communication. Although low-rank adaptation (LoRA) has reduced the cost of model finetuning, its application in model pre-training remains largely unexplored. This paper explores extending LoRA to model pre-training, identifying the inherent constraints and limitations of standard LoRA in this context. We introduce LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm designed to enable parallel training of multiple low-rank heads across computing nodes, thereby reducing the need for frequent synchronization. We conduct extensive experiments on vision transformers across various vision datasets, demonstrating that LTE is competitive with standard pre-training.