xLSTM: Extended Long Short-Term Memory
Authors: Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter
What
The paper introduces Extended Long Short-Term Memory (xLSTM), a novel recurrent neural network architecture that builds upon the original LSTM by introducing exponential gating with memory mixing and a new memory structure, achieving performance comparable to, and in many cases better than, Transformers and State Space Models in language modeling.
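Because exponential gates can grow without bound, the paper pairs them with a normalizer state and a log-space stabilizer. The snippet below is a minimal NumPy sketch of one scalar-memory update under such a stabilization scheme; the variable names and the use of an exponential forget gate are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def scalar_cell_step(c_prev, n_prev, m_prev, z, i_tilde, f_tilde, o):
    """One scalar-memory update with stabilized exponential gating (sketch).

    c: cell state, n: normalizer, m: log-space stabilizer, z: cell input,
    o: output gate activation, i_tilde / f_tilde: gate pre-activations.
    """
    # Track the running log-scale so that exp() never overflows.
    m = max(f_tilde + m_prev, i_tilde)
    i = np.exp(i_tilde - m)            # stabilized exponential input gate
    f = np.exp(f_tilde + m_prev - m)   # stabilized exponential forget gate
    c = f * c_prev + i * z             # gated cell-state update
    n = f * n_prev + i                 # normalizer accumulates gate mass
    h = o * (c / n)                    # normalized, output-gated hidden state
    return c, n, m, h

# toy usage with scalar states and pre-activations
c, n, m, h = scalar_cell_step(c_prev=0.0, n_prev=1.0, m_prev=0.0,
                              z=0.5, i_tilde=2.0, f_tilde=1.0, o=1.0)
```

Rescaling the cell state and the normalizer by the same stabilizer leaves their ratio, and hence the hidden state, unchanged, so the stabilization only affects numerics.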
Why
This paper is important because it revives the LSTM for large language models, showing that LSTMs, when properly scaled and enhanced, can match and even surpass dominant architectures such as Transformers and State Space Models, with potential impact on other deep learning fields.
How
The authors introduce two new LSTM variants: sLSTM, with exponential gating, a scalar memory, and a new form of memory mixing; and mLSTM, with exponential gating and a matrix memory updated by a covariance rule, which makes it fully parallelizable. They integrate these variants into residual blocks, stack the blocks to form xLSTM architectures, and evaluate them on synthetic tasks, the Long Range Arena, and language modeling benchmarks (SlimPajama and PALOMA).
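For the mLSTM, the covariance update rule stores each key-value pair as an outer product in a matrix memory and retrieves it with a query; because there is no hidden-to-hidden mixing, the recurrence can be parallelized over the sequence. The following NumPy sketch of a single step uses illustrative shapes and assumes the gates are already computed and stabilized:

```python
import numpy as np

def mlstm_step(C_prev, n_prev, q, k, v, i, f, o):
    """One mLSTM step: matrix memory C with a covariance (outer-product) update.

    C_prev: (d, d) matrix memory, n_prev: (d,) normalizer,
    q, k, v: query / key / value vectors of dimension d,
    i, f: scalar input / forget gates, o: (d,) output gate.
    """
    d = k.shape[0]
    k = k / np.sqrt(d)                      # scale keys, as in attention
    C = f * C_prev + i * np.outer(v, k)     # store the key-value pair as an outer product
    n = f * n_prev + i * k                  # normalizer tracks accumulated, gated keys
    h_tilde = C @ q / max(abs(n @ q), 1.0)  # retrieve the value associated with q
    return C, n, o * h_tilde

# toy usage
d = 8
C, n = np.zeros((d, d)), np.zeros(d)
q, k, v = (np.random.randn(d) for _ in range(3))
C, n, h = mlstm_step(C, n, q, k, v, i=1.0, f=0.9, o=np.ones(d))
```

The sLSTM, in contrast, keeps a scalar memory per cell (as in the gating sketch above) but adds memory mixing through recurrent connections, which is also why it, unlike the mLSTM, cannot be parallelized over time.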
Result
The xLSTM architecture outperforms state-of-the-art Transformers, State Space Models, and RNNs in most experiments. Notably, xLSTM excels at sequence length extrapolation, maintaining low perplexity even on contexts longer than those seen during training; exhibits superior memory capacity in associative recall tasks; performs strongly on the Long Range Arena; and achieves state-of-the-art perplexity and downstream-task results on both the SlimPajama and PALOMA language modeling benchmarks.
Limitations and Future Work
Limitations: sLSTM lacks parallelizability due to memory mixing; current CUDA kernels for mLSTM are not fully optimized; mLSTM's matrix memory has high computational complexity; initialization of forget gates requires careful consideration; longer context sizes might overload the matrix memory.
Future work: optimizing CUDA kernels for both sLSTM and mLSTM; exploring alternative memory structures with lower computational complexity; extensive architecture and hyperparameter optimization for larger xLSTM models; application of xLSTM to other deep learning domains beyond language modeling.
Abstract
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
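As a rough illustration of the residual stacking described in the abstract, the sketch below wires placeholder recurrent blocks into a pre-LayerNorm residual stack in PyTorch. The inner cell is deliberately a stand-in (an nn.GRU), not the paper's sLSTM/mLSTM layers, and the exact block layout here is an assumption for illustration:

```python
import torch
from torch import nn

class ResidualRecurrentBlock(nn.Module):
    """Pre-LayerNorm residual wrapper around a sequence cell (sketch).

    The cell maps (batch, time, d_model) -> (batch, time, d_model); a GRU
    stands in here for an sLSTM/mLSTM layer.
    """
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cell = nn.GRU(d_model, d_model, batch_first=True)  # placeholder cell

    def forward(self, x):
        y, _ = self.cell(self.norm(x))
        return x + y  # residual connection around the normalized cell

class ResidualStack(nn.Module):
    """Residually stacked blocks, mirroring how xLSTM blocks are composed."""
    def __init__(self, d_model: int, num_blocks: int):
        super().__init__()
        self.blocks = nn.ModuleList(ResidualRecurrentBlock(d_model) for _ in range(num_blocks))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

# toy forward pass: batch of 2 sequences, length 16, model width 64
out = ResidualStack(d_model=64, num_blocks=4)(torch.randn(2, 16, 64))
```

In the actual architecture, the ratio of mLSTM to sLSTM blocks in the stack is a design choice that the paper ablates.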