Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Authors: Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre

What

This paper introduces Hawk, a pure recurrent architecture, and Griffin, a hybrid architecture that mixes gated linear recurrences with local attention. Both are language models designed to overcome the training and scaling limitations of traditional RNNs while offering inference-efficiency advantages over Transformers on long sequences.

Why

This paper addresses the long-standing challenge of scaling RNNs efficiently for language modeling. Hawk and Griffin are competitive with Transformers on language modeling benchmarks while offering lower inference latency and a smaller memory footprint on long sequences, which matters for applications that require long contexts.

How

The authors developed Hawk, a pure RNN model built on the novel Real-Gated Linear Recurrent Unit (RG-LRU), and Griffin, a hybrid model that combines RG-LRU blocks with local attention. They ran scaling experiments on the MassiveText dataset, training on up to 300B tokens and comparing against Transformer baselines and state-of-the-art models such as Mamba and Llama-2. They also analyzed training efficiency on TPUs, inference speed, long-context extrapolation, and performance on copying and retrieval tasks.
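
To make the core recurrence concrete, below is a minimal JAX sketch of the RG-LRU update based on the gating equations described in the paper. The function and parameter names (rg_lru_scan, w_r, b_r, w_i, b_i, lam) and the tensor shapes are illustrative assumptions, not the authors' reference implementation; in the real models this recurrence runs channel-wise inside a larger residual block.

```python
# Minimal sketch of the RG-LRU recurrence (a diagonal gated linear recurrence).
# Names, shapes, and the standalone scan are illustrative assumptions.
import jax
import jax.numpy as jnp


def rg_lru_scan(x, w_r, b_r, w_i, b_i, lam, c=8.0):
    """Run a Real-Gated LRU over a sequence.

    x:        (seq_len, d) input sequence
    w_r, b_r: recurrence-gate parameters
    w_i, b_i: input-gate parameters
    lam:      (d,) learnable recurrence parameter Lambda
    c:        scalar constant sharpening the gate (assumed value)
    """
    def step(h_prev, x_t):
        r_t = jax.nn.sigmoid(x_t @ w_r + b_r)      # recurrence gate
        i_t = jax.nn.sigmoid(x_t @ w_i + b_i)      # input gate
        # a_t = sigmoid(lam) ** (c * r_t), computed in log space for stability
        log_a_t = -c * r_t * jax.nn.softplus(-lam)
        a_t = jnp.exp(log_a_t)
        # scale the gated input so the hidden state stays bounded
        h_t = a_t * h_prev + jnp.sqrt(1.0 - a_t**2) * (i_t * x_t)
        return h_t, h_t

    h0 = jnp.zeros(x.shape[-1])
    _, h_seq = jax.lax.scan(step, h0, x)
    return h_seq                                    # (seq_len, d)


# Usage sketch with random parameters
key = jax.random.PRNGKey(0)
d, seq_len = 16, 32
ks = jax.random.split(key, 4)
x = jax.random.normal(ks[0], (seq_len, d))
w_r = jax.random.normal(ks[1], (d, d)) / d**0.5
w_i = jax.random.normal(ks[2], (d, d)) / d**0.5
b_r, b_i = jnp.zeros(d), jnp.zeros(d)
lam = jax.random.normal(ks[3], (d,))
h = rg_lru_scan(x, w_r, b_r, w_i, b_i, lam)
print(h.shape)  # (32, 16)
```

Because the recurrence is diagonal and linear in the hidden state, the per-token state is a single vector of size d, which is what gives these models their small inference-time memory footprint relative to a Transformer's growing KV cache.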

Result

Hawk and Griffin exhibit power-law scaling between held-out loss and training compute, matching the scaling efficiency of Transformers. Hawk-3B outperformed Mamba-3B on downstream tasks despite being trained on half as many tokens, and Griffin-7B and Griffin-14B matched Llama-2 despite being trained on over six times fewer tokens. Both models also deliver faster inference, especially on longer sequences, thanks to their smaller memory footprint compared to Transformers, and they extrapolate successfully to sequences significantly longer than those seen during training.

Limitations and Future Work

The authors acknowledge that while Griffin shows promise in copying and retrieval tasks, more research is needed to match the performance of Transformers in this domain, particularly when evaluating pre-trained models without fine-tuning. Future work could also involve exploring different local attention window sizes for Griffin, potentially dynamically adjusting them based on sequence length and hardware constraints.

Abstract

Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
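
To illustrate the "mixing" described in the abstract, here is a small sketch of how a Griffin-style stack could interleave recurrent (RG-LRU) blocks with local sliding-window attention blocks. The 2:1 interleaving pattern, the build_griffin_stack helper, and the window size are assumptions for illustration only, not the authors' exact block layout or hyperparameters.

```python
# Illustrative layer schedule for a Griffin-style hybrid: temporal mixing
# alternates between gated-linear-recurrence blocks and local sliding-window
# attention blocks. Pattern, names, and window size are assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayerSpec:
    kind: str               # "recurrent" (RG-LRU) or "local_attention"
    window: Optional[int]   # attention window in tokens; None for recurrent layers


def build_griffin_stack(depth: int, window: int = 1024) -> list:
    """Return a hypothetical schedule mixing recurrence with local attention."""
    layers = []
    for i in range(depth):
        if i % 3 == 2:  # every third layer uses local attention (assumed ratio)
            layers.append(LayerSpec(kind="local_attention", window=window))
        else:
            layers.append(LayerSpec(kind="recurrent", window=None))
    return layers


for spec in build_griffin_stack(6):
    print(spec)
```

Keeping the attention window fixed bounds the attention cache at inference time regardless of sequence length, which is consistent with the paper's claim of lower latency and higher throughput on long sequences; the limitations section above notes that tuning or dynamically adjusting this window is left to future work.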