Exponentially Faster Language Modelling
Authors: Peter Belcak, Roger Wattenhofer
What
This paper introduces UltraFastBERT, a variant of the BERT language model that replaces standard feedforward networks with fast feedforward networks (FFFs). UltraFastBERT achieves comparable performance to BERT on downstream tasks while using only a small fraction (0.3%) of its neurons for each inference.
Why
This work is significant because it demonstrates the potential of conditional neural execution to deliver substantial speedups in large language models. By showing that only a small fraction of neurons is needed for any individual inference, it challenges the prevailing paradigm of dense computation in these models and opens the door to more efficient implementations.
How
The authors developed UltraFastBERT by replacing the feedforward layers in crammedBERT with FFFs, which organize neurons into a balanced binary tree and conditionally execute only a single root-to-leaf path per inference (sketched below). They trained several UltraFastBERT configurations and evaluated them on the GLUE benchmark, comparing their performance against BERT-base and crammedBERT. They also implemented and benchmarked different CPU and GPU inference routines to measure the speedup obtainable from FFFs.
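To make the mechanism concrete, here is a minimal single-token sketch of an FFF layer at inference time. The class and parameter names, the sign-based branching rule, and the GELU activation are illustrative assumptions rather than the authors' released code, which accompanies the paper.

```python
# Minimal sketch of a fast feedforward (FFF) layer at inference time, for one token
# embedding. Names, the sign-based branching rule, and the GELU activation are
# assumptions for illustration; the authors' published code may differ in detail.
import torch
import torch.nn.functional as F

class FFFSketch(torch.nn.Module):
    def __init__(self, width: int, depth: int):
        super().__init__()
        self.depth = depth
        n_nodes = 2 ** (depth + 1) - 1          # e.g. depth 11 -> 4095 node neurons
        self.w_in = torch.nn.Parameter(torch.randn(n_nodes, width) / width ** 0.5)
        self.w_out = torch.nn.Parameter(torch.randn(n_nodes, width) / width ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (width,) -- only the depth+1 neurons on one root-to-leaf path are evaluated
        y = torch.zeros_like(x)
        node = 0
        for _ in range(self.depth + 1):
            pre = torch.dot(x, self.w_in[node])        # this neuron's pre-activation
            y = y + F.gelu(pre) * self.w_out[node]     # its contribution to the output
            node = 2 * node + (1 if pre > 0 else 2)    # descend left or right on the sign
        return y
```

With depth 11 this gives 2^12 - 1 = 4095 node neurons, of which each token evaluates only 12, matching the figures quoted in the abstract below.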
Result
UltraFastBERT achieved comparable performance to BERT-base on the GLUE benchmark, retaining at least 96% of its downstream performance while using only 0.3% of its neurons per inference. A high-level CPU implementation of conditional matrix multiplication (CMM), the operation at the heart of FFFs, achieved up to a 78x speedup over the optimized dense feedforward baseline. While a fully optimized CMM implementation is not yet available, the results highlight the potential for substantial speed improvements in language modelling.
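For scale, the 12 of 4095 neurons per layer quoted in the abstract works out to 12/4095 ≈ 0.3%, so a back-of-the-envelope ceiling on the feedforward-layer speedup is roughly 4095/12 ≈ 341x; the measured 78x (CPU) and 40x (PyTorch) figures sit well below this, consistent with the implementation limitations discussed next.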
Limitations & Future Work
The authors acknowledge the limitations of the current CMM implementation, which relies on high-level linear algebra routines and lacks support for efficient vector-level sparsity. Future work includes developing native, optimized CMM implementations for both CPUs and GPUs, potentially by introducing hybrid vector-level sparse tensors into deep learning libraries and dedicated device programming interfaces. This would allow the speedup potential demonstrated by UltraFastBERT to be fully realized.
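To illustrate why vector-level sparsity support matters, here is one way a batched CMM can be expressed with today's dense primitives: every tree level requires a per-sample gather of weight rows followed by row-wise dot products, which general-purpose libraries typically execute much less efficiently than a single dense matrix multiplication. The function name, tensor shapes, and branching rule below are illustrative assumptions, not the authors' PyTorch implementation.

```python
# Illustrative sketch (not the authors' code): batched conditional matrix
# multiplication (CMM) expressed through dense gathers, one per tree level.
import torch
import torch.nn.functional as F

def batched_cmm(x: torch.Tensor, w_in: torch.Tensor, w_out: torch.Tensor, depth: int) -> torch.Tensor:
    # x: (batch, width); w_in, w_out: (n_nodes, width) with n_nodes = 2**(depth+1) - 1
    node = torch.zeros(x.shape[0], dtype=torch.long, device=x.device)  # per-sample tree position
    y = torch.zeros_like(x)
    for _ in range(depth + 1):
        rows = w_in[node]                                    # per-sample gather of weight rows
        pre = (x * rows).sum(dim=-1, keepdim=True)           # per-sample dot products
        y = y + F.gelu(pre) * w_out[node]                    # per-sample output contributions
        node = 2 * node + 1 + (pre.squeeze(-1) <= 0).long()  # left child if pre > 0, else right
    return y
```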
Abstract
Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.