Speculative Streaming: Fast LLM Inference without Auxiliary Models
Authors: Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi
What
This paper introduces Speculative Streaming, a single-model speculative decoding approach that accelerates large language model inference by fusing drafting into the target model, changing the objective from next-token to future n-gram prediction.
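To make the change of objective concrete, the sketch below shows one way a future n-gram fine-tuning loss could be written in PyTorch: a shared LM head scores the ordinary next-token stream plus a few extra speculative streams, each trained against targets shifted further into the future. The names (`hidden`, `stream_proj`, `gamma`) and the per-stream adapters are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def ngram_loss(hidden, lm_head, stream_proj, input_ids, gamma=3):
    """Minimal sketch of a future n-gram objective.
    hidden:      (B, T, D) target-model hidden states
    lm_head:     shared vocabulary projection
    stream_proj: list of `gamma` lightweight adapters, one per speculative stream;
                 the j-th stream is trained to predict the token j+1 steps ahead."""
    losses = []
    # Stream 0 is the ordinary next-token objective.
    logits = lm_head(hidden[:, :-1])
    losses.append(F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), input_ids[:, 1:].reshape(-1)))
    # Streams 1..gamma predict tokens further into the future.
    for j in range(1, gamma + 1):
        if hidden.size(1) <= j + 1:
            break
        stream_hidden = stream_proj[j - 1](hidden[:, :-(j + 1)])
        logits = lm_head(stream_hidden)
        targets = input_ids[:, j + 1:]
        losses.append(F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)))
    return sum(losses) / len(losses)
```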
Why
This work matters because it removes the separate, resource-intensive draft models that traditional speculative decoding depends on, simplifying deployment and improving inference efficiency for large language models, especially on resource-constrained devices.
How
The authors introduce multi-stream attention into the target model for n-gram prediction, enabling parallel speculation and verification of candidate tokens within a single forward pass. They utilize tree-structured drafting for efficient exploration of candidate sequences and employ a pruning strategy based on transition probabilities to manage computational cost.
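As a rough illustration of tree-structured drafting with probability-based pruning, the sketch below expands each speculative stream's top-k tokens into candidate branches and keeps only the highest-scoring ones, scoring a branch by the product of its per-step transition probabilities. The beam-style pruning and parameter names (`top_k`, `beam`) are assumptions chosen for clarity, not the paper's exact tree construction.

```python
import torch

def build_draft_tree(stream_logits, top_k=3, beam=4):
    """stream_logits: list of (V,) logit tensors, one per speculative stream,
    ordered by how far ahead each stream predicts.
    Each stream proposes its top-k tokens; branches are scored by the product
    of their transition probabilities, and only the best `beam` branches
    survive each level, bounding the cost of the verification pass."""
    paths = [([], 1.0)]
    for logits in stream_logits:
        probs = torch.softmax(logits, dim=-1)
        top_p, top_ids = probs.topk(top_k)
        expanded = [
            (tokens + [tok], score * p)
            for tokens, score in paths
            for p, tok in zip(top_p.tolist(), top_ids.tolist())
        ]
        # Prune low-probability branches to keep the draft tree small.
        expanded.sort(key=lambda x: x[1], reverse=True)
        paths = expanded[:beam]
    return paths
```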
Result
Speculative Streaming achieves a 1.8-3.1X speedup across tasks such as summarization, structured queries, and meaning representation without sacrificing generation quality. It also matches or exceeds the speedups of Medusa, a recent block-wise decoding approach, while using roughly 10,000X fewer extra parameters, making it well suited to resource-constrained devices.
Limitations and Future Work
The authors acknowledge that the current implementation uses a “hard” matching criterion for draft verification and suggest exploring “soft” matching for potential speedup gains. Future work may involve investigating alternative stream initialization techniques beyond the explored value rotation and dedicated embeddings.
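For context, the first function below sketches a "hard" verification rule in the spirit described here: drafted tokens are accepted only while they exactly match the target model's greedy choice. The second function is a hypothetical "soft" relaxation (top-k acceptance) included only to illustrate the direction the authors mention; the paper does not specify this criterion.

```python
import torch

def verify_hard(draft_tokens, target_logits):
    """Accept drafted tokens only while each exactly matches the target
    model's greedy prediction at that position; stop at the first mismatch."""
    accepted = []
    for tok, logits in zip(draft_tokens, target_logits):
        if tok == int(logits.argmax()):
            accepted.append(tok)
        else:
            break
    return accepted

def verify_soft(draft_tokens, target_logits, top_k=5):
    """Hypothetical relaxation: accept a drafted token if it falls in the
    target model's top-k, trading exact-match fidelity for longer accepted runs."""
    accepted = []
    for tok, logits in zip(draft_tokens, target_logits):
        if tok in logits.topk(top_k).indices.tolist():
            accepted.append(tok)
        else:
            break
    return accepted
```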
Abstract
Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8 - 3.1X in a diverse set of tasks, such as Summarization, Structured Queries, and Meaning Representation, without sacrificing generation quality. Additionally, Speculative Streaming is parameter-efficient. It achieves on-par/higher speed-ups than Medusa-style architectures while using ~10000X fewer extra parameters, making it well-suited for resource-constrained devices.