
💡 Accelerating LLMs by 2x with Graph-structured Speculative Decoding.


Link 👉 https://www.llmwatch.com/p/a-historic-week-for-open-source-ai

Researchers have found a way to make speculative decoding up to 2x faster by generating multiple hypotheses and merging them into a directed acyclic graph (DAG).

Speculative decoding is a technique where a smaller "draft" model generates a hypothesis sequence that the main large language model (LLM) then verifies. Because the expensive model can check a whole batch of proposed tokens in a single forward pass instead of generating them one at a time, this can significantly speed up inference.
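To make the mechanics concrete, here is a minimal, hypothetical sketch of one draft-then-verify round (not the paper's code). The `draft_model` and `target_model` callables are stand-in assumptions: both are greedy predictors, and the target returns a next-token prediction for every position so the entire draft is checked in one expensive call:

```python
def speculative_decode_step(target_model, draft_model, prefix, k=4):
    """One toy round of speculative decoding.

    Assumed interfaces (not from the paper): draft_model(tokens) returns
    one next token; target_model(tokens) returns a prediction for every
    position, i.e. preds[i] is the token it would emit after tokens[:i+1].
    """
    # The small draft model cheaply proposes k tokens, one at a time.
    seq, draft = list(prefix), []
    for _ in range(k):
        tok = draft_model(seq)
        draft.append(tok)
        seq.append(tok)

    # A single target pass scores the prefix plus all k draft tokens.
    preds = target_model(list(prefix) + draft)

    # Keep the longest agreed run; the first mismatch is replaced by the
    # target's own token, so every round yields at least one valid token.
    accepted = list(prefix)
    for proposed, expected in zip(draft, preds[len(prefix) - 1:]):
        if proposed != expected:
            accepted.append(expected)
            break
        accepted.append(proposed)
    else:
        accepted.append(preds[-1])  # all k accepted: one bonus token free
    return accepted
```

The speedup comes from the verify step: one batched pass over k draft tokens costs the large model roughly the same as a single ordinary decoding step.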

The key insight is that these hypotheses often share common token sequences.

By using a DAG to manage them, the new method, called Graph-structured Speculative Decoding (GSD), can predict and merge these recurring sequences efficiently, expanding each shared token run only once.

This drastically reduces the computational cost of the draft model.
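As a toy illustration of why the graph helps, the sketch below (hypothetical, not the paper's implementation) inserts several draft hypotheses into a prefix-sharing token graph. GSD's DAG goes further and also merges recurring subsequences across different branches, but even prefix sharing already shrinks the number of tokens the draft and verify passes must touch:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    token: int
    children: dict = field(default_factory=dict)  # next token -> Node

def insert_hypothesis(root, tokens):
    # Walk/extend the graph, reusing existing nodes for shared tokens so
    # a run shared by several hypotheses is stored and scored only once.
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, Node(tok))
    return node

def count_nodes(root):
    # Unique nodes = tokens that actually need draft/verify compute.
    seen, stack = set(), [root]
    while stack:
        n = stack.pop()
        if id(n) not in seen:
            seen.add(id(n))
            stack.extend(n.children.values())
    return len(seen) - 1  # exclude the empty root

# Three hypotheses total 8 tokens when flattened; shared prefixes
# reduce that to 5 unique nodes in the graph.
root = Node(token=-1)
for hyp in ([5, 9, 2], [5, 9, 7], [5, 3]):
    insert_hypothesis(root, hyp)
print(count_nodes(root))  # -> 5
```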

When applied to a range of LLMs, including a 70B-parameter LLaMA-2 model, GSD achieved speedups of 1.73x to 1.96x compared to standard speculative decoding.
