

Accelerating LLMs by 2× with Graph-Structured Speculative Decoding

Curiosity: How can we make LLM inference faster? What happens when we use graph structures to optimize speculative decoding?

Researchers have developed Graph-structured Speculative Decoding (GSD), making speculative decoding up to 2× faster by generating multiple hypotheses and merging them into a directed acyclic graph (DAG).


Source: https://www.llmwatch.com/p/a-historic-week-for-open-source-ai

Speculative Decoding Overview

Retrieve: Speculative decoding uses a smaller draft model to generate hypotheses validated by the main LLM.

Standard Process:

  1. Draft model generates a hypothesis (a short run of candidate tokens)
  2. Main LLM validates all drafted tokens in a single forward pass
  3. Tokens are accepted up to the first mismatch; the rest are rejected
  4. Speedup comes from this parallel validation

Limitation: Only a single hypothesis path is drafted per step, which caps the achievable speedup.
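The standard loop above can be sketched in a few lines. This is a minimal greedy sketch, not the paper's implementation: `draft_next` and `target_next` are hypothetical stand-ins for the small draft model and the large target model.

```python
def draft_next(prefix):
    # Toy draft model: predicts from a fixed canned continuation.
    canned = [7, 8, 9, 99]
    return canned[len(prefix) % len(canned)]

def target_next(prefix):
    # Toy target model: agrees with the draft on 7, 8, 9 but not 99.
    canned = [7, 8, 9, 42]
    return canned[len(prefix) % len(canned)]

def speculative_step(prefix, k=4):
    """Draft k tokens, then keep the longest run the target agrees with."""
    hypothesis = list(prefix)
    for _ in range(k):
        hypothesis.append(draft_next(hypothesis))
    accepted = list(prefix)
    for tok in hypothesis[len(prefix):]:
        # In a real system all k drafted positions are scored in ONE
        # target forward pass; we call target_next per token for clarity.
        expected = target_next(accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the miss
            break
        accepted.append(tok)
    return accepted

# One step accepts three drafted tokens and corrects the fourth:
print(speculative_step([]))  # [7, 8, 9, 42]
```

Even in this toy, one "target pass" yields four tokens instead of one, which is where the speedup comes from.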

Graph-Structured Speculative Decoding (GSD)

Innovate: GSD generates multiple hypotheses and merges them using a DAG.

Key Insight: Hypotheses often share common token sequences.

GSD Process:

graph TB
    A[Draft Model] --> B[Multiple Hypotheses]
    B --> C[Common Sequences]
    C --> D[DAG Structure]
    D --> E[Merge Sequences]
    E --> F[Main LLM Validation]
    F --> G[2× Speedup]
    
    style A fill:#e1f5ff
    style D fill:#fff3cd
    style G fill:#d4edda

Benefits:

  • ✅ Multiple hypotheses
  • ✅ Efficient merging
  • ✅ Reduced draft model cost
  • ✅ Significant speedup
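The merging idea can be sketched as building a token trie from the drafted hypotheses, so shared tokens are validated only once. This is a simplified view under my own assumptions: a trie merges shared *prefixes*, while GSD's DAG additionally merges common subsequences that reappear after hypotheses diverge.

```python
def merge_hypotheses(hypotheses):
    """Merge token sequences into a prefix trie.

    Returns (nodes, unique_token_count). Each node is a dict
    mapping token -> child node index; node 0 is the root.
    """
    nodes = [{}]
    for seq in hypotheses:
        cur = 0
        for tok in seq:
            if tok not in nodes[cur]:
                nodes.append({})
                nodes[cur][tok] = len(nodes) - 1
            cur = nodes[cur][tok]
    # Every non-root node is one token position the main LLM must check.
    return nodes, len(nodes) - 1

# Three hypotheses of 3 tokens each = 9 drafted tokens,
# but only 6 unique positions need validation:
hyps = [[5, 6, 7], [5, 6, 8], [5, 9, 7]]
_, unique_tokens = merge_hypotheses(hyps)
print(unique_tokens)  # 6
```

The saving grows with the number of hypotheses, since alternative continuations tend to share long common stretches.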

Performance Results

Retrieve: GSD achieves substantial speedups across LLM sizes.

| Model | Speedup | Improvement |
| --- | --- | --- |
| LLaMA-2 70B | 1.73× to 1.96× | ⬆️ Near 2× |
| Other LLMs | Similar gains | ⬆️ Consistent |

Speedup Range: 1.73× to 1.96× (nearly 2× faster!)
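For intuition on where such numbers come from, the standard speculative-decoding analysis gives the expected tokens produced per target forward pass as a function of the per-token acceptance rate α and draft length γ. The formula and the example numbers below are illustrative back-of-envelope figures, not measurements from the GSD paper; GSD improves the effective acceptance rate by covering multiple candidate paths.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens per target forward pass, assuming i.i.d.
    per-token acceptance rate 0 <= alpha < 1 and draft length gamma."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# e.g. 80% acceptance with 4 drafted tokens per step:
e = expected_tokens(0.8, 4)
print(round(e, 2))  # ~3.36 tokens per target pass
```

Whether that translates into wall-clock speedup also depends on how cheap the draft model's passes are relative to the target's.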

Architecture Comparison

graph LR
    A[Standard Speculative] --> B[Single Path]
    B --> C[Sequential Validation]
    
    D[GSD] --> E[Multiple Paths]
    E --> F[DAG Merging]
    F --> G[Parallel Validation]
    
    style A fill:#e1f5ff
    style D fill:#fff3cd
    style G fill:#d4edda

Key Advantages

| Advantage | Description | Impact |
| --- | --- | --- |
| Multiple Hypotheses | Generate several paths | ⬆️ Better coverage |
| DAG Merging | Efficient sequence sharing | ⬇️ Redundant computation |
| Cost Reduction | Less draft model compute | ⬇️ Costs |
| Speedup | 1.73× to 1.96× faster | ⬆️ Performance |

Key Takeaways

Retrieve: Graph-structured speculative decoding achieves 1.73× to 1.96× speedups by generating multiple hypotheses and efficiently merging common sequences using a DAG.

Innovate: By leveraging GSD, you can significantly accelerate LLM inference, reducing computational costs while maintaining quality, making large models more accessible.

Curiosity → Retrieve → Innovate: Start with curiosity about inference acceleration, retrieve insights from GSD’s graph-based approach, and innovate by applying it to speed up your LLM applications.

Next Steps:

  • Explore GSD implementation
  • Test on your models
  • Measure speedup gains
  • Deploy in production
This post is licensed under CC BY 4.0 by the author.