

Accelerating LLMs by 2× with Graph-Structured Speculative Decoding

Curiosity: How can we make LLM inference faster? What happens when we use graph structures to optimize speculative decoding?

Researchers have developed Graph-structured Speculative Decoding (GSD), making speculative decoding up to 2× faster by generating multiple hypotheses and merging them into a directed acyclic graph (DAG).


Source: https://www.llmwatch.com/p/a-historic-week-for-open-source-ai

Speculative Decoding Overview

Retrieve: Speculative decoding uses a smaller draft model to generate hypotheses validated by the main LLM.

Standard Process:

  1. Draft model generates a hypothesis (a short run of candidate tokens)
  2. Main LLM validates all drafted tokens in a single forward pass
  3. Tokens are accepted up to the first mismatch; the rest are rejected
  4. Speedup comes from this parallel validation

Limitation: Only a single hypothesis path is drafted per step, which caps the achievable speedup.
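The standard loop above can be sketched in a few lines. This is a minimal greedy sketch, not the paper's implementation: `draft_next` and `target_next` are hypothetical stand-ins for the small draft model and the large target model.

```python
def draft_next(prefix):
    # Toy draft model: predicts from a fixed canned continuation.
    canned = [7, 8, 9, 99]
    return canned[len(prefix) % len(canned)]

def target_next(prefix):
    # Toy target model: agrees with the draft on 7, 8, 9 but not 99.
    canned = [7, 8, 9, 42]
    return canned[len(prefix) % len(canned)]

def speculative_step(prefix, k=4):
    """Draft k tokens, then keep the longest run the target agrees with."""
    hypothesis = list(prefix)
    for _ in range(k):
        hypothesis.append(draft_next(hypothesis))
    accepted = list(prefix)
    for tok in hypothesis[len(prefix):]:
        # In a real system all k drafted positions are scored in ONE
        # target forward pass; we call target_next per token for clarity.
        expected = target_next(accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the miss
            break
        accepted.append(tok)
    return accepted

# One step accepts three drafted tokens and corrects the fourth:
print(speculative_step([]))  # [7, 8, 9, 42]
```

Even in this toy, one "target pass" yields four tokens instead of one, which is where the speedup comes from.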

Graph-Structured Speculative Decoding (GSD)

Innovate: GSD generates multiple hypotheses and merges them using a DAG.

Key Insight: Hypotheses often share common token sequences.

GSD Process:

graph TB
    A[Draft Model] --> B[Multiple Hypotheses]
    B --> C[Common Sequences]
    C --> D[DAG Structure]
    D --> E[Merge Sequences]
    E --> F[Main LLM Validation]
    F --> G[2× Speedup]
    
    style A fill:#e1f5ff
    style D fill:#fff3cd
    style G fill:#d4edda

Benefits:

  • ✅ Multiple hypotheses
  • ✅ Efficient merging
  • ✅ Reduced draft model cost
  • ✅ Significant speedup
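The merging idea can be sketched as building a token trie from the drafted hypotheses, so shared tokens are validated only once. This is a simplified view under my own assumptions: a trie merges shared *prefixes*, while GSD's DAG additionally merges common subsequences that reappear after hypotheses diverge.

```python
def merge_hypotheses(hypotheses):
    """Merge token sequences into a prefix trie.

    Returns (nodes, unique_token_count). Each node is a dict
    mapping token -> child node index; node 0 is the root.
    """
    nodes = [{}]
    for seq in hypotheses:
        cur = 0
        for tok in seq:
            if tok not in nodes[cur]:
                nodes.append({})
                nodes[cur][tok] = len(nodes) - 1
            cur = nodes[cur][tok]
    # Every non-root node is one token position the main LLM must check.
    return nodes, len(nodes) - 1

# Three hypotheses of 3 tokens each = 9 drafted tokens,
# but only 6 unique positions need validation:
hyps = [[5, 6, 7], [5, 6, 8], [5, 9, 7]]
_, unique_tokens = merge_hypotheses(hyps)
print(unique_tokens)  # 6
```

The saving grows with the number of hypotheses, since alternative continuations tend to share long common stretches.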

Performance Results

Retrieve: GSD achieves substantial speedups across LLM sizes.

| Model | Speedup | Improvement |
| --- | --- | --- |
| LLaMA-2 70B | 1.73× to 1.96× | ⬆️ Near 2× |
| Other LLMs | Similar gains | ⬆️ Consistent |

Speedup Range: 1.73× to 1.96× (nearly 2× faster!)
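For intuition on where such numbers come from, the standard speculative-decoding analysis gives the expected tokens produced per target forward pass as a function of the per-token acceptance rate α and draft length γ. The formula and the example numbers below are illustrative back-of-envelope figures, not measurements from the GSD paper; GSD improves the effective acceptance rate by covering multiple candidate paths.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens per target forward pass, assuming i.i.d.
    per-token acceptance rate 0 <= alpha < 1 and draft length gamma."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# e.g. 80% acceptance with 4 drafted tokens per step:
e = expected_tokens(0.8, 4)
print(round(e, 2))  # ~3.36 tokens per target pass
```

Whether that translates into wall-clock speedup also depends on how cheap the draft model's passes are relative to the target's.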

Architecture Comparison

graph LR
    A[Standard Speculative] --> B[Single Path]
    B --> C[Sequential Validation]
    
    D[GSD] --> E[Multiple Paths]
    E --> F[DAG Merging]
    F --> G[Parallel Validation]
    
    style A fill:#e1f5ff
    style D fill:#fff3cd
    style G fill:#d4edda

Key Advantages

| Advantage | Description | Impact |
| --- | --- | --- |
| Multiple Hypotheses | Generate several paths | ⬆️ Better coverage |
| DAG Merging | Efficient sequence sharing | ⬇️ Redundant computation |
| Cost Reduction | Less draft model compute | ⬇️ Costs |
| Speedup | 1.73× to 1.96× faster | ⬆️ Performance |

Key Takeaways

Retrieve: Graph-structured speculative decoding achieves 1.73× to 1.96× speedups by generating multiple hypotheses and efficiently merging common sequences using a DAG.

Innovate: By leveraging GSD, you can significantly accelerate LLM inference, reducing computational costs while maintaining quality, making large models more accessible.

Curiosity → Retrieve → Innovate: Start with curiosity about inference acceleration, retrieve insights from GSD’s graph-based approach, and innovate by applying it to speed up your LLM applications.

Next Steps:

  • Explore GSD implementation
  • Test on your models
  • Measure speedup gains
  • Deploy in production
This post is licensed under CC BY 4.0 by the author.