
Transformers are SSMs: State Space Duality Framework

Curiosity: Are Transformers and State Space Models fundamentally different? What happens when we discover their deep theoretical connections?

This paper shows that Transformers and State Space Models (SSMs) are closely related through the well-studied class of structured semiseparable matrices. The resulting State Space Duality (SSD) framework enables Mamba-2, whose core layer is 2-8× faster than Mamba's selective SSM while remaining competitive with Transformers on language modeling.

Resources:

The Discovery

Retrieve: Transformers and SSMs are more related than previously thought.

Key Finding: These model families are closely related through the following (a short code sketch follows the list):

  • Structured semiseparable matrices
  • Various decomposition methods
  • Theoretical connections between SSMs and attention variants
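To make the semiseparable-matrix point concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper): a scalar-decay SSM is evaluated two ways, once as a linear-time recurrence (the SSM view) and once by materializing the lower-triangular semiseparable matrix it defines and multiplying (the quadratic, attention-like view). The names `a`, `B`, `C`, `x` loosely follow standard SSM notation and are my choice.

```python
# Minimal sketch (not the paper's code): one scalar-decay SSM, computed two ways.
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                        # sequence length, state dimension
a = rng.uniform(0.5, 1.0, size=T)  # per-step scalar decay a_t
B = rng.standard_normal((T, N))    # input projections B_t
C = rng.standard_normal((T, N))    # output projections C_t
x = rng.standard_normal(T)         # scalar input sequence

# (1) Recurrent / SSM view, linear time: h_t = a_t h_{t-1} + B_t x_t,  y_t = C_t . h_t
h = np.zeros(N)
y_recurrent = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_recurrent[t] = C[t] @ h

# (2) Matrix / attention-like view, quadratic: y = M x with a lower-triangular
#     semiseparable M, where M[j, i] = (a_{i+1} * ... * a_j) * (C_j . B_i) for i <= j.
M = np.zeros((T, T))
for j in range(T):
    for i in range(j + 1):
        decay = np.prod(a[i + 1 : j + 1])   # empty product = 1 when i == j
        M[j, i] = decay * (C[j] @ B[i])
y_matrix = M @ x

print(np.allclose(y_recurrent, y_matrix))   # True: same operator, two computation orders
```

The passing `allclose` check is the duality in miniature: one semiseparable operator, two evaluation orders, one linear in sequence length and one quadratic but matmul-friendly.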

Performance Context

Retrieve: SSMs like Mamba have shown strong performance.

| Model Type | Performance | Scale |
| --- | --- | --- |
| Transformers | Main architecture | All scales |
| SSMs (Mamba) | Match or outperform Transformers | Small-to-medium scale |

Question: How are they related?

State Space Duality Framework

Innovate: SSD framework reveals deep connections.

```mermaid
graph TB
    A[Structured Semiseparable Matrices] --> B[SSMs]
    A --> C[Attention Variants]
    B --> D[State Space Duality]
    C --> D
    D --> E[Mamba-2]
    E --> F[2-8× Faster]
    E --> G[Competitive with Transformers]

    style A fill:#e1f5ff
    style D fill:#fff3cd
    style E fill:#d4edda
```

Mamba-2 Architecture

Retrieve: Mamba-2 improvements enabled by SSD framework.

Core Layer: Refinement of Mamba's selective SSM

Improvements (a chunked-computation sketch follows this list):

  • Core layer is 2-8× faster than Mamba's selective SSM
  • Remains competitive with Transformers on language modeling
  • Better theoretical grounding via the SSD framework
  • Fits into a unified framework spanning SSMs and attention variants
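A rough sketch of where the speedup comes from, under simplifying assumptions I am making for clarity (scalar inputs, scalar per-step decay, a chunk size that divides the sequence length): split the sequence into chunks, handle each chunk with a small dense semiseparable block that maps to matrix multiplications, and carry only a compact state between chunks. This is a toy illustration of the chunkwise idea, not the paper's exact SSD algorithm or Mamba-2's kernels; the function name `chunked_ssd` is mine.

```python
# Toy chunkwise evaluation (my simplification, not the official SSD algorithm):
# dense matmul-friendly work inside each chunk, a small carried state between chunks.
import numpy as np

rng = np.random.default_rng(1)
T, N, chunk = 8, 4, 4              # sequence length, state dim, chunk size (divides T here)
a = rng.uniform(0.5, 1.0, size=T)
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
x = rng.standard_normal(T)

def chunked_ssd(a, B, C, x, chunk):
    T, N = B.shape
    y = np.empty(T)
    h = np.zeros(N)                              # state carried across chunk boundaries
    for s in range(0, T, chunk):
        e = min(s + chunk, T)
        ac, Bc, Cc, xc = a[s:e], B[s:e], C[s:e], x[s:e]
        L = e - s
        pref = np.cumprod(ac)                    # decay from chunk start up to each position
        # (a) contribution of the incoming state to every position in the chunk
        y_inter = (Cc @ h) * pref
        # (b) intra-chunk term: a small dense semiseparable block (matmul territory)
        M = np.zeros((L, L))
        for j in range(L):
            for i in range(j + 1):
                M[j, i] = np.prod(ac[i + 1 : j + 1]) * (Cc[j] @ Bc[i])
        y[s:e] = y_inter + M @ xc
        # (c) push the state to the end of the chunk for the next iteration
        h = pref[-1] * h + sum(np.prod(ac[i + 1 : L]) * Bc[i] * xc[i] for i in range(L))
    return y

# Reference: plain step-by-step recurrence over the whole sequence
h_ref, y_ref = np.zeros(N), np.empty(T)
for t in range(T):
    h_ref = a[t] * h_ref + B[t] * x[t]
    y_ref[t] = C[t] @ h_ref

print(np.allclose(chunked_ssd(a, B, C, x, chunk), y_ref))   # True
```

The intent, per the paper's framing, is that the intra-chunk work becomes batched matrix multiplications that map well onto GPU tensor cores, rather than a purely sequential scan over the whole sequence.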

Theoretical Connections

Innovate: Rich framework connecting SSMs and attention.

Connections Through:

  • Structured semiseparable matrices
  • Various decomposition methods
  • Attention variants
  • State space representations

Impact: A unified understanding of both architectures; one concrete instance of the connection is sketched below.
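As one concrete instance of the SSM-attention connection above: causal linear attention (no softmax) is the no-decay special case of the semiseparable matrix from the earlier sketch, so it too has both a quadratic masked-matmul form and a linear recurrent form. Again a hedged toy illustration with names of my choosing, not code from the paper.

```python
# Toy check: masked linear attention (no softmax) equals a cumulative-state recurrence,
# i.e. the a_t = 1 special case of the semiseparable operator sketched earlier.
import numpy as np

rng = np.random.default_rng(2)
T, N = 6, 4
Q = rng.standard_normal((T, N))   # queries (playing the role of C_t)
K = rng.standard_normal((T, N))   # keys    (playing the role of B_t)
v = rng.standard_normal(T)        # scalar values, to keep shapes simple

# Quadratic / attention form: causal mask applied to Q K^T, no softmax
y_attn = np.tril(Q @ K.T) @ v

# Linear / recurrent form: S_t = S_{t-1} + K_t v_t,  y_t = Q_t . S_t
S = np.zeros(N)
y_rec = np.empty(T)
for t in range(T):
    S = S + K[t] * v[t]
    y_rec[t] = Q[t] @ S

print(np.allclose(y_attn, y_rec))   # True: same operator viewed from both sides
```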

Key Takeaways

Retrieve: The State Space Duality framework reveals that Transformers and SSMs are closely related through structured semiseparable matrices, enabling better architectures like Mamba-2.

Innovate: By understanding the theoretical connections between SSMs and attention, we can design more efficient architectures that combine the best of both worlds: Mamba-2's core layer achieves a 2-8× speedup while maintaining competitive performance.

Curiosity → Retrieve → Innovation: Start with curiosity about model architectures, retrieve insights from the SSD framework, and innovate by applying these theoretical connections to design better models.

Next Steps:

  • Read the full paper
  • Understand SSD framework
  • Explore Mamba-2
  • Apply to your models

🧙 Paper Authors: Tri Dao* (Department of Computer Science, Princeton University) and Albert Gu* (Machine Learning Department, Carnegie Mellon University)

State Space Models are semiseparable matrix transformers.

Paper Abstract

Generalized Models and Efficient Algorithms Through Structured State Space Duality

Transformers have been the main architecture behind deep learning's success in language modeling, but state space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these model families are in fact very closely related, and we develop a rich framework of theoretical connections between SSMs and attention variants, connected through various decompositions of a well-studied class of structured semiseparable matrices. The state space duality (SSD) framework lets us design a new architecture (Mamba-2) whose core layer refines Mamba's selective SSM and is 2-8× faster, while remaining competitive with Transformers on language modeling.

This post is licensed under CC BY 4.0 by the author.