DeepSeek V2

DeepSeek · May 2024

Active · Open Weight · Mixture of Experts · Text
Parameters: 236B (21B active)
Context Window: 128K tokens

Why It Matters

Introduced Multi-head Latent Attention, a novel efficiency technique that became foundational to DeepSeek's cost advantage and was later adopted by other models.

Description

A 236-billion-parameter model that activates only 21 billion parameters per token, thanks to its mixture-of-experts architecture. It introduced Multi-head Latent Attention (MLA), a technique that compresses the key-value cache (the model's stored representation of previous text) into a smaller latent form, substantially reducing the memory cost of processing long conversations. DeepSeek offered API access at a fraction of competitors' prices.
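As a rough illustration of the two efficiency claims above, the arithmetic can be sketched in a few lines of Python. The parameter counts come from this page; the layer count, head count, head size, and latent size are hypothetical values chosen only to make the comparison concrete, not DeepSeek V2's published configuration.

```python
# Illustrative arithmetic only. The attention dimensions below are
# hypothetical and are NOT DeepSeek V2's published configuration.

TOTAL_PARAMS_B = 236   # total parameters in billions, from the card above
ACTIVE_PARAMS_B = 21   # parameters active per token in billions, from the card above


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters used for each token."""
    return active_b / total_b


def kv_cache_values_per_token(n_layers: int, n_heads: int, head_dim: int) -> int:
    """Standard multi-head attention caches one key and one value vector
    per head, per layer, for every token."""
    return 2 * n_layers * n_heads * head_dim


def latent_cache_values_per_token(n_layers: int, latent_dim: int) -> int:
    """A latent-compressed cache stores one small vector per layer instead."""
    return n_layers * latent_dim


# Hypothetical model shape used only for this comparison.
N_LAYERS, N_HEADS, HEAD_DIM, LATENT_DIM = 32, 32, 128, 512

full = kv_cache_values_per_token(N_LAYERS, N_HEADS, HEAD_DIM)
latent = latent_cache_values_per_token(N_LAYERS, LATENT_DIM)

print(f"active share of parameters: {active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
print(f"cached values per token: {full} (standard) vs {latent} (latent)")
print(f"reduction factor: {full // latent}x")
```

With these assumed dimensions the latent cache is 16 times smaller per token; the real saving depends on the actual model shape.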

Notable Milestones

  • Offered API pricing 10-20x cheaper than GPT-4
  • Pioneered the MLA mechanism, later adopted by other models

Key Innovations

MoE: Architecture where only a fraction of the model's parameters are active for each input, allowing massive scale with lower compute.
Open Weight: Model weights are publicly released, but training data and code may not be. Enables fine-tuning but not full reproduction.
Long Context: Ability to process very long inputs (100K+ tokens), enabling analysis of entire codebases or books.
Attention: The core innovation of Transformers, allowing each token to "attend to" every other token to capture relationships.
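The MoE idea defined above, where only some parameters run for each input, can be sketched with generic top-k gating. This is a minimal toy in plain Python, not DeepSeek V2's actual router, which this page does not describe; the function names, the scalar "experts", and the router scores are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights
    so that they sum to 1. Returns a list of (expert_index, weight)."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_scores, k=2):
    """Weighted sum of the selected experts' outputs.
    Experts that are not selected are never called."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))


def make_expert(scale):
    """Toy expert: a scalar function standing in for a feed-forward block."""
    return lambda x: scale * x


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 2.0, -1.0, 2.0]   # hypothetical router outputs for one token

selected = route_top_k(scores, k=2)
y = moe_forward(1.0, experts, scores, k=2)
print("selected experts and weights:", selected)
print("combined output:", y)
```

Here only two of the four toy experts run for the token, which is the source of the compute saving: total capacity grows with the number of experts, while per-token cost grows only with k.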

Family Tree

Built On

Lineage

DeepSeek V1 → DeepSeek V2

Related Research (3)

2021 · Google

Simplified MoE routing to scale to trillions of parameters efficiently. Influenced Mixtral and GPT-4/5 MoE architectures.

GShard · Architecture
2020 · Google

Scaled Mixture-of-Experts to 600 billion parameters with automatic model parallelism across thousands of TPUs, showing how to train models far beyond …

DeepSeek-V2 / MLA · Architecture
2024 · DeepSeek

Introduced Multi-head Latent Attention (MLA), which compresses the key-value cache into a low-rank latent space, dramatically reducing the memory need…
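The low-rank compression described in the MLA entry can be sketched as follows: cache a small latent vector per token, and expand it to keys and values only when attention is computed. This is a simplified single-head toy in plain Python under assumed dimensions; it omits details such as positional encoding, and the matrix names (`W_down`, `W_up_k`, `W_up_v`) are hypothetical labels, not identifiers from DeepSeek's code.

```python
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Random weights standing in for learned projections."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: the latent is much smaller than key + value together.
D_MODEL, D_LATENT, D_HEAD = 16, 4, 8

rng = random.Random(0)
W_down = random_matrix(D_LATENT, D_MODEL, rng)  # hidden state -> latent
W_up_k = random_matrix(D_HEAD, D_LATENT, rng)   # latent -> key
W_up_v = random_matrix(D_HEAD, D_LATENT, rng)   # latent -> value

cache = []  # stores only the compressed latents, one per token


def append_token(hidden):
    """Compress the token's hidden state and cache the latent."""
    cache.append(matvec(W_down, hidden))


def keys_and_values():
    """Reconstruct keys and values from the cached latents on demand."""
    keys = [matvec(W_up_k, c) for c in cache]
    values = [matvec(W_up_v, c) for c in cache]
    return keys, values


for _ in range(3):
    append_token([rng.uniform(-1, 1) for _ in range(D_MODEL)])

keys, values = keys_and_values()
cached_floats = sum(len(c) for c in cache)
uncompressed_floats = sum(len(k) + len(v) for k, v in zip(keys, values))
print(f"floats cached: {cached_floats}, floats if keys and values were cached: {uncompressed_floats}")
```

The trade-off in this sketch is extra matrix multiplications at attention time in exchange for a smaller cache, which matters most when the context is long.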
