DeepSeek V2
DeepSeek · May 2024
Why It Matters
Introduced Multi-head Latent Attention, a novel efficiency technique that became foundational to DeepSeek's cost advantage and was later adopted by other models.
Description
A 236-billion-parameter model that activates only 21 billion parameters per token, thanks to its mixture-of-experts architecture. Introduced Multi-head Latent Attention (MLA), a technique that compresses the model's memory of previous text (the key-value cache), sharply reducing the cost of processing long conversations. Offered API access at a fraction of competitors' prices.
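The sparse activation described above can be illustrated with a minimal sketch of generic top-k expert routing. This is not DeepSeek's actual router: the function names, the toy scalar "experts", and the gate scores are all hypothetical, chosen only to show how a token ends up using a few experts while the rest stay idle.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[float], float]  # toy expert: maps a scalar to a scalar


def top_k_route(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and return (index, weight) pairs."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the chosen logits only, so the k weights sum to 1.
    peak = max(gate_logits[i] for i in chosen)
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x: float, experts: Sequence[Expert],
                gate_logits: Sequence[float], k: int) -> float:
    """Run only the selected experts and blend their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))


# Hypothetical setup: four experts that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, -1.0, 2.0]  # assumed router scores for one token

output = moe_forward(1.0, experts, gate_logits, k=2)  # only experts 1 and 3 run

# The same idea at scale: 21B active out of 236B total parameters.
active_fraction = 21 / 236
```

In this toy run, two of the four experts are evaluated and the other two are skipped entirely, which is why the per-token compute tracks the active parameter count rather than the total.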
Notable Milestones
- Offered API pricing 10-20x cheaper than GPT-4
- Pioneered the MLA mechanism, later adopted by other models
Related Research (3)
- Simplified MoE routing to scale to trillions of parameters efficiently. Influenced Mixtral and, reportedly, the MoE architectures of GPT-4/5.
- Scaled Mixture-of-Experts to 600 billion parameters with automatic model parallelism across thousands of TPUs, showing how to train models far beyond …
- Introduced Multi-head Latent Attention (MLA), which compresses the key-value cache into a low-rank latent space, dramatically reducing the memory need…
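To give a rough sense of why caching a low-rank latent saves memory, the arithmetic can be sketched as below. The layer count, head count, head size, and latent size are illustrative assumptions rather than figures taken from this page, and the helper names are hypothetical; the sketch counts cached values only and ignores any extra per-token state a real implementation might keep.

```python
def mha_cache_elements(n_layers: int, n_heads: int, head_dim: int) -> int:
    """Values cached per token by standard multi-head attention.

    Each layer stores one key vector and one value vector for every head.
    """
    return n_layers * 2 * n_heads * head_dim


def latent_cache_elements(n_layers: int, latent_dim: int) -> int:
    """Values cached per token when each layer stores one compressed latent.

    Keys and values are re-derived from the latent at attention time.
    """
    return n_layers * latent_dim


# Assumed, illustrative dimensions (not stated in the text above).
N_LAYERS, N_HEADS, HEAD_DIM, LATENT_DIM = 60, 128, 128, 512

full = mha_cache_elements(N_LAYERS, N_HEADS, HEAD_DIM)
compressed = latent_cache_elements(N_LAYERS, LATENT_DIM)
reduction = full / compressed  # how many times smaller the latent cache is

# Cache size in bytes for a hypothetical 32,768-token context at 2 bytes per value.
BYTES_PER_VALUE, CONTEXT_TOKENS = 2, 32_768
full_bytes = full * BYTES_PER_VALUE * CONTEXT_TOKENS
compressed_bytes = compressed * BYTES_PER_VALUE * CONTEXT_TOKENS
```

Under these assumed dimensions the latent cache holds 64 times fewer values per token than the full key-value cache; the exact saving in any real model depends on its actual configuration.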