DeepSeek V2

DeepSeek · May 2024

● Active · Open Weight · Mixture of Experts · Text

Parameters: 236B (21B active)

Context Window: 128K tokens

Why It Matters

Introduced Multi-head Latent Attention, a novel efficiency technique that became foundational to DeepSeek's cost advantage and was later adopted by other models.

Description

A 236-billion-parameter model that activates only 21 billion parameters per token, thanks to its mixture-of-experts architecture. It introduced Multi-head Latent Attention (MLA), a technique that compresses the key-value cache (the model's stored memory of previous text), substantially reducing the memory cost of processing long conversations. API access was offered at a fraction of competitors' prices.
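The two efficiency ideas in this description can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the parameter counts come from this page, but the layer count, head count, head size, and latent size are hypothetical values chosen for the example, not DeepSeek V2's actual configuration. It also omits the finer details of real MLA.

```python
# Parameter counts stated on this page (in billions).
TOTAL_PARAMS_B = 236
ACTIVE_PARAMS_B = 21

# Fraction of the model's parameters used for any single token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B


def standard_kv_cache_values(tokens, layers, heads, head_dim):
    """Numbers cached by ordinary multi-head attention:
    one key vector and one value vector per head, per layer, per token."""
    return tokens * layers * heads * head_dim * 2


def latent_kv_cache_values(tokens, layers, latent_dim):
    """Numbers cached when each token instead stores a single
    compressed latent vector per layer (the low-rank idea behind MLA)."""
    return tokens * layers * latent_dim


# Hypothetical dimensions, for illustration only.
tokens, layers, heads, head_dim, latent_dim = 128_000, 32, 32, 128, 512

standard = standard_kv_cache_values(tokens, layers, heads, head_dim)
latent = latent_kv_cache_values(tokens, layers, latent_dim)
reduction = standard / latent

print(f"active fraction: {active_fraction:.1%}")
print(f"cache reduction with these assumed sizes: {reduction:.0f}x")
```

Both cache sizes grow linearly with the number of tokens, so the ratio between them depends only on the per-token sizes. That is why compressing the cache matters most at long context lengths, where the absolute saving is largest.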

Notable Milestones

▸ Offered API pricing 10-20x cheaper than GPT-4
▸ Pioneered MLA attention mechanism adopted by later models

Key Innovations

MoE

Architecture where only a fraction of the model's parameters are active for each input, allowing massive scale with lower compute.
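As a rough illustration of how "only a fraction of the parameters are active" can work, here is a minimal sketch of generic top-k expert routing. It is not DeepSeek's routing scheme. The toy experts, the router scores, and the function names `route_top_k` and `moe_forward` are all hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_scores, k=2):
    """Run only the selected experts and combine their outputs by weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))


# Hypothetical toy experts: each one simply scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_scores = [0.1, 2.0, 0.3, 1.0]  # made-up scores for one token

selected = route_top_k(router_scores, k=2)
output = moe_forward(1.0, experts, router_scores, k=2)
print(selected, output)
```

With `k=2` and four experts, only half of the expert functions are evaluated for this input. In a real model, the router scores are produced by a learned layer and each expert is a feed-forward network.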

Open Weight

Model weights are publicly released but training data/code may not be. Enables fine-tuning but not full reproduction.

Long Context

Ability to process very long inputs (100K+ tokens), enabling analysis of entire codebases or books.

Attention

The core innovation of Transformers: each token can 'attend to' every other token to capture the relationships between them.
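The idea of each token attending to every other token can be sketched with plain scaled dot-product attention. This is a minimal single-head version in pure Python, with no masking and no learned projections. The toy vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors,
    weighted by how strongly the query matches each key."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors, one coordinate at a time.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Three toy token vectors attending to one another (self-attention).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(tokens, tokens, tokens)

# When all keys are identical, every value gets equal weight.
uniform = attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]])
print(out, uniform)
```

Every query is compared against every key, so the work grows with the square of the sequence length. The keys and values of earlier tokens are the data that a key-value cache stores during generation.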

Family Tree

Built On

DeepSeek V1

Lineage

DeepSeek V1 → DeepSeek V2

Successors (2)

DeepSeek V3, DeepSeek-Coder-V2

Related Research (3)

Switch Transformers · Scaling

2021 · Google

Simplified MoE routing to scale to trillions of parameters efficiently. Influenced Mixtral and GPT-4/5 MoE architectures.

GShard · Architecture

2020 · Google

Scaled Mixture-of-Experts to 600 billion parameters with automatic model parallelism across thousands of TPUs, showing how to train models far beyond …

DeepSeek-V2 / MLA · Architecture

2024 · DeepSeek

Introduced Multi-head Latent Attention (MLA), which compresses the key-value cache into a low-rank latent space, dramatically reducing the memory need…

External Links

Research Paper

More from DeepSeek

DeepSeek V1 · 2024-01 · 67B

DeepSeek V3 · 2024-12 · 671B (37B active)

DeepSeek R1 · 2025-01 · 671B (37B active)

DeepSeek V4 Pro · 2026-04 · 1.6T

DeepSeek-Coder · 2023-11 · 1.3B - 33B

DeepSeek-Coder-V2 · 2024-06 · 236B (21B active)
