DeepSeek V3

DeepSeek · December 2024

Active · Open Weight · Mixture of Experts · Text
Parameters: 671B (37B active)
Context Window: 128K tokens

Why It Matters

Trained for a reported cost of roughly $5.5 million, suggesting that frontier performance doesn't require billions in compute. The result shook the AI industry's assumption that only big tech could compete.
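As a rough check on the headline figure, the sketch below recomputes the cost from the GPU-hour totals given in DeepSeek's technical report and the $2 per H800 GPU hour rental price that report assumes. Both inputs come from that report rather than from this page, and they cover only the final training run, not earlier research, ablations, or hardware purchases.

```python
# Rough cost check. Inputs are publicly reported figures, treated here as assumptions.
GPU_HOURS = {
    "pre_training": 2_664_000,
    "context_extension": 119_000,
    "post_training": 5_000,
}
PRICE_PER_GPU_HOUR_USD = 2.0  # assumed rental price, not a measured cost


def training_cost_usd(gpu_hours, price_per_hour):
    """Total cost is simply GPU hours summed over stages times the hourly price."""
    return sum(gpu_hours.values()) * price_per_hour


total_hours = sum(GPU_HOURS.values())
cost = training_cost_usd(GPU_HOURS, PRICE_PER_GPU_HOUR_USD)
print(f"{total_hours:,} GPU hours -> ${cost / 1e6:.3f}M")
```

The result is sensitive to the assumed hourly price: doubling it doubles the estimate, which is one reason the figure is best read as an order-of-magnitude claim.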

Description

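A 671-billion-parameter mixture-of-experts model (37B active per token) that performed comparably to GPT-4 and Claude 3.5 Sonnet on many benchmarks while reportedly costing roughly $5.5 million to train, a fraction of what competitors are believed to have spent. It used FP8 mixed-precision training (a technique that uses lower-precision numbers to speed up computation with little loss in quality) and multi-token prediction (training the model to predict several future tokens at each position rather than only the next one) to achieve frontier results on a budget.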
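To make the FP8 idea concrete, here is a minimal NumPy simulation, not DeepSeek's actual kernels. It rounds values onto an E4M3-like grid (4 exponent bits, 3 mantissa bits, maximum 448), rescales each row so it fits that range, and does the matrix multiply with the rounded inputs while accumulating in higher precision. The function names and the per-row scaling scheme are assumptions made for illustration. Real training runs use hardware FP8 types and typically finer-grained scaling.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def round_to_e4m3(x):
    """Round values to the nearest point on an E4M3-like grid (simulation only)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    _, exp = np.frexp(x)            # x = m * 2**exp with 0.5 <= |m| < 1
    exp = np.maximum(exp, -5)       # below this, grid spacing stays fixed (subnormals)
    step = np.ldexp(1.0, exp - 4)   # 4 significant bits: 1 implicit + 3 stored
    return np.round(x / step) * step


def quantize_rows(x):
    """Scale each row so its largest magnitude maps to E4M3_MAX, then round."""
    scale = np.abs(x).max(axis=1, keepdims=True) / E4M3_MAX
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero on all-zero rows
    return round_to_e4m3(x / scale), scale


def fp8_matmul(a, b):
    """Multiply low-precision inputs, accumulate in float64, then undo the scaling."""
    qa, sa = quantize_rows(a)        # one scale per row of a
    qb_t, sb = quantize_rows(b.T)    # one scale per column of b
    return (qa @ qb_t.T) * sa * sb.T


rng = np.random.default_rng(0)
a = rng.standard_normal((64, 256))
b = rng.standard_normal((256, 32))
exact = a @ b
approx = fp8_matmul(a, b)
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative error of simulated FP8 matmul: {rel_err:.4f}")
```

Keeping the accumulation and the scale factors in higher precision is what makes this "mixed" precision: only the operands are stored coarsely.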

Notable Milestones

  • Reached GPT-4-level performance on many benchmarks at a fraction of the reported training cost
  • Contributed to sharp stock market reactions among AI chip companies
  • Among the first publicly documented uses of FP8 training at this scale

Benchmark Scores

MMLU (Massive Multitask Language Understanding, 57 subjects): 88.5%
HumanEval (code generation pass@1, Python problems): 82.6%
MATH (competition-level problems): 90.2%
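For readers unfamiliar with the pass@1 metric, the sketch below shows the commonly used unbiased pass@k estimator from the original HumanEval paper: generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k samples would pass. The per-problem counts are made-up illustration data, not DeepSeek V3 results.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples passes, given c of n samples passed."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical outcomes for four problems, ten samples each.
results = [(10, 10), (10, 5), (10, 0), (10, 2)]
print(f"{benchmark_pass_at_k(results, k=1):.3f}")  # prints 0.425
```

With k=1 the estimator reduces to the plain fraction of passing samples per problem, averaged over problems.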

Key Innovations

MoE: Architecture where only a fraction of the model's parameters are active for each input, allowing massive scale with lower compute.
Open Weight: Model weights are publicly released but training data/code may not be. Enables fine-tuning but not full reproduction.
Scaling Laws: Mathematical relationships showing how model performance improves predictably with more data, compute, and parameters.
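The MoE entry above can be illustrated with a generic top-k router. This is a minimal sketch, not DeepSeek V3's specific routing scheme: a small router scores every expert for each token, only the `top_k` highest-scoring experts run, and their outputs are mixed by renormalised gate weights. All names and sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def moe_layer(tokens, router_w, expert_ws, top_k=2):
    """tokens: (n, d); router_w: (d, n_experts); expert_ws: (n_experts, d, d)."""
    scores = softmax(tokens @ router_w)             # (n, n_experts) routing probabilities
    top = np.argsort(-scores, axis=1)[:, :top_k]    # ids of the chosen experts per token
    out = np.zeros_like(tokens)
    for i, token in enumerate(tokens):
        gate = scores[i, top[i]]
        gate = gate / gate.sum()                    # renormalise over the chosen experts
        for g, e in zip(gate, top[i]):
            out[i] += g * (token @ expert_ws[e])    # only top_k experts are evaluated
    return out, top


rng = np.random.default_rng(0)
n_tokens, d, n_experts = 4, 16, 8
tokens = rng.standard_normal((n_tokens, d))
router_w = rng.standard_normal((d, n_experts))
expert_ws = rng.standard_normal((n_experts, d, d))

out, chosen = moe_layer(tokens, router_w, expert_ws, top_k=2)
print(chosen)
print(f"expert weights used per token: {2 / n_experts:.0%}")
```

In this toy setup each token touches 2 of 8 experts. The same principle is what lets a 671B-parameter model activate only 37B parameters (about 5.5%) per token.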

Family Tree

Lineage

DeepSeek V1 → DeepSeek V2 → DeepSeek V3

Related Research (2)

DeepSeek-V2 / MLA (Architecture)
2024 · DeepSeek

Introduced Multi-head Latent Attention (MLA), which compresses the key-value cache into a low-rank latent space, dramatically reducing the memory need…
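The low-rank compression idea can be sketched in a few lines. This is a simplified illustration under assumed sizes, not the published MLA design, which includes further details such as separate handling of positional information. Instead of caching full per-head keys and values, the model caches one small latent vector per token and re-expands it when attention is computed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent = 512, 8, 64, 64  # hypothetical sizes
seq_len = 128

# Down-projection to the latent space and up-projections back to keys and values.
w_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
w_up_k = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
w_up_v = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)

hidden = rng.standard_normal((seq_len, d_model))

latent_cache = hidden @ w_down   # (seq_len, d_latent): the only tensor kept in the cache
keys = latent_cache @ w_up_k     # rebuilt on demand, (seq_len, n_heads * d_head)
values = latent_cache @ w_up_v

full_cache_floats = 2 * seq_len * n_heads * d_head  # keys plus values in a standard cache
latent_cache_floats = latent_cache.size
print(f"standard KV cache: {full_cache_floats} floats")
print(f"latent cache:      {latent_cache_floats} floats")
print(f"reduction:         {full_cache_floats / latent_cache_floats:.0f}x")
```

The saving depends entirely on the chosen latent width. With these toy sizes the cache shrinks 16x, at the price of extra matrix multiplies to rebuild keys and values.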

DeepSeek-R1 (Reasoning)
2025 · DeepSeek AI

Demonstrated that pure RL training (without supervised fine-tuning on reasoning traces) can produce chain-of-thought reasoning, achieving performance …