DeepSeek V3

DeepSeek · December 2024

Active · Open Weight · Mixture of Experts · Text

Parameters: 671B (37B active)

Context Window: 128K tokens

Why It Matters

Trained for a reported $5.5 million in GPU time, suggesting that frontier-level performance does not require billions of dollars in compute. The result challenged the AI industry's assumption that only big tech companies could compete.

Description

A 671-billion-parameter mixture-of-experts model (37B parameters active per token) that performed comparably to GPT-4o and Claude 3.5 Sonnet on many benchmarks, at a reported training cost of about $5.5 million, a fraction of what competitors are believed to have spent. It used FP8 mixed-precision training (running much of the computation in 8-bit floating point to cut memory and compute with little loss in quality) and multi-token prediction (training the model to predict several future tokens at once) to reach frontier results on a budget.
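To make the FP8 idea above concrete, here is a minimal sketch that simulates rounding values to roughly FP8 (E4M3) precision in NumPy. The function name `fake_quantize_e4m3` and the single per-tensor scale are assumptions made for brevity. This is an illustration of low-precision rounding in general, not DeepSeek's training recipe.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in the common FP8 E4M3 format


def fake_quantize_e4m3(x: np.ndarray) -> np.ndarray:
    """Round x to roughly E4M3 precision using one scale for the whole tensor.

    Simplifications (assumptions of this sketch): keeps 4 significant bits
    (1 implicit + 3 mantissa bits) and ignores subnormals and the lower
    exponent limit of the real format.
    """
    amax = np.max(np.abs(x))
    if amax == 0:
        return x.copy()
    scale = E4M3_MAX / amax                   # map the largest value onto the FP8 range
    scaled = x * scale
    mantissa, exponent = np.frexp(scaled)     # scaled == mantissa * 2**exponent
    mantissa = np.round(mantissa * 16) / 16   # keep 4 significant bits
    return np.ldexp(mantissa, exponent) / scale  # back to the original scale
```

With round-to-nearest and 4 significant bits, each nonzero value moves by at most 1/16 of its magnitude. That level of error is tolerable for many matrix multiplications when accumulation is kept in higher precision.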

Notable Milestones

▸ Matched GPT-4-level performance at a fraction of the training cost
▸ Prompted a sharp stock market reaction among AI chip companies
▸ Validated FP8 mixed-precision training on an extremely large model, reportedly for the first time

Benchmark Scores

MMLU: Massive Multitask Language Understanding, 57 subjects

88.5%

HumanEval-Mul: code generation pass@1, multilingual HumanEval variant

82.6%
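For readers unfamiliar with the pass@1 metric, the sketch below implements the standard unbiased pass@k estimator introduced with the original HumanEval benchmark. The function name `pass_at_k` is illustrative, and the card does not state how many samples per problem were drawn for the score above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed the unit tests,
    k: number of attempts the metric allows.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 minus the probability that k samples drawn without replacement all fail
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to c / n, the fraction of generated samples that pass. The benchmark score is the mean of this value over all problems.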

MATH-500: 500-problem subset of the MATH benchmark, competition-level problems

90.2%

Key Innovations

MoE

An architecture in which only a fraction of the model's parameters are active for each input, allowing massive scale with lower compute per token. Here, 37B of 671B parameters (about 5.5%) are active per token.
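The routing idea can be sketched as a generic top-k gated layer. The function `moe_forward`, the softmax gate, and the use of a single weight matrix per expert are all assumptions of this illustration. DeepSeek V3's actual router differs in its details.

```python
import numpy as np


def moe_forward(x, gate_w, expert_ws, top_k=2):
    """Run one token through a toy mixture-of-experts layer.

    x:         (d,) token representation
    gate_w:    (n_experts, d) router weights
    expert_ws: (n_experts, d, d) one weight matrix per expert (hypothetical)
    Returns the layer output and the indices of the experts that ran.
    """
    logits = gate_w @ x                        # one routing score per expert
    chosen = np.argsort(logits)[-top_k:]       # indices of the top_k scores
    weights = np.exp(logits[chosen] - logits[chosen].max())
    weights /= weights.sum()                   # softmax over the chosen experts only
    out = np.zeros_like(x)
    for w, i in zip(weights, chosen):
        out += w * (expert_ws[i] @ x)          # only chosen experts do any work
    return out, chosen
```

Experts that are not selected are never multiplied against the token, which is why total parameter count can grow far faster than per-token compute.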

Open Weight

Model weights are publicly released, but the training data and code may not be. This enables fine-tuning but not full reproduction.

Scaling Laws

Mathematical relationships (typically power laws) showing how model performance improves predictably with more data, compute, and parameters.
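As a rough illustration, a scaling law of the form loss = a * size^(-alpha) becomes a straight line in log-log space, so it can be fitted with ordinary least squares. The function `fit_power_law` and the data points below are made up for this sketch. They are not measurements from any real model.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic points generated from a known law (a = 50, alpha = 0.1).
sizes = np.array([1e7, 1e8, 1e9, 1e10])
losses = 50.0 * sizes ** -0.1

a, alpha = fit_power_law(sizes, losses)

# Extrapolate the fitted curve to a larger, hypothetical model size.
predicted_loss = a * 1e11 ** -alpha
```

The practical value of such a fit is extrapolation: measurements on small models give an estimate for a larger one before the compute is spent.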

Related Research (2)

DeepSeek-V2 / MLA · Architecture

2024 · DeepSeek

Introduced Multi-head Latent Attention (MLA), which compresses the key-value cache into a low-rank latent space, dramatically reducing the memory need…
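The low-rank caching idea described above can be sketched as follows. All dimensions and weight names (`W_down`, `W_up_k`, `W_up_v`) are hypothetical, the weights are random, and details of the real design such as positional encoding are omitted.

```python
import numpy as np

d_model, d_latent, seq_len = 64, 8, 16   # hypothetical sizes for illustration
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_latent, d_model))   # compress hidden state
W_up_k = rng.standard_normal((d_model, d_latent))   # rebuild keys from latent
W_up_v = rng.standard_normal((d_model, d_latent))   # rebuild values from latent

hidden = rng.standard_normal((seq_len, d_model))    # one row per cached token

# Only the small latent vectors are stored between decoding steps.
latent_cache = hidden @ W_down.T                    # (seq_len, d_latent)

# Keys and values are reconstructed from the latent when attention needs them.
keys = latent_cache @ W_up_k.T                      # (seq_len, d_model)
values = latent_cache @ W_up_v.T                    # (seq_len, d_model)

full_cache_floats = 2 * seq_len * d_model           # separate K and V caches
latent_cache_floats = latent_cache.size             # one shared latent cache
```

With these toy sizes, the latent cache holds 128 numbers where separate key and value caches would hold 2,048, a 16x reduction that is paid for with extra matrix multiplications at read time.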

DeepSeek-R1 · Reasoning

2025 · DeepSeek AI

Demonstrated that pure RL training (without supervised fine-tuning on reasoning traces) can produce chain-of-thought reasoning, achieving performance …